MindIE Qwen2.5 Inference Adaptation
Although MindIE has not yet announced official support for Qwen2.5, Qwen2.5 and Qwen2 share the same model architecture, so in my understanding Qwen2.5 can be migrated and deployed directly following the Qwen2 procedure. A quick way to confirm this is to compare the architectures field of the two models' config.json, as sketched below.
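A minimal sanity check (not part of the original adaptation flow): a Qwen2.5 checkpoint declares the same `architectures` entry as Qwen2, namely `Qwen2ForCausalLM`, in its Hugging Face style config.json. The weight path below is this guide's placeholder.

```python
# Minimal sketch: verify a Qwen2.5 checkpoint uses the Qwen2 architecture.
# Assumes the weight directory contains a Hugging Face style config.json.
import json

weight_path = "/path/to/Qwen2p5-72B-Instruct"  # placeholder from this guide

with open(f"{weight_path}/config.json") as f:
    cfg = json.load(f)

# Qwen2.5 checkpoints declare "Qwen2ForCausalLM", same as Qwen2,
# which is why the Qwen2 adaptation path applies unchanged.
print(cfg["architectures"])  # expected: ["Qwen2ForCausalLM"]
assert "Qwen2ForCausalLM" in cfg["architectures"]
```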
1. Preparation
Model weights
/path/to/Qwen2p5-72B-Instruct/
Runtime environment
This walkthrough was validated with the MindIE T65 release.
If you only need to start the inference service, skip directly to section 2.3.
2. Adaptation Validation
Start the container
docker run --rm -it -u root --name=mindie_t65 --net=host --privileged=true -w /opt \
  --device=/dev/davinci_manager --device=/dev/devmm_svm --device=/dev/hisi_hdc \
  -v /etc/ascend_install.info:/etc/ascend_install.info \
  -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
  -v /usr/local/Ascend/add-ons/:/usr/local/Ascend/add-ons/ \
  -v /usr/local/sbin/:/usr/local/sbin/ \
  -v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi \
  -v /var/log/npu/slog/:/var/log/npu/slog \
  -v /var/log/npu/profiling/:/var/log/npu/profiling \
  -v /var/log/npu/dump/:/var/log/npu/dump \
  -v /var/log/npu/:/usr/slog \
  -v /etc/hccn.conf:/etc/hccn.conf \
  -v /etc/localtime:/etc/localtime:ro \
  -v /host_model_path/:/opt/files \
  -v /tmp:/tmp \
  mindie:t65 /bin/bash
2.1 Inference Acceleration Framework Validation
ATB inference
The atb_models code path varies with the image version; in this image it is /usr/local/Ascend/atb_models/.
Run the following command in the ${llm_path} directory:
bash examples/models/qwen/run_pa.sh -m ${weight_path}
Notes:
1. To run quantized inference, add (or modify) the quantize field in the config.json under the weight path, setting it to the desired quantization mode, e.g. "quantize": "w8a8" or "quantize": "w8a16" (a sketch of this edit follows these notes).
2. Chat models must run in chat mode to produce correct output.
Run:
bash examples/models/qwen/run_pa.sh -m ${weight_path} -c true
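The config.json edit from note 1, as a minimal hedged sketch (the path is this guide's placeholder and "w8a8" is just one of the supported values):

```python
# Minimal sketch: set the quantize field that quantized inference reads
# from config.json. The path and the chosen mode are placeholders.
import json

config_path = "/path/to/Qwen2p5-72B-Instruct/config.json"

with open(config_path) as f:
    cfg = json.load(f)

cfg["quantize"] = "w8a8"  # or "w8a16", matching your quantized weights

with open(config_path, "w") as f:
    json.dump(cfg, f, indent=2, ensure_ascii=False)
```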
run_pa.py script parameters
- Script: ${llm_path}/examples/run_pa.py
- Purpose: launch script for model inference in the Paged Attention scenario
- Parameters:
| Parameter | Required | Type | Default | Description |
| --- | --- | --- | --- | --- |
| model_path | Yes | string | | Model weight path |
| input_texts | No | string | ["What's deep learning?"] | Inference text; separate multiple texts with spaces |
| input_ids | No | string | None | Token ids produced by the model tokenizer; separate multiple requests with spaces and the tokens within one request with commas |
| input_file | No | jsonl file | None | File containing multi-turn dialogue text. Only jsonl is supported; each line must be a List[Dict] of chronologically ordered dialogue turns, and each Dict must contain at least the "role" and "content" fields (see the sketch below the table) |
| input_dict | No | string | None | Inference text together with its adapter name, e.g. '[{"prompt": "A robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts in total does it take?", "adapter": "adapter1"}, {"prompt": "What is deep learning?", "adapter": "base"}]' |
| max_prefill_batch_size | No | int | None | Maximum prefill batch size |
| max_batch_size | No | int | 1 | Maximum batch size |
| max_input_length | No | int | 1024 | Maximum number of input tokens |
| max_output_length | No | int | 20 | Maximum number of output tokens |
| max_position_embeddings | No | int or None | None | Maximum context length the model accepts; if None, it is read from the model weight files |
| max_prefill_tokens | No | int | -1 | Maximum number of tokens accepted in the prefill phase; if -1, max_prefill_tokens = max_batch_size * (max_input_length + max_output_length) |
| block_size | No | int | 128 | KV cache is stored in blocks; maximum number of tokens per block |
| chat_template | No | string | None | Prompt template for chat models |
| ignore_eos | No | bool | store_true | Whether to stop when an eos (end-of-sentence) token is generated; if this flag is passed, eos tokens are ignored |
| is_chat_model | No | bool | store_true | Whether to run in chat mode; if this flag is passed, chat mode is enabled |
| is_flash_model | No | bool | store_false | Whether to run Paged Attention; Paged Attention runs by default, and passing this flag switches to Flash Attention |
| is_embedding_model | No | bool | store_true | Whether this is an embedding model; causal-LM inference by default, and passing this flag switches to embedding mode |
| load_tokenizer | No | string | "True" | Whether to load the tokenizer; if False, input_ids must be provided and the output is token ids |
| enable_atb_torch | No | bool | store_true | Whether to build the graph in Python; C++ graph building is used by default, and passing this flag switches to Python |
| kw_args | No | string | "" | Extension parameters for user-defined feature extensions |
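The input_file format from the table, as a minimal sketch: each line of the jsonl file is one conversation serialized as a List[Dict] in chronological order, each dict carrying at least "role" and "content" (the file name and texts here are arbitrary):

```python
# Minimal sketch: write a jsonl file for run_pa.py's input_file parameter.
# Each line is one conversation: a list of dicts ordered in time, each with
# at least "role" and "content".
import json

conversations = [
    [
        {"role": "user", "content": "What's deep learning?"},
        {"role": "assistant", "content": "A branch of machine learning based on neural networks."},
        {"role": "user", "content": "Give one application."},
    ],
    [
        {"role": "user", "content": "Hello."},
    ],
]

with open("dialogues.jsonl", "w") as f:
    for conv in conversations:
        f.write(json.dumps(conv, ensure_ascii=False) + "\n")
```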
2.2 Accuracy/Performance Testing
Script path: atb_models/tests/modeltest/
Example:
bash run.sh pa_fp16 full_BoolQ 1 qwen ${Qwen2.5-72B_weight_path} 8
For details, see atb_models/tests/modeltest/readme.md.
2.3 Inference Service Framework Validation
Deploy with the mindie-service inference serving framework.
Note: the service configuration file changes significantly between MindIE versions!
Modify the service configuration file
Path: $MIES_INSTALL_PATH/conf/config.json
Modified configuration:
{
"Version": "1.0.0",
"LogConfig" :
{
"logLevel" : "Info",
"logFileSize" : 20,
"logFileNum" : 20,
"logPath" : "logs/mindservice.log"
},
"ServerConfig" :
{
"ipAddress" : "127.0.0.1",
"managementIpAddress": "127.0.0.1",
"port" : 31003,
"managementPort" : 31003,
"maxLinkNum" : 200,
"httpsEnabled" : false,
"fullTextEnabled" : false,
"maxHeaderLen" : 512,
"tlsCaPath" : "security/ca/",
"tlsCaFile" : ["ca.pem"],
"tlsCert" : "security/certs/server.pem",
"tlsPk" : "security/keys/server.key.pem",
"tlsPkPwd" : "security/pass/mindie_server_key_pwd.txt",
"kmcKsfMaster" : "tools/pmt/master/ksfa",
"kmcKsfStandby" : "tools/pmt/standby/ksfb",
"tlsCrl" : "security/certs/server_crl.pem"
},
"BackendConfig": {
"backendName" : "mindieservice_llm_engine",
"tokenizerProcessNumber" : 8,
"multiNodesInferEnabled": false,
"multiNodesInferPort": 1120,
"interNodeTLSEnabled": true,
"interNodeTlsCaFile": "security/ca/ca.pem",
"interNodeTlsCert": "security/certs/server.pem",
"interNodeTlsPk": "security/keys/server.key.pem",
"interNodeTlsPkPwd": "security/pass/mindie_server_key_pwd.txt",
"interNodeKmcKsfMaster": "tools/pmt/master/ksfa",
"interNodeKmcKsfStandby": "tools/pmt/standby/ksfb",
"ModelDeployConfig":
{
"modelInstanceNumber" : 1,
"maxSeqLen" : 8192,
"maxInputTokenLen" : 4096,
"truncation" : false,
"npuDeviceIds" : [[0,1,2,3]],
"ModelConfig" : [
{
"modelInstanceType": "Standard",
"modelName" : "qwen2p5_72b_baseline",
"modelWeightPath" : "/app/appdata/",
"worldSize" : 4,
"cpuMemSize" : 5,
"npuMemSize" : 16,
"backendType": "atb"
}
]
},
"ScheduleConfig":
{
"templateType": "Standard",
"templateName" : "Standard_llama",
"cacheBlockSize" : 128,
"maxPrefillBatchSize" : 24,
"maxPrefillTokens" : 8192,
"prefillTimeMsPerReq" : 60,
"prefillPolicyType" : 0,
"decodeTimeMsPerReq" : 60,
"decodePolicyType" : 0,
"maxBatchSize" : 24,
"maxIterTimes" : 2048,
"maxPreemptCount" : 0,
"supportSelectBatch" : false,
"maxQueueDelayMicroseconds" : 5000
}
}
}
For the meaning of each parameter, see:
https://www.hiascend.com/document/detail/zh/mindie/10RC2/mindieservice/servicedev/mindie_service0004.html
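Before starting the daemon, a quick consistency check can catch the most common mistake: worldSize not matching the number of NPU ids assigned to the model instance. This is a hedged sketch, not an official MindIE validator; the worldSize-equals-device-count rule reflects how tensor-parallel deployment is normally configured:

```python
# Minimal sketch: sanity-check the MindIE service config before launch.
# Assumes $MIES_INSTALL_PATH is set and the config layout shown above;
# this is not an official validator.
import json, os

path = os.path.expandvars("$MIES_INSTALL_PATH/conf/config.json")

with open(path) as f:
    cfg = json.load(f)

deploy = cfg["BackendConfig"]["ModelDeployConfig"]
for devices, model in zip(deploy["npuDeviceIds"], deploy["ModelConfig"]):
    # Each model instance is tensor-parallel across its device group, so
    # worldSize should equal the number of NPU ids assigned to it (assumption).
    assert model["worldSize"] == len(devices), (
        f'{model["modelName"]}: worldSize {model["worldSize"]} '
        f"!= {len(devices)} devices"
    )
    assert os.path.isdir(model["modelWeightPath"]), model["modelWeightPath"]
print("config looks consistent")
```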
Start the service
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/mindie/set_env.sh
source /usr/local/Ascend/mindie/latest/mindie-service/set_env.sh
source /usr/local/Ascend/mindie/latest/mindie-llm/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/atb_models/set_env.sh
export HOST_IP=1.1.1.1  # replace with the host's actual IP
export MIES_CONTAINER_MANAGEMENT_IP=$HOST_IP
export MIES_CONTAINER_IP=$HOST_IP
cd /usr/local/Ascend/mindie/latest/mindie-service/bin
./mindieservice_daemon
Test the service (use the ipAddress and port from ServerConfig):
curl --location 'http://ip:port/v1/chat/completions' --header 'Content-Type: application/json' --data '{"model": "qwen2p5_72b_baseline", "messages": [{"role": "user", "content": "Hello."}], "max_tokens": 2}'
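The same check from Python, as a minimal sketch assuming the requests package and the ServerConfig values above (127.0.0.1:31003); the route is the OpenAI-style chat completions endpoint shown in the curl command:

```python
# Minimal sketch: call the OpenAI-style chat completions endpoint exposed
# by mindie-service. Host and port are the ServerConfig values from above.
import requests

url = "http://127.0.0.1:31003/v1/chat/completions"
payload = {
    "model": "qwen2p5_72b_baseline",  # must match modelName in config.json
    "messages": [{"role": "user", "content": "Hello."}],
    "max_tokens": 32,
}

resp = requests.post(url, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```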