0x1 Getting Started
My school recently purchased a batch of Kunpeng servers equipped with Ascend 910B2 NPUs. Since I don't currently have any good fine-tuning ideas or projects, I decided to put these machines to work on inference first, which also saves a good chunk of the budget otherwise spent on paid API usage. With that goal I started exploring, found no complete write-up on the subject online, and so wrote this article.
Since our school uses the Xirang integrated AI computing service platform built on CTYun (China Telecom Cloud), the low-level physical-machine environment such as CANN and the NPU driver is already taken care of for me, so I start directly from how to obtain the MindIE image.
My machine should correspond to the Atlas 800I A2 model, with Ascend 910B2 NPUs. We can confirm this by running
npu-smi info
which prints the following information; the machine I requested this time has 8 cards:
+------------------------------------------------------------------------------------------------+
| npu-smi 23.0.3 Version: 23.0.3 |
+---------------------------+---------------+----------------------------------------------------+
| NPU Name | Health | Power(W) Temp(C) Hugepages-Usage(page)|
| Chip | Bus-Id | AICore(%) Memory-Usage(MB) HBM-Usage(MB) |
+===========================+===============+====================================================+
| 0 910B2 | OK | 97.0 53 0 / 0 |
| 0 | 0000:C1:00.0 | 0 0 / 0 53701/ 65536 |
+===========================+===============+====================================================+
| 1 910B2 | OK | 101.8 57 0 / 0 |
| 0 | 0000:01:00.0 | 0 0 / 0 53698/ 65536 |
+===========================+===============+====================================================+
| 2 910B2 | OK | 99.5 54 0 / 0 |
| 0 | 0000:C2:00.0 | 0 0 / 0 53699/ 65536 |
+===========================+===============+====================================================+
| 3 910B2 | OK | 102.7 57 0 / 0 |
| 0 | 0000:02:00.0 | 0 0 / 0 53698/ 65536 |
+===========================+===============+====================================================+
| 4 910B2 | OK | 106.9 54 0 / 0 |
| 0 | 0000:81:00.0 | 0 0 / 0 3331 / 65536 |
+===========================+===============+====================================================+
| 5 910B2 | OK | 99.1 55 0 / 0 |
| 0 | 0000:41:00.0 | 0 0 / 0 3331 / 65536 |
+===========================+===============+====================================================+
| 6 910B2 | OK | 97.1 56 0 / 0 |
| 0 | 0000:82:00.0 | 0 0 / 0 3331 / 65536 |
+===========================+===============+====================================================+
| 7 910B2 | OK | 101.3 59 0 / 0 |
| 0 | 0000:42:00.0 | 0 0 / 0 3331 / 65536 |
+===========================+===============+====================================================+
As is customary, let's briefly introduce what MindIE is. The official documentation puts it as follows:
MindIE (Mind Inference Engine, the Ascend inference engine) is Huawei Ascend's inference acceleration suite for all-scenario AI workloads.
In short: if you only want to deploy an inference service, use MindIE; for pre-training or fine-tuning, use MindSpeed-LLM instead.
The list of LLMs currently supported by MindIE can be viewed at:
https://www.hiascend.com/document/detail/zh/mindie/100/whatismindie/mindie_what_0003.html
0x2 Image Preparation
The MindIE Docker image can be obtained from the URL below. For well-known reasons you need to apply for access, but it is not a big hurdle; approval usually arrives within one or two days:
https://www.hiascend.com/developer/ascendhub/detail/af85b724a7e5469ebd7ea13c3439d48f
I chose the 1.0.T71-800I-A2-py311-ubuntu22.04-arm64 version. After obtaining the image, I recommend baking a code-server into it, which makes the later deployment steps much more convenient; here I referred to a fellow student's write-up. It is best to download the Linux release package of code-server from GitHub in advance:
https://github.com/coder/code-server/releases
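For example, assuming the 4.96.2 release that the Dockerfile below expects, the arm64 tarball can be fetched like this (the file name follows code-server's release naming convention; adjust the version if you pick a different one):
# Download the arm64 release tarball of code-server (version 4.96.2 assumed here)
wget https://github.com/coder/code-server/releases/download/v4.96.2/code-server-4.96.2-linux-arm64.tar.gz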
Next we need to repackage the image, for which we write a Dockerfile:
# Change the image name here to yours; it is most likely the same as mine
FROM swr.cn-south-1.myhuaweicloud.com/ascendhub/mindie:1.0.T71-800I-A2-py311-ubuntu22.04-arm64
# Change VERSION to the version you downloaded; I have not tested what happens if it is left unset
ENV DEBIAN_FRONTEND=noninteractive
ENV VERSION=4.96.2
RUN apt-get update && apt-get install -y \
tar \
sudo \
&& rm -rf /var/lib/apt/lists/*
# Copy the code-server tarball you downloaded into the image
COPY code-server-4.96.2-linux-arm64.tar.gz /tmp/code-server.tar.gz
RUN mkdir -p /home/root/.local/lib /home/root/.local/bin && \
tar -xzf /tmp/code-server.tar.gz -C /home/root/.local/lib && \
mv /home/root/.local/lib/code-server-$VERSION-linux-arm64 /home/root/.local/lib/code-server-$VERSION && \
ln -s /home/root/.local/lib/code-server-$VERSION/bin/code-server /home/root/.local/bin/code-server && \
rm /tmp/code-server.tar.gz
ENV PATH="/home/root/.local/bin:$PATH"
RUN mkdir -p /home/root/.config/code-server && \
echo "bind-addr: 0.0.0.0:8080" > /home/root/.config/code-server/config.yaml && \
echo "auth: none" >> /home/root/.config/code-server/config.yaml
EXPOSE 8080
CMD ["code-server", "--config", "/home/root/.config/code-server/config.yaml"]
Next we start a container from this image. Note that once it is started you should not upgrade any components, to avoid unnecessary compatibility problems.
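As a rough sketch, building and launching could look like the following. The image tag and container name are just examples, and the device and driver mounts shown are the usual ones for Ascend containers; on a managed platform such as Xirang the container launch may already be handled for you, in which case only the build step applies.
# Example image tag and container name; adjust to your setup
docker build -t mindie-code-server:1.0.T71 .
# Typical Ascend device/driver mounts; a managed platform may handle the launch for you
docker run -itd --name mindie \
    --device=/dev/davinci_manager --device=/dev/devmm_svm --device=/dev/hisi_hdc \
    --device=/dev/davinci0 --device=/dev/davinci1 --device=/dev/davinci2 --device=/dev/davinci3 \
    --device=/dev/davinci4 --device=/dev/davinci5 --device=/dev/davinci6 --device=/dev/davinci7 \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    -p 8080:8080 \
    mindie-code-server:1.0.T71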
0x3 Test-Deploying MindIE
First we need the weight files of the LLM. Within mainland China these are usually fetched from the ModelScope community: look up the name of the model you need on ModelScope, then download it with the following script:
from modelscope.hub.snapshot_download import snapshot_download
model_name = "Qwen/Qwen2.5-14B-Instruct"
# Local directory to save the model to
local_dir = 'Qwen2.5-14B-Instruct'
# Download the model
snapshot_path = snapshot_download(model_name, local_dir=local_dir)
print(f'Model downloaded to: {snapshot_path}')
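The script assumes the modelscope package is available in the container; if it is not, install it first and then run the script (the file name download_qwen.py here is just an example):
pip install -U modelscope
python download_qwen.py   # example file name for the script above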
Once the model has been downloaded, enter the model directory and adjust the permissions:
cd Qwen2.5-14B-Instruct
chmod -R 750 ./*
Next we run a test against ATB Models, following this document:
https://www.hiascend.com/document/detail/zh/mindie/100/mindiellm/llmdev/mindie_llm0009.html
Configure the environment variables:
# Set up the CANN environment (installed under /usr/local by default)
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# Set up the acceleration library (ATB) environment
source /usr/local/Ascend/nnal/atb/set_env.sh
# Set up the model repository environment variables
source /usr/local/Ascend/llm_model/set_env.sh
Test chat inference:
cd ${ATB_SPEED_HOME_PATH}
bash examples/models/qwen/run_pa.sh -m /root/Qwen2.5-14B-Instruct
The default prompt is "What's deep learning?" with a batch size of 1. If everything is correct it should return output like the following:
[2025-01-18 06:08:49,033] [51235] [281460059148384] [llm] [INFO][logging.py-227] : ---------------begin warm_up---------------
[2025-01-18 06:08:49,033] [51235] [281460059148384] [llm] [INFO][cache.py-102] : kv cache will allocate 0.0263671875GB memory
[2025-01-18 06:08:49,035] [51235] [281460059148384] [llm] [INFO][logging.py-227] : ------total req num: 1, infer start--------
[2025-01-18 06:08:49,042] [51241] [281463369568352] [llm] [INFO][cache.py-102] : kv cache will allocate 0.0263671875GB memory
[2025-01-18 06:08:49,369] [51237] [281467581304928] [llm] [INFO][cache.py-102] : kv cache will allocate 0.0263671875GB memory
[2025-01-18 06:08:49,482] [51236] [281468636631136] [llm] [INFO][cache.py-102] : kv cache will allocate 0.0263671875GB memory
[2025-01-18 06:08:49,523] [51238] [281466744803424] [llm] [INFO][cache.py-102] : kv cache will allocate 0.0263671875GB memory
[2025-01-18 06:08:49,821] [51235] [281460059148384] [llm] [INFO][logging.py-227] : <<<<<<< ori k_caches[0].shape=torch.Size([9, 128, 1, 128])
[2025-01-18 06:08:49,822] [51235] [281460059148384] [llm] [INFO][logging.py-227] : >>>>>>id of kcache is 281457212115664 id of vcache is 281457212117200
[2025-01-18 06:08:50,228] [51239] [281459403984992] [llm] [INFO][cache.py-102] : kv cache will allocate 0.0263671875GB memory
[2025-01-18 06:08:51,617] [51242] [281469175402592] [llm] [INFO][cache.py-102] : kv cache will allocate 0.0263671875GB memory
[2025-01-18 06:08:52,029] [51240] [281463613362272] [llm] [INFO][cache.py-102] : kv cache will allocate 0.0263671875GB memory
[2025-01-18 06:08:54,582] [51235] [281460059148384] [llm] [INFO][logging.py-227] : warmup_memory(GB): 8.32
[2025-01-18 06:08:54,582] [51235] [281460059148384] [llm] [INFO][logging.py-227] : ---------------end warm_up---------------
[2025-01-18 06:08:54,582] [51235] [281460059148384] [llm] [INFO][logging.py-227] : ---------------begin inference---------------
[2025-01-18 06:08:54,614] [51235] [281460059148384] [llm] [INFO][logging.py-227] : ------total req num: 1, infer start--------
[2025-01-18 06:08:55,016] [51235] [281460059148384] [llm] [INFO][logging.py-227] : ---------------end inference---------------
[2025-01-18 06:08:55,016] [51235] [281460059148384] [llm] [INFO][logging.py-227] : Answer[0]: Deep learning is a subset of machine learning that involves the use of artificial neural networks to model and solve
[2025-01-18 06:08:55,016] [51235] [281460059148384] [llm] [INFO][logging.py-227] : Generate[0] token num: (0, 20)
Once this checks out, we move on to deploying the MindIE service proper:
cd $MIES_INSTALL_PATH
vim conf/config.json
First, turn off the HTTPS option:
"httpsEnabled" : false,
Then modify the model configuration section:
"ModelDeployConfig" :
{
"maxSeqLen" : 2560,
"maxInputTokenLen" : 2048,
"truncation" : false,
"ModelConfig" : [
{
"modelInstanceType" : "Standard",
"modelName" : "Qwen25-14B",
"modelWeightPath" : "/root/Qwen2.5-14B-Instruct",
"worldSize" : 4,
"cpuMemSize" : 5,
"npuMemSize" : -1,
"backendType" : "atb",
"trustRemoteCode" : true
}
]
},
Save and exit, then run the service to test it:
./bin/mindieservice_daemon
Open another shell and send a request to test:
curl -H "Accept: application/json" -H "Content-type: application/json" -X POST -d '{
"model": "Qwen25-14B",
"messages": [{
"role": "user",
"content": "Hello"
}],
"stream": false
}' http://127.0.0.1:1025/v1/chat/completions
If everything is working properly you should get a reply:
{"id":"endpoint_common_1","object":"chat.completion","created":1737181163,"model":"Qwen25-14B","choices":[{"index":0,"message":{"role":"assistant","content":"Hello! How can I assist you today?","tool_calls":null},"finish_reason":"stop"}],"usage":{"prompt_tokens":30,"completion_tokens":10,"total_tokens":40},"prefill_time":236,"decode_time_arr":[109,18,17,16,16,17,17,16,17]}
0x4 Production Deployment
For the actual deployment, you should first initialize the environment variables:
# Set up the CANN environment (installed under /usr/local by default)
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# Set up the acceleration library (ATB) environment
source /usr/local/Ascend/nnal/atb/set_env.sh
# Set up the model repository environment variables
source /usr/local/Ascend/llm_model/set_env.sh
# MindIE
source /usr/local/Ascend/mindie/latest/mindie-llm/set_env.sh
source /usr/local/Ascend/mindie/latest/mindie-service/set_env.sh
Then launch the daemon in the background like this:
nohup ./bin/mindieservice_daemon > output.log 2>&1 &
To watch the logs, simply run:
tail -f output.log
If you see Daemon start success!, the deployment has succeeded and you can exit tail.
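If you would rather script this check than watch the log, something like the following polls for the success line shown above (a small sketch):
# Wait until the daemon reports success, then show the matching line
until grep -q "Daemon start success" output.log; do sleep 2; done
grep -m1 "Daemon start success" output.log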
If you are using the code-server (VS Code in the browser) baked into the image earlier, you will find it automatically provides port forwarding, which is very convenient and saves you from deploying something like FRP.
From the external network we can now adapt the earlier test request:
curl -H "Accept: application/json" -H "Content-type: application/json" -X POST -d '{
"model": "Qwen25-14B",
"messages": [{
"role": "user",
"content": "Hello"
}],
"stream": false
}' https://cloudwarrior-ai.ctyun.cn/central/200000001852/vscode/562e1d4f-b372-4809-b99d-f130640a770f/e3feec1f-94b2-492f-9faf-d22b1b3fa9e5/proxy/1025/v1/chat/completions
I also tested with NextChat, filling in the configuration as follows:
- API endpoint: copy the forwarded address directly from the VS Code (code-server) panel
- Model name: the model name you used when deploying the service
Q&A then works correctly.
Supplementary configuration file
Here is a small addition: the deployment configuration I use for Qwen2.5-72B. The main changes are a 32K context length, an 8K generation window, and support for 8-card inference.
{
"Version" : "1.0.0",
"LogConfig" :
{
"logLevel" : "Info",
"logFileSize" : 20,
"logFileNum" : 20,
"logPath" : "logs/mindie-server.log"
},
"ServerConfig" :
{
"ipAddress" : "127.0.0.1",
"managementIpAddress" : "127.0.0.2",
"port" : 1025,
"managementPort" : 1026,
"metricsPort" : 1027,
"allowAllZeroIpListening" : false,
"maxLinkNum" : 1000,
"httpsEnabled" : false,
"fullTextEnabled" : false,
"tlsCaPath" : "security/ca/",
"tlsCaFile" : ["ca.pem"],
"tlsCert" : "security/certs/server.pem",
"tlsPk" : "security/keys/server.key.pem",
"tlsPkPwd" : "security/pass/key_pwd.txt",
"tlsCrlPath" : "security/certs/",
"tlsCrlFiles" : ["server_crl.pem"],
"managementTlsCaFile" : ["management_ca.pem"],
"managementTlsCert" : "security/certs/management/server.pem",
"managementTlsPk" : "security/keys/management/server.key.pem",
"managementTlsPkPwd" : "security/pass/management/key_pwd.txt",
"managementTlsCrlPath" : "security/management/certs/",
"managementTlsCrlFiles" : ["server_crl.pem"],
"kmcKsfMaster" : "tools/pmt/master/ksfa",
"kmcKsfStandby" : "tools/pmt/standby/ksfb",
"inferMode" : "standard",
"interCommTLSEnabled" : true,
"interCommPort" : 1121,
"interCommTlsCaPath" : "security/grpc/ca/",
"interCommTlsCaFiles" : ["ca.pem"],
"interCommTlsCert" : "security/grpc/certs/server.pem",
"interCommPk" : "security/grpc/keys/server.key.pem",
"interCommPkPwd" : "security/grpc/pass/key_pwd.txt",
"interCommTlsCrlPath" : "security/grpc/certs/",
"interCommTlsCrlFiles" : ["server_crl.pem"],
"openAiSupport" : "vllm"
},
"BackendConfig" : {
"backendName" : "mindieservice_llm_engine",
"modelInstanceNumber" : 1,
"npuDeviceIds" : [[0,1,2,3,4,5,6,7]],
"tokenizerProcessNumber" : 8,
"multiNodesInferEnabled" : false,
"multiNodesInferPort" : 1120,
"interNodeTLSEnabled" : true,
"interNodeTlsCaPath" : "security/grpc/ca/",
"interNodeTlsCaFiles" : ["ca.pem"],
"interNodeTlsCert" : "security/grpc/certs/server.pem",
"interNodeTlsPk" : "security/grpc/keys/server.key.pem",
"interNodeTlsPkPwd" : "security/grpc/pass/mindie_server_key_pwd.txt",
"interNodeTlsCrlPath" : "security/grpc/certs/",
"interNodeTlsCrlFiles" : ["server_crl.pem"],
"interNodeKmcKsfMaster" : "tools/pmt/master/ksfa",
"interNodeKmcKsfStandby" : "tools/pmt/standby/ksfb",
"ModelDeployConfig" :
{
"maxSeqLen" : 32768,
"maxInputTokenLen" : 32767,
"truncation" : false,
"ModelConfig" : [
{
"modelInstanceType" : "Standard",
"modelName" : "qwen25_72b",
"modelWeightPath" : "/root/Qwen2.5-72B-Instruct",
"worldSize" : 8,
"cpuMemSize" : 10,
"npuMemSize" : -1,
"backendType" : "atb",
"trustRemoteCode" : false
}
]
},
"ScheduleConfig" :
{
"templateType" : "Standard",
"templateName" : "Standard_LLM",
"cacheBlockSize" : 128,
"maxPrefillBatchSize" : 50,
"maxPrefillTokens" : 32768,
"prefillTimeMsPerReq" : 150,
"prefillPolicyType" : 0,
"decodeTimeMsPerReq" : 50,
"decodePolicyType" : 0,
"maxBatchSize" : 200,
"maxIterTimes" : 8192,
"maxPreemptCount" : 0,
"supportSelectBatch" : false,
"maxQueueDelayMicroseconds" : 5000
}
}
}
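With this configuration in place, the same kind of request as before can be used to sanity-check the 72B deployment; note that the model field must match the modelName above:
curl -H "Accept: application/json" -H "Content-type: application/json" -X POST -d '{
    "model": "qwen25_72b",
    "messages": [{
        "role": "user",
        "content": "Hello"
    }],
    "stream": false
}' http://127.0.0.1:1025/v1/chat/completions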