[Ascend Inference Optimization] Deployment Guide for the Mooncake Component on Ascend 910B

Original article, last modified 2025-12-27 14:38:28

License: CC 4.0 BY-SA

First published 2025-12-08 16:18:55

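Main problems addressed:

Version mismatches inside the official vllm-ascend image
Mooncake configuration
Environment configuration for launching vllm-ascend

The guide ends with a performance evaluation, and the results are broadly in line with expectations.

Detailed configuration steps follow:

1. Mooncake environment setup

1.1 Installing Mooncake

Mooncake has to be downloaded and built from source. Installing it directly with pip did not work in this environment.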

git clone https://github.com/kvcache-ai/Mooncake.git

apt-get install mpich libmpich-dev -y

cd Mooncake
bash dependencies.sh -y
mkdir build
cd build
cmake .. -DUSE_ASCEND_DIRECT=ON    ## required for the Ascend environment
make -j
make install

Verifying the installation

python3 -c "import mooncake"
No error means the installation succeeded.

1.2 Starting Mooncake

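## Launch command
mooncake_master --port 50088 --eviction_high_watermark_ratio 0.9 --eviction_ratio 0.05 --rpc_thread_num 32 --metrics_port 10022

## Log
## Note: this log was captured from a run on RPC port 50058 with metrics port 9003, not the 50088/10022 shown in the command above. The deprecation warning in the log also indicates that --port is superseded by --rpc_port.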
WARNING: Logging before InitGoogleLogging() is written to STDERR
W20251205 13:07:34.889658 162445 master.cpp:133] port is deprecated, use rpc_port instead
I20251205 13:07:34.889760 162445 master.cpp:296] Master service started on port 50058, enable_gc=0, max_threads=32, enable_metric_reporting=1, metrics_port=9003, default_kv_lease_ttl=5000, default_kv_soft_pin_ttl=1800000, allow_evict_soft_pinned_objects=1, eviction_ratio=0.05, eviction_high_watermark_ratio=0.9, enable_ha=0, etcd_endpoints=, client_ttl=10, rpc_thread_num=32, rpc_port=50058, rpc_address=0.0.0.0, rpc_conn_timeout_seconds=0, rpc_enable_tcp_no_delay=1, cluster_id=mooncake_cluster, memory_allocator=offset
I20251205 13:07:34.908459 162445 rpc_service.cpp:181] HTTP metrics server started on port 9003
I20251205 13:07:34.908710 162453 rpc_service.cpp:49] Master Metrics: Storage: 0 B / 0 B | Keys: 0 (soft-pinned: 0) | Requests (Success/Total): PutStart=0/0, PutEnd=0/0, PutRevoke=0/0, Get=0/0, Exist=0/0, Del=0/0, DelAll=0/0,  | Batch Requests (Req=Success/PartialSuccess/Total, Item=Success/Total): PutStart:(Req=0/0/0, Item=0/0), PutEnd:(Req=0/0/0, Item=0/0), PutRevoke:(Req=0/0/0, Item=0/0), Get:(Req=0/0/0, Item=0/0), ExistKey:(Req=0/0/0, Item=0/0),  | Eviction: Success/Attempts=0/0, keys=0, size=0 B
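
Because the client configuration must point at the port the master actually listens on, a quick reachability check can help before starting the inference service. The following is a minimal sketch using only the Python standard library. The helper name `port_is_open` is hypothetical, and the host and ports in the usage section are taken from the log above (50058 for RPC, 9003 for metrics); adjust them to your own deployment.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False


if __name__ == "__main__":
    # Ports as reported in the master log above; the host is an assumption.
    for name, port in (("rpc", 50058), ("metrics", 9003)):
        print(name, port, "open" if port_is_open("127.0.0.1", port) else "closed")
```

A successful TCP connection only shows that something is listening; it does not confirm that the listener is mooncake_master.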

2. Functional test

Prepare a Mooncake configuration file, which is later passed to the inference service. A sample:

{
    "local_hostname": "xxxxxx",                  
    "metadata_server": "P2PHANDSHAKE",                
    "protocol": "ascend",
    "device_name": "",
    "global_segment_size": 62474836480,
    "master_server_address": "xxxxxxx:50088",
    "use_ascend_direct":true
}
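
The global_segment_size field is given in bytes, which is hard to read at a glance. The sketch below converts it to GiB and renders a config with the same keys as the sample. It assumes the size is interpreted in units of 1024^3 bytes, which is consistent with the master log later in this guide reporting 98.23 GB for a value of 105474836480. The helper names `segment_size_gib` and `make_config` are hypothetical.

```python
import json

GIB = 1024 ** 3  # bytes per GiB


def segment_size_gib(size_bytes: int) -> float:
    """Convert a global_segment_size value in bytes to GiB, rounded to 2 places."""
    return round(size_bytes / GIB, 2)


def make_config(local_hostname: str, master_address: str, size_bytes: int) -> str:
    """Render a Mooncake client config with the same keys as the sample."""
    config = {
        "local_hostname": local_hostname,
        "metadata_server": "P2PHANDSHAKE",
        "protocol": "ascend",
        "device_name": "",
        "global_segment_size": size_bytes,
        "master_server_address": master_address,
        "use_ascend_direct": True,
    }
    return json.dumps(config, indent=4)


if __name__ == "__main__":
    print(segment_size_gib(62474836480))  # size used in the sample
    print(make_config("xxxxxx", "xxxxxxx:50088", 62474836480))
```

Generating the file with json.dumps also avoids hand-editing mistakes such as a trailing comma, which would make the JSON invalid.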

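2.1 vllm-ascend + Mooncake

Building the environment

Image source: ascend/vllm-ascend · Quay

The vllm-ascend 0.11.0rc3 release already ships with the Mooncake component.

However, the vllm and vllm-ascend versions inside the image do not match.

The related components therefore have to be reinstalled.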

# pip list | grep torch
torch                             2.7.1+cpu
torch_npu                         2.7.1
torchvision                       0.22.1

# pip list | grep vllm
vllm                              0.11.2+empty  /vllm-workspace/vllm
vllm_ascend                       0.11.0rc3     /vllm-workspace/vllm-ascend

vllm has to be downgraded to 0.11.0.

Downgrading vllm alone also fails, with this error:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torch-npu 2.7.1 requires torch==2.7.1, but you have torch 2.8.0 which is incompatible.
vllm-ascend 0.11.0rc3 requires torch==2.7.1, but you have torch 2.8.0 which is incompatible.

vllm and the torch packages are also version-coupled, so the related packages have to be updated together:

pip install vllm==0.11.0 torch_npu==2.8.0
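
To see at a glance whether the installed packages line up, the versions can be read from package metadata. This is a minimal sketch, not an official compatibility check: the rule "torch and torch_npu must share the same major.minor version" is a simplifying assumption (the pip error above actually demands an exact match), and the helper names are hypothetical. Distribution names are passed as shown by pip list; if a lookup returns None on your system, try the hyphenated form such as torch-npu.

```python
from importlib import metadata


def major_minor(version: str) -> tuple:
    """Return (major, minor) from a version string such as '2.7.1+cpu'."""
    public = version.split("+", 1)[0]  # drop a local suffix like '+cpu'
    major, minor = public.split(".")[:2]
    return int(major), int(minor)


def torch_pair_matches(torch_version: str, torch_npu_version: str) -> bool:
    """Assumed rule: torch and torch_npu agree on major.minor."""
    return major_minor(torch_version) == major_minor(torch_npu_version)


def installed_versions(names):
    """Look up installed versions; packages that are not installed map to None."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    versions = installed_versions(["torch", "torch_npu", "vllm", "vllm_ascend"])
    print(versions)
    if versions["torch"] and versions["torch_npu"]:
        ok = torch_pair_matches(versions["torch"], versions["torch_npu"])
        print("torch/torch_npu major.minor match:", ok)
```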

After the installation completes, test it with vllm serve:

vllm serve --help
INFO 12-08 13:40:01 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 12-08 13:40:01 [__init__.py:38] - ascend -> vllm_ascend:register
INFO 12-08 13:40:01 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 12-08 13:40:01 [__init__.py:207] Platform plugin ascend is activated
WARNING 12-08 13:40:07 [_custom_ops.py:20] Failed to import from vllm._C with ImportError('libcudart.so.12: cannot open shared object file: No such file or directory')
WARNING 12-08 13:40:07 [registry.py:582] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
WARNING 12-08 13:40:07 [registry.py:582] Model architecture Qwen3VLMoeForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl_without_padding:AscendQwen3VLMoeForConditionalGeneration.
WARNING 12-08 13:40:07 [registry.py:582] Model architecture Qwen3VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl_without_padding:AscendQwen3VLForConditionalGeneration.
WARNING 12-08 13:40:07 [registry.py:582] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 12-08 13:40:07 [registry.py:582] Model architecture Qwen2_5OmniModel is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_omni_thinker:AscendQwen2_5OmniThinkerForConditionalGeneration.
WARNING 12-08 13:40:07 [registry.py:582] Model architecture DeepseekV32ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v3_2:CustomDeepseekV3ForCausalLM.
WARNING 12-08 13:40:07 [registry.py:582] Model architecture Qwen3NextForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_next:CustomQwen3NextForCausalLM.
INFO 12-08 13:40:08 [importing.py:63] Triton not installed or not compatible; certain GPU-related functions will not be available.
usage: vllm serve [model_tag] [options]

Launch a local OpenAI-compatible API server to serve LLM
completions via HTTP. Defaults to Qwen/Qwen3-0.6B if no model is specified.

Search by using: `--help=<ConfigGroup>` to explore options by section (e.g.,
--help=ModelConfig, --help=Frontend)
  Use `--help=all` to show all available flags at once.

Config Groups:
  positional arguments    
  options                 
  Frontend                Arguments for the OpenAI-compatible frontend server.
  ModelConfig             Configuration for the model.
  LoadConfig              Configuration for loading the model weights.
  StructuredOutputsConfig Dataclass which contains structured outputs config for the engine.
  ParallelConfig          Configuration for the distributed execution.
  CacheConfig             Configuration for the KV cache.
  MultiModalConfig        Controls the behavior of multimodal models.
  LoRAConfig              Configuration for LoRA.
  ObservabilityConfig     Configuration for observability - metrics and tracing.
  SchedulerConfig         Scheduler configuration.
  VllmConfig              Dataclass which contains all vllm-related configuration. This
      simplifies passing around the distinct configurations in the codebase.
      

For full list:            vllm serve --help=all
For a section:            vllm serve --help=ModelConfig    (case-insensitive)
For a flag:               vllm serve --help=max-model-len  (_ or - accepted)
Documentation:            https://docs.vllm.ai

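Mooncake configuration file with roughly 100 GB of storage. The global_segment_size of 105474836480 bytes is reported by the master log further below as 98.23 GB of Mem Storage, not SSD storage.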

{
    "local_hostname": "1.1.1.1",                  
    "metadata_server": "P2PHANDSHAKE",                
    "protocol": "ascend",
    "device_name": "",
    "global_segment_size": 105474836480,
    "master_server_address": "2.2.2.2:50058",
    "use_ascend_direct":true,
    "alloc_in_same_node": true
}

Running the service

export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

export PYTHONPATH=$PYTHONPATH:/vllm-workspace/vllm

export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
export MOONCAKE_CONFIG_PATH="/opt/files/src/kv-cache/conf/mooncake_vllm.json"
export ACL_OP_INIT_MODE=1
export ASCEND_BUFFER_POOL=4:8
export ASCEND_CONNECT_TIMEOUT=10000
export ASCEND_TRANSFER_TIMEOUT=10000


vllm serve /opt/models/Qwen2p5-7B-Instruct/ \
    --served-model-name qwen2p5-7b-mooncake \
    --dtype bfloat16 \
    --max-model-len 32768 \
    --tensor-parallel-size 1 \
    --host 0.0.0.0 \
    --port 31001 \
    --enforce-eager \
    --enable-prefix-caching \
    --block-size 128 \
    --max-num-batched-tokens 8192 \
    --gpu-memory-utilization 0.59 \
    --kv-transfer-config '{
        "kv_connector": "MooncakeConnectorStoreV1",
        "kv_role": "kv_both",
        "kv_connector_extra_config": {
            "use_layerwise": false,
            "mooncake_rpc_port": "0",
            "load_async": true,
            "register_buffer": true
        }
    }'
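
Hand-writing JSON inside a shell-quoted argument is easy to get wrong. As a sketch, the same --kv-transfer-config value can be produced with json.dumps and the command assembled as an argument list. The keys and values are copied from the command above; the helper names are hypothetical, and only a subset of the flags is included for brevity.

```python
import json
import shlex


def kv_transfer_config() -> str:
    """Serialize the kv-transfer-config used in the launch command."""
    cfg = {
        "kv_connector": "MooncakeConnectorStoreV1",
        "kv_role": "kv_both",
        "kv_connector_extra_config": {
            "use_layerwise": False,
            "mooncake_rpc_port": "0",
            "load_async": True,
            "register_buffer": True,
        },
    }
    return json.dumps(cfg)


def serve_command(model_path: str, port: int) -> list:
    """Build an argument list for vllm serve (subset of the flags shown above)."""
    return [
        "vllm", "serve", model_path,
        "--port", str(port),
        "--enable-prefix-caching",
        "--block-size", "128",
        "--kv-transfer-config", kv_transfer_config(),
    ]


if __name__ == "__main__":
    # shlex.join quotes the JSON argument so it can be pasted into a shell.
    print(shlex.join(serve_command("/opt/models/Qwen2p5-7B-Instruct/", 31001)))
```

Passing the list to subprocess.run avoids shell quoting altogether, while shlex.join gives a copy-pasteable command line.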

Key logs during startup:

#### Key logs from the inference framework:

(EngineCore_DP0 pid=2392) INFO 12-08 14:01:29 [factory.py:51] Creating v1 connector with name: MooncakeConnectorV1 and engine_id: 3e9a8a19-e986-4353-b826-e25fe09b5146
(EngineCore_DP0 pid=2392) WARNING 12-08 14:01:29 [base.py:86] Initializing KVConnectorBase_V1. This API is experimental and subject to change in the future as we iterate the design.
WARNING: Logging before InitGoogleLogging() is written to STDERR
I20251208 14:01:29.602275  2392 transfer_engine.cpp:486] Metrics reporting is disabled (set MC_TE_METRIC=1 to enable)
I20251208 14:01:29.602348  2392 transfer_engine.cpp:91] Transfer Engine parseHostNameWithPort. server_name: 30.189.250.94 port: 12001
I20251208 14:01:29.602769  2392 transfer_engine.cpp:146] Transfer Engine RPC using P2P handshake, listening on 30.189.250.94:16341
I20251208 14:01:29.602919  2392 ascend_direct_transport.cpp:86] install AscendDirectTransport for: 30.189.250.94:16341
I20251208 14:01:29.602973  2392 ascend_direct_transport.cpp:477] Find available between 26000 and 27000
I20251208 14:01:29.603039  2392 ascend_direct_transport.cpp:442] AscendDirectTransport set segment desc: host_ip=30.189.250.94, host_port=26957, deviceLogicId=0
I20251208 14:01:29.603081  2392 ascend_direct_transport.cpp:164] Set adxl.BufferPool to:4:8
I20251208 14:01:29.611251  2392 ascend_direct_transport.cpp:177] Success to initialize adxl engine:30.189.250.94:26957 with device_id:0
I20251208 14:01:29.611310  2392 ascend_direct_transport.cpp:186] Set connection timeout to:10000
I20251208 14:01:29.611330  2392 ascend_direct_transport.cpp:195] Set transfer timeout to:10000
I20251208 14:01:29.613250  2591 ascend_direct_transport.cpp:512] AscendDirectTransport worker thread started
I20251208 14:01:29.613384  2392 client_metric.cpp:76] Client metrics enabled (default enabled)

....
....

(EngineCore_DP0 pid=2392) INFO 12-08 14:01:39 [worker_v1.py:256] Available memory: 21181885132, total memory: 65452113920
(EngineCore_DP0 pid=2392) INFO 12-08 14:01:39 [kv_cache_utils.py:1087] GPU KV cache size: 369,280 tokens
(EngineCore_DP0 pid=2392) INFO 12-08 14:01:39 [kv_cache_utils.py:1091] Maximum concurrency for 32,768 tokens per request: 11.27x
(EngineCore_DP0 pid=2392) INFO 12-08 14:01:39 [mooncake_engine.py:102] num_blocks: 2975, block_shape: torch.Size([128, 4, 128])
(EngineCore_DP0 pid=2392) INFO 12-08 14:01:39 [mooncake_engine.py:105] Registering KV_Caches. use_mla: False, shape torch.Size([2885,, 128, 4, 128])


### mooncake_master log
### after the inference framework has started successfully:

I20251208 14:01:26.540841   484 rpc_service.cpp:40] Master Metrics: Mem Storage: 0 B / 0 B | SSD Storage: 0 B / 0 B | Keys: 0 (soft-pinned: 0) | Requests (Success/Total): PutStart=0/0, PutEnd=0/0, PutRevoke=0/0, Get=0/0, Exist=0/0, Del=0/0, DelAll=0/0,  | Batch Requests (Req=Success/PartialSuccess/Total, Item=Success/Total): PutStart:(Req=0/0/0, Item=0/0), PutEnd:(Req=0/0/0, Item=0/0), PutRevoke:(Req=0/0/0, Item=0/0), Get:(Req=0/0/0, Item=0/0), ExistKey:(Req=0/0/0, Item=0/0),  | Eviction: Success/Attempts=0/0, keys=0, size=0 B
I20251208 14:01:29.618252   490 master_service.cpp:651] Storage root directory or cluster ID is not set. persisting data is disabled.
I20251208 14:01:36.541100   484 rpc_service.cpp:40] Master Metrics: Mem Storage: 0 B / 98.23 GB (0.0%) | SSD Storage: 0 B / 0 B | Keys: 0 (soft-pinned: 0) | Requests (Success/Total): PutStart=0/0, PutEnd=0/0, PutRevoke=0/0, Get=0/0, Exist=0/0, Del=0/0, DelAll=0/0,  | Batch Requests (Req=Success/PartialSuccess/Total, Item=Success/Total): PutStart:(Req=0/0/0, Item=0/0), PutEnd:(Req=0/0/0, Item=0/0), PutRevoke:(Req=0/0/0, Item=0/0), Get:(Req=0/0/0, Item=0/0), ExistKey:(Req=0/0/0, Item=0/0),  | Eviction: Success/Attempts=0/0, keys=0, size=0 B
I20251208 14:01:46.541324   484 rpc_service.cpp:40] Master Metrics: Mem Storage: 0 B / 98.23 GB (0.0%) | SSD Storage: 0 B / 0 B | Keys: 0 (soft-pinned: 0) | Requests (Success/Total): PutStart=0/0, PutEnd=0/0, PutRevoke=0/0, Get=0/0, Exist=0/0, Del=0/0, DelAll=0/0,  | Batch Requests (Req=Success/PartialSuccess/Total, Item=Success/Total): PutStart:(Req=0/0/0, Item=0/0), PutEnd:(Req=0/0/0, Item=0/0), PutRevoke:(Req=0/0/0, Item=0/0), Get:(Req=0/0/0, Item=0/0), ExistKey:(Req=0/0/0, Item=0/0),  | Eviction: Success/Attempts=0/0, keys=0, size=0 B
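
The change from "0 B / 0 B" to "0 B / 98.23 GB" in Mem Storage is the sign that the inference service registered its segment with the master. A small parser can watch for this. The sketch below relies only on the line format visible in the log above; the regular expressions and the helper name `parse_master_metrics` are hypothetical, and the format may differ in other Mooncake versions.

```python
import re

# Patterns derived from the "Master Metrics" lines shown above.
MEM_RE = re.compile(r"Mem Storage: ([\d.]+) (\w+) / ([\d.]+) (\w+)")
KEYS_RE = re.compile(r"Keys: (\d+)")


def parse_master_metrics(line: str):
    """Extract used/total memory storage and key count, or None if absent."""
    mem = MEM_RE.search(line)
    keys = KEYS_RE.search(line)
    if mem is None or keys is None:
        return None
    return {
        "used": (float(mem.group(1)), mem.group(2)),
        "total": (float(mem.group(3)), mem.group(4)),
        "keys": int(keys.group(1)),
    }


def segment_registered(line: str) -> bool:
    """True once the master reports a non-zero total memory capacity."""
    parsed = parse_master_metrics(line)
    return parsed is not None and parsed["total"][0] > 0


if __name__ == "__main__":
    sample = ("Master Metrics: Mem Storage: 0 B / 98.23 GB (0.0%) | "
              "SSD Storage: 0 B / 0 B | Keys: 0 (soft-pinned: 0) |")
    print(parse_master_metrics(sample))
    print(segment_registered(sample))
```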

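3. Performance evaluation

The evaluation uses the long-doc-qa benchmark from LMCache: a dataset is built and sent to the same service twice, and the time to first token (TTFT) is measured.

Reference: Benchmarking | LMCache

Model: Qwen2.5-7B

Test set characteristics:

10,000 input tokens and 50 output tokens per sample.

20 GB of storage can hold the KV cache for about 370K tokens, which is roughly 35 samples.

Tool: KV cache calculator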

https://docs.lmcache.ai/getting_started/kv_cache_calculator.html
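
The capacity estimate can also be reproduced by hand. The sketch below assumes the Qwen2.5-7B attention layout of 28 layers, 4 KV heads and a head dimension of 128 (the last two match the block_shape in the startup log) with 2-byte bfloat16 values. The available memory figure is taken from the "Available memory" line in the startup log; the function names are hypothetical.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int) -> int:
    """KV cache bytes for one token: a key and a value tensor per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def blocks_that_fit(available_bytes: int, block_size: int,
                    bytes_per_token: int) -> int:
    """Number of whole KV cache blocks that fit in the available memory."""
    return available_bytes // (block_size * bytes_per_token)


if __name__ == "__main__":
    per_token = kv_bytes_per_token(num_layers=28, num_kv_heads=4,
                                   head_dim=128, dtype_bytes=2)
    blocks = blocks_that_fit(21181885132, block_size=128,
                             bytes_per_token=per_token)
    tokens = blocks * 128
    samples = tokens // (10000 + 50)  # input plus output tokens per sample
    print(per_token, blocks, tokens, samples)
```

Under these assumptions the result is 57,344 bytes per token and 2,885 blocks, that is 369,280 tokens, which agrees with the "GPU KV cache size" line in the startup log and gives about 36 samples, close to the estimate of roughly 35.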

3.1 vllm-ascend + Mooncake

TTFT by configuration (rows) and number of samples (columns):

Number of samples	10	20	30	40	50	60	70	80	90	100	110
HBM-20GB	0.4104	0.3742	0.3371	3.0918	3.1023	2.9289	2.8651	2.8182	2.7941	2.7379	2.6827
HBM-40GB	0.4176	0.3368	0.2921	0.3228	0.3171	0.2584	0.2509	2.7229	2.7128	2.6875	2.6654
HBM-20GB+mooncake-20GB	0.5016	0.4471	0.4231	3.6502	3.158	3.1518	3.0134	3.1044	2.9536	2.9289	2.9066
HBM-20GB+mooncake-40GB	0.5126	0.5122	0.4367	0.5718	0.5514	0.5507	3.027	3.1057	2.8675	2.8953	2.8716
HBM-20GB+mooncake-60GB	0.5043	0.4528	0.425	0.5332	0.5146	0.5031	0.5487	0.5292	0.5367	0.5358	2.8285
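
One way to read the table is to find, for each configuration, the first sample count at which TTFT jumps, which marks the point where the cache no longer covers the dataset. The sketch below does this with the values from the table. The cutoff of 1.0 is an assumption chosen because every value in the table is either below 0.6 or above 2.6; the function name is hypothetical.

```python
SAMPLE_COUNTS = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110]

# TTFT values copied from the table above.
TTFT = {
    "HBM-20GB": [0.4104, 0.3742, 0.3371, 3.0918, 3.1023, 2.9289,
                 2.8651, 2.8182, 2.7941, 2.7379, 2.6827],
    "HBM-40GB": [0.4176, 0.3368, 0.2921, 0.3228, 0.3171, 0.2584,
                 0.2509, 2.7229, 2.7128, 2.6875, 2.6654],
    "HBM-20GB+mooncake-20GB": [0.5016, 0.4471, 0.4231, 3.6502, 3.158, 3.1518,
                               3.0134, 3.1044, 2.9536, 2.9289, 2.9066],
    "HBM-20GB+mooncake-40GB": [0.5126, 0.5122, 0.4367, 0.5718, 0.5514, 0.5507,
                               3.027, 3.1057, 2.8675, 2.8953, 2.8716],
    "HBM-20GB+mooncake-60GB": [0.5043, 0.4528, 0.425, 0.5332, 0.5146, 0.5031,
                               0.5487, 0.5292, 0.5367, 0.5358, 2.8285],
}


def first_slow_count(values, counts=SAMPLE_COUNTS, threshold=1.0):
    """Return the first sample count whose TTFT exceeds the threshold."""
    for count, value in zip(counts, values):
        if value > threshold:
            return count
    return None


if __name__ == "__main__":
    for name, values in TTFT.items():
        print(f"{name}: TTFT jumps at {first_slow_count(values)} samples")
```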