Author: Ascend PAE Technical Support Team
Ascend case library introduction: https://agent.blog.youkuaiyun.com/article/details/155446713
Ascend case library sneak preview: https://gitcode.com/invite/link/8791cccc43cb4ee589e8
(If you have questions about this article, please open an issue in the case library; a dedicated engineer will answer.)
vLLM-Ascend is the Ascend NPU backend officially supported by the vLLM community, used to run vLLM on Ascend NPUs. Below are parameter settings commonly used during environment deployment.
1. Feature list
| No. | Key feature |
|---|---|
| 1 | ACL Graph |
| 2 | Quantization |
| 3 | task_queue operator dispatch queue optimization |
| 4 | Jemalloc |
| 5 | HCCL AIV mode |
| 6 | torch_npu virtual memory |
| 7 | Prefix cache |
| 8 | Chunked prefill |
| 9 | Weight NZ |
| 10 | FlashComm |
| 11 | General optimizations for dense models |
| 12 | MLP weight prefetch |
| 13 | Mooncake pooling |
| 14 | Torchair graph mode optimization |
2. How to enable each feature
2.1 ACL Graph mode
Starting from v0.9.1rc1, ACL Graph is enabled by default under the V1 Engine.
Enable the V1 Engine:
export VLLM_USE_V1=1
If you hit problems during execution, you can switch back to eager mode as follows to help narrow down the issue.
Offline:
from vllm import LLM

# enforce_eager=True bypasses ACL Graph and runs in eager mode
model = LLM(model="someother_model_weight", enforce_eager=True)
outputs = model.generate("Hello, how are you?")
Online:
vllm serve Qwen/Qwen2-7B-Instruct --enforce-eager
2.2 Quantization
When using quantized weights, you must specify quantization="ascend"; when using floating-point weights, remove this parameter.
Offline:
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)
llm = LLM(model="{quantized_model_save_path}",
          max_model_len=2048,
          trust_remote_code=True,
          # Enable quantization by specifying `quantization="ascend"`
          quantization="ascend")
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
Online:
vllm serve /home/models/Qwen3-8B-w4a8 --served-model-name "qwen3-8b-w4a8" --max-model-len 4096 --quantization ascend
2.3 task_queue operator dispatch queue optimization
For details, see: https://www.hiascend.com/document/detail/zh/Pytorch/710/comref/Envvariables/Envir_007.html
export TASK_QUEUE_ENABLE=0 # disable the optimization
export TASK_QUEUE_ENABLE=1 # enable level-1 optimization
export TASK_QUEUE_ENABLE=2 # enable level-2 optimization
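For offline use, a minimal sketch is to export the variable from Python before any vLLM import (assumption: the variable is read when torch_npu/vLLM initializes, so setting it first is early enough; the model path is a placeholder):
import os

# Assumption: TASK_QUEUE_ENABLE is read at NPU runtime initialization,
# so set it before importing vLLM. 0 = off, 1 = level 1, 2 = level 2.
os.environ["TASK_QUEUE_ENABLE"] = "2"

from vllm import LLM

llm = LLM(model="/path/to/model")  # placeholder model path
print(llm.generate("Hello, how are you?")[0].outputs[0].text)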
2.4 Jemalloc
For Jemalloc installation, see: https://www.hiascend.com/document/detail/zh/mindie/21RC2/mindieservice/servicedev/mindie_service0381.html
export LD_PRELOAD="/usr/lib/aarch64-linux-gnu/libjemalloc.so.2 $LD_PRELOAD"
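A small sanity check, assuming a Linux host, that jemalloc really was preloaded into the current process:
import os

# Scan the process's memory maps for the preloaded jemalloc library.
# Run this inside the same environment that launches vLLM.
with open(f"/proc/{os.getpid()}/maps") as maps:
    loaded = any("libjemalloc" in line for line in maps)
print("jemalloc preloaded:", loaded)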
2.5 HCCL AIV mode
For details, see: https://www.hiascend.com/document/detail/zh/canncommercial/82RC1/maintenref/envvar/envref_07_0096.html
export HCCL_OP_EXPANSION_MODE="AIV"
2.6 torch_npu virtual memory
For details, see: https://www.hiascend.com/document/detail/zh/Pytorch/710/comref/Envvariables/Envir_012.html
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
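The environment variables from sections 2.3 to 2.6 all need to be in place before the serving process starts. A minimal launch-wrapper sketch (the model path, flag values, and jemalloc path are illustrative, mirroring the examples in this article):
import os
import subprocess

env = os.environ.copy()
env["TASK_QUEUE_ENABLE"] = "2"
env["HCCL_OP_EXPANSION_MODE"] = "AIV"
env["PYTORCH_NPU_ALLOC_CONF"] = "expandable_segments:True"
# Adjust the path to wherever libjemalloc.so.2 is installed.
env["LD_PRELOAD"] = ("/usr/lib/aarch64-linux-gnu/libjemalloc.so.2 "
                     + env.get("LD_PRELOAD", "")).strip()

# Placeholder model path and flags.
subprocess.run(
    ["vllm", "serve", "/home/models/Qwen3-8B-w4a8", "--max-model-len", "4096"],
    env=env,
    check=True,
)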
2.7 prefix cache
Offline:
llm = LLM(model="lmsys/longchat-13b-16k", enable_prefix_caching=True)
Online:
Enabled by default; it can be disabled with --no-enable-prefix-caching.
vllm serve /home/weight/Qwen2.5-32B-Instruct --gpu-memory-utilization 0.9 --tensor-parallel-size 4 --no-enable-prefix-caching # disable prefix cache
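As an illustration of when the cache pays off (the model path is a placeholder): requests that share a long common prefix, such as the same system prompt, let later requests reuse the KV blocks computed for that prefix instead of recomputing them.
from vllm import LLM, SamplingParams

llm = LLM(model="/path/to/model", enable_prefix_caching=True)  # placeholder path

shared_prefix = "You are a helpful assistant. " * 100  # long common prefix
prompts = [
    shared_prefix + "Summarize the plot of Hamlet.",
    shared_prefix + "Summarize the plot of Macbeth.",
]
# The second prompt can hit the cached KV blocks of the shared prefix.
outputs = llm.generate(prompts, SamplingParams(max_tokens=64))
for out in outputs:
    print(out.outputs[0].text)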
2.8 chunked prefill
The v1 scheduler enables it by default; the chunk size can be adjusted via max_num_batched_tokens:
Offline:
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", max_num_batched_tokens=2048) # 超过2048才会切
Online:
vllm serve /home/models/Qwen3-8B-w4a8 --served-model-name "qwen3-8b-w4a8" --max-model-len 4096 --max-num-batched-tokens 2048
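A rough estimate of how many prefill steps a long prompt takes under this budget (a lower bound, since the V1 scheduler may also pack decode tokens into the same per-step budget):
import math

max_num_batched_tokens = 2048
prompt_len = 5000  # example prompt length in tokens
print(math.ceil(prompt_len / max_num_batched_tokens))  # 3 prefill chunks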
2.9 Weight NZ
vllm serve /home/models/Qwen3-8B-w4a8 --served-model-name "qwen3-8b-w4a8" --additional-config '{"enable_weight_nz_layout":true}'
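An offline sketch of the same setting, assuming the Python LLM entry point forwards additional_config to the engine the way --additional-config does on the CLI (true for recent vLLM versions; the model path is reused from the online example):
from vllm import LLM

# Assumption: `additional_config` is accepted as an engine argument; this
# mirrors `--additional-config '{"enable_weight_nz_layout":true}'`.
llm = LLM(
    model="/home/models/Qwen3-8B-w4a8",
    quantization="ascend",
    additional_config={"enable_weight_nz_layout": True},
)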
2.10 General optimizations for dense models
Master switch for dense-model optimizations; use it together with the specific features below:
# Whether to enable dense model and general optimizations for better performance.
# Since we modified the base parent class `linear`, this optimization is also applicable to other model types.
# However, there might be hidden issues, and it is currently recommended to prioritize its use with dense models.
export VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE=1
2.11 FlashComm
Must be used together with tensor parallelism (TP); best suited to high-concurrency scenarios.
# Whether to enable FlashComm optimization when tensor parallel is enabled.
# This feature will get better performance when concurrency is large.
export VLLM_ASCEND_ENABLE_FLASHCOMM=1
2.12 MLP weight prefetch
Suited to low-concurrency scenarios.
# Whether to enable MLP weight prefetch, only used in small concurrency.
export VLLM_ASCEND_ENABLE_PREFETCH_MLP=1
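The three flags in 2.10 to 2.12 target different workloads. A minimal selection sketch, to be run before launching the server from the same process (the concurrency threshold is an illustrative assumption, not an official recommendation):
import os

expected_concurrency = 64  # hypothetical expected number of concurrent requests
tensor_parallel_size = 4   # hypothetical TP degree

# Umbrella switch for the dense-model optimizations.
os.environ["VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE"] = "1"
if expected_concurrency >= 32 and tensor_parallel_size > 1:
    # FlashComm targets large concurrency with TP enabled.
    os.environ["VLLM_ASCEND_ENABLE_FLASHCOMM"] = "1"
else:
    # MLP weight prefetch targets small concurrency.
    os.environ["VLLM_ASCEND_ENABLE_PREFETCH_MLP"] = "1"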
2.13 Mooncake pooling
For usage instructions, see: https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/mooncake_connector_deployment_guide.md
Online:
# Prefill node
vllm serve "/xxxxx/DeepSeek-V2-Lite-Chat" \
--host localhost \
--port 8100 \
--tensor-parallel-size 2 \
--seed 1024 \
--max-model-len 2000 \
--max-num-batched-tokens 2000 \
--trust-remote-code \
--enforce-eager \
--data-parallel-size 2 \
--data-parallel-address localhost \
--data-parallel-rpc-port 9100 \
--gpu-memory-utilization 0.8 \
--kv-transfer-config \
'{"kv_connector": "MooncakeConnectorV1",
"kv_buffer_device": "npu",
"kv_role": "kv_producer",
"kv_parallel_size": 1,
"kv_port": "20001",
"engine_id": "0",
"kv_rank": 0,
"kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 2,
"tp_size": 2
},
"decode": {
"dp_size": 2,
"tp_size": 2
}
}
}'
# Decode node
vllm serve "/xxxxx/DeepSeek-V2-Lite-Chat" \
--host localhost \
--port 8200 \
--tensor-parallel-size 2 \
--seed 1024 \
--max-model-len 2000 \
--max-num-batched-tokens 2000 \
--trust-remote-code \
--enforce-eager \
--data-parallel-size 2 \
--data-parallel-address localhost \
--data-parallel-rpc-port 9100 \
--gpu-memory-utilization 0.8 \
--kv-transfer-config \
'{"kv_connector": "MooncakeConnectorV1",
"kv_buffer_device": "npu",
"kv_role": "kv_consumer",
"kv_parallel_size": 1,
"kv_port": "20002",
"engine_id": "1",
"kv_rank": 1,
"kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 2,
"tp_size": 2
},
"decode": {
"dp_size": 2,
"tp_size": 2
}
}
}'
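A quick liveness check of the two nodes once they are up (in a real disaggregated deployment, inference traffic goes through the proxy described in the deployment guide linked above; this only confirms both servers respond). Ports follow the example commands: 8100 for prefill, 8200 for decode.
import json
import urllib.request

for port in (8100, 8200):
    # vLLM exposes the OpenAI-compatible /v1/models endpoint on each node.
    with urllib.request.urlopen(f"http://localhost:{port}/v1/models") as resp:
        models = json.load(resp)
    print(port, [m["id"] for m in models["data"]])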
2.14 Torchair graph mode optimization
Only the DeepSeek series and PanguProMoE are supported.
Online:
vllm serve /mnt/share/weight/DeepSeek-R1-0528_w8a8_MTP_float \
--port 20002 \
--data-parallel-size 1 \
--tensor-parallel-size 16 \
--enable-expert-parallel \
--seed 1024 \
--served-model-name dsv3 \
--max-model-len 5200 \
--max-num-batched-tokens 2048 \
--max-num-seqs 16 \
--quantization ascend \
--speculative-config '{"num_speculative_tokens":2, "method":"deepseek_mtp"}' \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--additional-config \
'{"torchair_graph_config":{"enabled":true,"enable_multistream_moe":true,"enable_super_kernel":true,"use_cached_graph":true,"enable_multistream_mla":true,"graph_batch_sizes":[16]},"chunked_prefill_for_mla":true,"enable_weight_nz_layout":true}'