Introduction
QwQ-32B, the large model open-sourced by the Qwen team, has become a focus of developer attention thanks to its strong generation capabilities. But how do you deploy it efficiently in production and get the most performance out of it? Based on a real-world setup (a server with 8 NVIDIA L20 GPUs), this article walks through the entire workflow, from environment configuration and vLLM inference-engine deployment to high-concurrency load testing, analyzes the performance bottlenecks and optimizations, and shares first-hand benchmark data.
1. Environment Preparation
1.1 Hardware Configuration
| Server | Qty | CPU | Memory (TB) | OS |
| --- | --- | --- | --- | --- |
| NVIDIA L20 48GB × 8 | 1 | Intel 8458P × 2 | 2 | Ubuntu 20.04 |
1.2 Software Environment
| Software | Version | Notes |
| --- | --- | --- |
| NVIDIA Driver | 550.54.14 | GPU driver |
| CUDA | 12.4 | CUDA toolkit |
| vLLM | 0.7.3 | LLM inference engine |
1.3 Downloading the QwQ-32B Model
- Option 1: Download from the ModelScope community
  Repository: https://modelscope.cn/models/Qwen/QwQ-32B
- Option 2: Download from AI快站, a HuggingFace mirror site
  Repository: https://aifasthub.com/Qwen/QwQ-32B
# Download the AI快站 downloader script
wget https://fast360.xyz/images/hf-fast.sh
chmod a+x hf-fast.sh
# Download the model files
./hf-fast.sh Qwen/QwQ-32B
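Whichever method you use, it is worth checking that the checkpoint is complete before starting the service. The startup log later in this article shows 14 safetensors shards; a quick check (assuming the model ends up in /home/ubuntu/QwQ-32B, the path used when serving it below):
# All 14 weight shards plus the config/tokenizer files should be present
ls -lh /home/ubuntu/QwQ-32B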
1.4 System Initialization
System initialization is an important step to ensure the environment runs stably. The detailed procedure is covered in an earlier article in this series: Production Deployment of the Full DeepSeek 671B on H200, Part 1: System Initialization.
- Install nvitop: nvitop is an interactive NVIDIA GPU monitoring tool written in Python. It displays GPU utilization, memory usage, and per-process details in real time, with a color interface and dynamic charts for an intuitive monitoring experience.
pip install --upgrade pip
pip install nvitop
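Once installed, you can launch the monitor directly, for example with the full-screen layout:
# Start the interactive GPU monitor (full-screen layout)
nvitop -m full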
2. Installing vLLM
2.1 Create a Virtual Environment
To avoid dependency conflicts, create an isolated environment with Conda:
conda create -n qwq python=3.10
conda activate qwq
2.2 Install the Latest vLLM
Upgrade pip, then install vLLM:
pip install --upgrade pip
pip install vllm -i https://mirrors.aliyun.com/pypi/simple/
Verify that the installation succeeded:
pip show vllm
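As an additional sanity check, you can confirm the package imports cleanly and print its version from Python:
# Optional: import vLLM and print the installed version
python -c "import vllm; print(vllm.__version__)"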
3. Running the vLLM Service
3.1 Activate the environment: make sure you are inside the qwq Conda environment created above.
conda activate qwq
3.2 Start the vLLM service: use the vllm serve command to launch the service and load the QwQ-32B model.
conda activate qwq
vllm serve /home/ubuntu/QwQ-32B --tensor-parallel-size 2 --max-model-len 16384 --port 8102 --trust-remote-code --served-model-name qwq-32b --enable-chunked-prefill --max-num-batched-tokens 2048 --gpu-memory-utilization 0.95
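For reference, the key flags used above (descriptions paraphrased; all values are exactly those in the command):
# --tensor-parallel-size 2       shard the model across 2 GPUs
# --max-model-len 16384          maximum context length (prompt + generation) per request
# --served-model-name qwq-32b    model name exposed through the OpenAI-compatible API
# --enable-chunked-prefill       split long prefills into chunks and batch them with decode steps
# --max-num-batched-tokens 2048  token budget per scheduling step
# --gpu-memory-utilization 0.95  fraction of each GPU's memory vLLM may reserve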
3.3 Startup log
(base) ubuntu@node002:~$ conda activate qwq
(qwq) ubuntu@node002:~$ vllm serve /home/ubuntu/QwQ-32B --tensor-parallel-size 2 --max-model-len 16384 --port 8102 --trust-remote-code --served-model-name qwq-32b --enable-chunked-prefill --max-num-batched-tokens 2048 --gpu-memory-utilization 0.95
INFO 03-13 22:12:05 __init__.py:207] Automatically detected platform cuda.
INFO 03-13 22:12:06 api_server.py:912] vLLM API server version 0.7.3
INFO 03-13 22:12:06 api_server.py:913] args: Namespace(subparser='serve', model_tag='/home/ubuntu/QwQ-32B', config='', host=None, port=8102, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, enable_reasoning=False, reasoning_parser=None, tool_call_parser=None, tool_parser_plugin='', model='/home/ubuntu/QwQ-32B', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=16384, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.95, num_gpu_blocks_override=None, max_num_batched_tokens=2048, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=True, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['qwq-32b'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, 
disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, dispatch_function=<function ServeSubcommand.cmd at 0x7fd6eeb7f6d0>)
INFO 03-13 22:12:06 api_server.py:209] Started engine process with PID 42946
INFO 03-13 22:12:09 __init__.py:207] Automatically detected platform cuda.
INFO 03-13 22:12:10 config.py:549] This model supports multiple tasks: {'reward', 'generate', 'score', 'embed', 'classify'}. Defaulting to 'generate'.
INFO 03-13 22:12:10 config.py:1382] Defaulting to use mp for distributed inference
INFO 03-13 22:12:10 config.py:1555] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 03-13 22:12:14 config.py:549] This model supports multiple tasks: {'classify', 'score', 'embed', 'reward', 'generate'}. Defaulting to 'generate'.
INFO 03-13 22:12:14 config.py:1382] Defaulting to use mp for distributed inference
INFO 03-13 22:12:14 config.py:1555] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 03-13 22:12:14 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='/home/ubuntu/QwQ-32B', speculative_config=None, tokenizer='/home/ubuntu/QwQ-32B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=16384, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=qwq-32b, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True,
WARNING 03-13 22:12:15 multiproc_worker_utils.py:300] Reducing Torch parallelism from 88 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 03-13 22:12:15 custom_cache_manager.py:19] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
INFO 03-13 22:12:15 cuda.py:229] Using Flash Attention backend.
INFO 03-13 22:12:18 __init__.py:207] Automatically detected platform cuda.
(VllmWorkerProcess pid=43249) INFO 03-13 22:12:18 multiproc_worker_utils.py:229] Worker ready; awaiting tasks
(VllmWorkerProcess pid=43249) INFO 03-13 22:12:19 cuda.py:229] Using Flash Attention backend.
INFO 03-13 22:12:20 utils.py:916] Found nccl from library libnccl.so.2
INFO 03-13 22:12:20 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=43249) INFO 03-13 22:12:20 utils.py:916] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=43249) INFO 03-13 22:12:20 pynccl.py:69] vLLM is using nccl==2.21.5
INFO 03-13 22:12:20 custom_all_reduce_utils.py:244] reading GPU P2P access cache from /home/ubuntu/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorkerProcess pid=43249) INFO 03-13 22:12:20 custom_all_reduce_utils.py:244] reading GPU P2P access cache from /home/ubuntu/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
INFO 03-13 22:12:20 shm_broadcast.py:258] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_5916f4e6'), local_subscribe_port=40533, remote_subscribe_port=None)
INFO 03-13 22:12:20 model_runner.py:1110] Starting to load model /home/ubuntu/QwQ-32B...
(VllmWorkerProcess pid=43249) INFO 03-13 22:12:20 model_runner.py:1110] Starting to load model /home/ubuntu/QwQ-32B...
Loading safetensors checkpoint shards: 0% Completed | 0/14 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 7% Completed | 1/14 [00:00<00:08, 1.48it/s]
Loading safetensors checkpoint shards: 14% Completed | 2/14 [00:01<00:08, 1.45it/s]
Loading safetensors checkpoint shards: 21% Completed | 3/14 [00:02<00:07, 1.44it/s]
Loading safetensors checkpoint shards: 29% Completed | 4/14 [00:02<00:06, 1.46it/s]
Loading safetensors checkpoint shards: 36% Completed | 5/14 [00:03<00:05, 1.60it/s]
Loading safetensors checkpoint shards: 43% Completed | 6/14 [00:03<00:05, 1.58it/s]
Loading safetensors checkpoint shards: 50% Completed | 7/14 [00:04<00:04, 1.52it/s]
Loading safetensors checkpoint shards: 57% Completed | 8/14 [00:05<00:03, 1.53it/s]
Loading safetensors checkpoint shards: 64% Completed | 9/14 [00:05<00:03, 1.53it/s]
Loading safetensors checkpoint shards: 71% Completed | 10/14 [00:06<00:02, 1.49it/s]
Loading safetensors checkpoint shards: 79% Completed | 11/14 [00:07<00:02, 1.49it/s]
Loading safetensors checkpoint shards: 86% Completed | 12/14 [00:07<00:01, 1.51it/s]
Loading safetensors checkpoint shards: 93% Completed | 13/14 [00:08<00:00, 1.87it/s]
Loading safetensors checkpoint shards: 100% Completed | 14/14 [00:08<00:00, 1.78it/s]
Loading safetensors checkpoint shards: 100% Completed | 14/14 [00:08<00:00, 1.59it/s]
(VllmWorkerProcess pid=43249) INFO 03-13 22:12:29 model_runner.py:1115] Loading model weights took 30.7117 GB
INFO 03-13 22:12:29 model_runner.py:1115] Loading model weights took 30.7117 GB
(VllmWorkerProcess pid=43249) INFO 03-13 22:12:32 worker.py:267] Memory profiling takes 2.31 seconds
(VllmWorkerProcess pid=43249) INFO 03-13 22:12:32 worker.py:267] the current vLLM instance can use total_gpu_memory (44.52GiB) x gpu_memory_utilization (0.95) = 42.29GiB
(VllmWorkerProcess pid=43249) INFO 03-13 22:12:32 worker.py:267] model weights take 30.71GiB; non_torch_memory takes 0.33GiB; PyTorch activation peak memory takes 0.26GiB; the rest of the memory reserved for KV Cache is 10.99GiB.
INFO 03-13 22:12:32 worker.py:267] Memory profiling takes 2.40 seconds
INFO 03-13 22:12:32 worker.py:267] the current vLLM instance can use total_gpu_memory (44.52GiB) x gpu_memory_utilization (0.95) = 42.29GiB
INFO 03-13 22:12:32 worker.py:267] model weights take 30.71GiB; non_torch_memory takes 0.35GiB; PyTorch activation peak memory takes 1.41GiB; the rest of the memory reserved for KV Cache is 9.83GiB.
INFO 03-13 22:12:32 executor_base.py:111] # cuda blocks: 5031, # CPU blocks: 2048
INFO 03-13 22:12:32 executor_base.py:116] Maximum concurrency for 16384 tokens per request: 4.91x
INFO 03-13 22:12:33 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 0%| | 0/35 [00:00<?, ?it/s](VllmWorkerProcess pid=43249) INFO 03-13 22:12:33 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:16<00:00, 2.17it/s]
INFO 03-13 22:12:49 custom_all_reduce.py:226] Registering 4515 cuda graph addresses
(VllmWorkerProcess pid=43249) INFO 03-13 22:12:50 custom_all_reduce.py:226] Registering 4515 cuda graph addresses
(VllmWorkerProcess pid=43249) INFO 03-13 22:12:50 model_runner.py:1562] Graph capturing finished in 17 secs, took 0.35 GiB
INFO 03-13 22:12:50 model_runner.py:1562] Graph capturing finished in 17 secs, took 0.35 GiB
INFO 03-13 22:12:50 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 21.36 seconds
INFO 03-13 22:12:51 api_server.py:958] Starting vLLM API server on http://0.0.0.0:8102
INFO 03-13 22:12:51 launcher.py:23] Available routes are:
INFO 03-13 22:12:51 launcher.py:31] Route: /openapi.json, Methods: HEAD, GET
INFO 03-13 22:12:51 launcher.py:31] Route: /docs, Methods: HEAD, GET
INFO 03-13 22:12:51 launcher.py:31] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 03-13 22:12:51 launcher.py:31] Route: /redoc, Methods: HEAD, GET
INFO 03-13 22:12:51 launcher.py:31] Route: /health, Methods: GET
INFO 03-13 22:12:51 launcher.py:31] Route: /ping, Methods: POST, GET
INFO 03-13 22:12:51 launcher.py:31] Route: /tokenize, Methods: POST
INFO 03-13 22:12:51 launcher.py:31] Route: /detokenize, Methods: POST
INFO 03-13 22:12:51 launcher.py:31] Route: /v1/models, Methods: GET
INFO 03-13 22:12:51 launcher.py:31] Route: /version, Methods: GET
INFO 03-13 22:12:51 launcher.py:31] Route: /v1/chat/completions, Methods: POST
INFO 03-13 22:12:51 launcher.py:31] Route: /v1/completions, Methods: POST
INFO 03-13 22:12:51 launcher.py:31] Route: /v1/embeddings, Methods: POST
INFO 03-13 22:12:51 launcher.py:31] Route: /pooling, Methods: POST
INFO 03-13 22:12:51 launcher.py:31] Route: /score, Methods: POST
INFO 03-13 22:12:51 launcher.py:31] Route: /v1/score, Methods: POST
INFO 03-13 22:12:51 launcher.py:31] Route: /v1/audio/transcriptions, Methods: POST
INFO 03-13 22:12:51 launcher.py:31] Route: /rerank, Methods: POST
INFO 03-13 22:12:51 launcher.py:31] Route: /v1/rerank, Methods: POST
INFO 03-13 22:12:51 launcher.py:31] Route: /v2/rerank, Methods: POST
INFO 03-13 22:12:51 launcher.py:31] Route: /invocations, Methods: POST
INFO: Started server process [42868]
INFO: Waiting for application startup.
INFO: Application startup complete.
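Once the log shows "Application startup complete.", the routes listed above are live. A quick sanity check against the /v1/models route (one of the routes printed in the log) should return the served model name qwq-32b:
# List the models exposed by this vLLM instance
curl http://localhost:8102/v1/models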
4. Verifying Service Availability
Send a test request to the API to confirm the service is working:
curl -X POST "http://localhost:8102/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"model": "qwq-32b",
"messages": [{"role": "user", "content": "写个100字的散文"}]
}'
Expected result: the response contains model-generated text, which indicates the deployment succeeded.
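For interactive clients you will usually want token-by-token output. The same endpoint accepts the standard OpenAI "stream" parameter, so a streaming variant of the request above looks like this:
curl -X POST "http://localhost:8102/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"model": "qwq-32b",
"messages": [{"role": "user", "content": "写个100字的散文"}],
"stream": true
}'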
5. Using the Model in Open WebUI
Using QwQ-32B in Open WebUI is straightforward: open "Admin Panel → Settings → External Connections" from the menu in the top-right corner and add the model endpoint.
After refreshing and seeing the success message, you can use the QwQ-32B model directly in Open WebUI.
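For reference, the external connection only needs the OpenAI-compatible base URL and a model name; the values below are an example (replace <server-ip> with your server's address; since the service was started without --api-key, the key is not validated and any non-empty value works):
# Open WebUI → Admin Panel → Settings → External Connections (OpenAI API)
# API Base URL: http://<server-ip>:8102/v1
# API Key:      any-non-empty-string (placeholder)
# Model:        qwq-32b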
6. Load Testing
6.1 Installing EvalScope
To run the load tests, first install the EvalScope tool:
# Install EvalScope
pip install 'evalscope[app,perf]' -U -i https://mirrors.aliyun.com/pypi/simple/
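You can quickly confirm the CLI is available by printing the help of the perf subcommand used below:
# Verify the evalscope CLI is installed
evalscope perf --help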
6.2 Benchmark Command
Run the load test with the following command:
evalscope perf --parallel 256 --url http://127.0.0.1:8102/v1/chat/completions --model qwq-32b --log-every-n-query 10 --connect-timeout 600 --read-timeout 600 --api openai --prompt '写一个科幻小说,不少于2000字' -n 256
6.3 Load Test on 2× L20
Each QwQ-32B instance is loaded across 2 L20 GPUs, so we first measure the performance of a single 2-GPU instance.
Benchmark conclusions:
- At 256 concurrent requests, throughput is 682.65 tokens/s
- Only 0.336 requests are completed per second, indicating that request-level parallel efficiency is low at this concurrency (256)
System load during the test:
Benchmark results:
6.4 Whole-Machine L20 Load Test
6.4.1 Running vLLM on Multiple Ports
To test the whole machine, we start multiple vLLM processes on different ports and put Nginx in front of them for load balancing.
Start one vLLM process per GPU pair, using the CUDA_VISIBLE_DEVICES environment variable to pin each instance to its GPUs (each vllm serve command runs in the foreground, so run each in its own terminal or launch them in the background as in the sketch after the commands):
conda activate qwq
export CUDA_VISIBLE_DEVICES=0,1
vllm serve /home/ubuntu/QwQ-32B --tensor-parallel-size 2 --max-model-len 16384 --port 8102 --trust-remote-code --served-model-name qwq-32b --enable-chunked-prefill --max-num-batched-tokens 2048 --gpu-memory-utilization 0.95
export CUDA_VISIBLE_DEVICES=2,3
vllm serve /home/ubuntu/QwQ-32B --tensor-parallel-size 2 --max-model-len 16384 --port 8104 --trust-remote-code --served-model-name qwq-32b --enable-chunked-prefill --max-num-batched-tokens 2048 --gpu-memory-utilization 0.95
export CUDA_VISIBLE_DEVICES=4,5
vllm serve /home/ubuntu/QwQ-32B --tensor-parallel-size 2 --max-model-len 16384 --port 8106 --trust-remote-code --served-model-name qwq-32b --enable-chunked-prefill --max-num-batched-tokens 2048 --gpu-memory-utilization 0.95
export CUDA_VISIBLE_DEVICES=6,7
vllm serve /home/ubuntu/QwQ-32B --tensor-parallel-size 2 --max-model-len 16384 --port 8108 --trust-remote-code --served-model-name qwq-32b --enable-chunked-prefill --max-num-batched-tokens 2048 --gpu-memory-utilization 0.95
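A minimal background-launch sketch for the four commands above (the log-file names are placeholders):
# Launch four 2-GPU instances in the background, one per port
for i in 0 1 2 3; do
  port=$((8102 + 2 * i))
  gpus="$((2 * i)),$((2 * i + 1))"
  CUDA_VISIBLE_DEVICES=$gpus nohup vllm serve /home/ubuntu/QwQ-32B \
    --tensor-parallel-size 2 --max-model-len 16384 --port $port \
    --trust-remote-code --served-model-name qwq-32b \
    --enable-chunked-prefill --max-num-batched-tokens 2048 \
    --gpu-memory-utilization 0.95 > vllm_$port.log 2>&1 &
done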
6.4.2 Installing Nginx
sudo apt install nginx
Edit /etc/nginx/sites-enabled/proxy.conf and add the following configuration:
# /etc/nginx/sites-enabled/proxy.conf
# Upstream block listing the backend servers to forward to
upstream backend_servers {
    # Round-robin (the default strategy) across the four vLLM instances
    server 127.0.0.1:8102;
    server 127.0.0.1:8104;
    server 127.0.0.1:8106;
    server 127.0.0.1:8108;
}

server {
    listen 8080;
    server_name _;

    location / {
        # Forward requests to the upstream backend group
        proxy_pass http://backend_servers;
        # Pass through the original Host header
        proxy_set_header Host $host;
        # Pass the client's real IP address
        proxy_set_header X-Real-IP $remote_addr;
        # Preserve the full forwarding chain
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }

    # Error log
    error_log /var/log/nginx/proxy_error.log;
    # Access log
    access_log /var/log/nginx/proxy_access.log;
}
Then tune the main Nginx configuration with sudo vim /etc/nginx/nginx.conf:
user www-data;
worker_processes auto;
worker_rlimit_nofile 100000;
#pid /run/nginx.pid;
pid logs/nginx.pid;

events {
    use epoll;
    worker_connections 2048;
    multi_accept on;
}
- Check the configuration syntax and restart the Nginx service:
sudo nginx -t
sudo systemctl restart nginx
After this, every request sent to port 8080 is forwarded round-robin to ports 8102, 8104, 8106, and 8108.
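To confirm the load balancer works and to drive the multi-level benchmark in the next section, point the same EvalScope command at port 8080. A minimal sketch, assuming the same prompt and timeouts as in section 6.2, with -n set equal to the concurrency level (mirroring the 256/256 pairing used earlier):
# The balanced endpoint should return the served model name
curl http://localhost:8080/v1/models

# Sweep the concurrency levels reported in section 6.4.3
for p in 128 256 512 768; do
  evalscope perf --parallel "$p" -n "$p" \
    --url http://127.0.0.1:8080/v1/chat/completions \
    --model qwq-32b --api openai \
    --connect-timeout 600 --read-timeout 600 --log-every-n-query 10 \
    --prompt '写一个科幻小说,不少于2000字'
done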
6.4.3 Benchmark Results
Benchmark conclusions
- Throughput rises from 1492 tokens/s at 128 concurrency to 2660 tokens/s at 768 concurrency, an increase of about 78%
- Going from 128 to 256 concurrent requests improves throughput by 45%, while going from 512 to 768 adds only about 5%, so the hardware is approaching its performance ceiling.
Detailed benchmark data
Screenshot at 768 concurrent requests
System load during the test:
Summary
This article walked through the full deployment and testing of the QwQ-32B model, from environment preparation to load testing, with clear instructions and real examples at every step. Following these steps, you can deploy and test QwQ-32B in your own environment and use the benchmark data to build a deeper understanding of its performance. We hope it serves as a useful reference for applying and optimizing AI models.