(vllm) zhzx@zhzx-S2600WF-LS:/media/zhzx/ssd2/Qwen3-32B-AWQ$ CUDA_VISIBLE_DEVICES=0,1 python -m vllm.entrypoints.api_server \
--model /media/zhzx/ssd2/Qwen3-32B-AWQ \
--tensor-parallel-size 2 \
--quantization awq \
--trust-remote-code
INFO 05-12 11:01:48 [__init__.py:239] Automatically detected platform cuda.
INFO 05-12 11:01:50 [api_server.py:121] vLLM API server version 0.8.5.post1
INFO 05-12 11:01:50 [api_server.py:122] args: Namespace(host=None, port=8000, ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, log_level='debug', model='/media/zhzx/ssd2/Qwen3-32B-AWQ', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, load_format='auto', download_dir=None, model_loader_extra_config={}, use_tqdm_on_load=True, config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', max_model_len=None, guided_decoding_backend='auto', reasoning_parser=None, logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=2, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, gpu_memory_utilization=0.9, swap_space=4, kv_cache_dtype='auto', num_gpu_blocks_override=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', cpu_offload_gb=0, calculate_kv_scales=False, disable_sliding_window=False, use_v2_block_manager=True, seed=None, max_logprobs=20, disable_log_stats=False, quantization='awq', rope_scaling=None, rope_theta=None, hf_token=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config={}, limit_mm_per_prompt={}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=None, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=None, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', speculative_config=None, ignore_patterns=[], served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, max_num_batched_tokens=None, max_num_seqs=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, num_lookahead_slots=0, scheduler_delay_factor=0.0, preemption_mode=None, num_scheduler_steps=1, multi_step_stream_outputs=True, scheduling_policy='fcfs', enable_chunked_prefill=None, disable_chunked_mm_input=False, scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, additional_config=None, enable_reasoning=False, disable_cascade_attn=False, disable_log_requests=False)
INFO 05-12 11:01:56 [config.py:717] This model supports multiple tasks: {'score', 'reward', 'classify', 'embed', 'generate'}. Defaulting to 'generate'.
WARNING 05-12 11:01:57 [config.py:830] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
WARNING 05-12 11:01:57 [arg_utils.py:1658] Compute Capability < 8.0 is not supported by the V1 Engine. Falling back to V0.
WARNING 05-12 11:01:57 [arg_utils.py:1525] Chunked prefill is enabled by default for models with max_model_len > 32K. Chunked prefill might not work with some features or models. If you encounter any issues, please disable by launching with --enable-chunked-prefill=False.
INFO 05-12 11:01:57 [config.py:1770] Defaulting to use mp for distributed inference
INFO 05-12 11:01:57 [config.py:2003] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 05-12 11:01:57 [llm_engine.py:240] Initializing a V0 LLM engine (v0.8.5.post1) with config: model='/media/zhzx/ssd2/Qwen3-32B-AWQ', speculative_config=None, tokenizer='/media/zhzx/ssd2/Qwen3-32B-AWQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=40960, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=awq, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/media/zhzx/ssd2/Qwen3-32B-AWQ, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
WARNING 05-12 11:01:57 [multiproc_worker_utils.py:306] Reducing Torch parallelism from 16 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(VllmWorkerProcess pid=100643) INFO 05-12 11:01:57 [multiproc_worker_utils.py:225] Worker ready; awaiting tasks
INFO 05-12 11:01:57 [cuda.py:240] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 05-12 11:01:57 [cuda.py:289] Using XFormers backend.
(VllmWorkerProcess pid=100643) INFO 05-12 11:01:57 [cuda.py:240] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(VllmWorkerProcess pid=100643) INFO 05-12 11:01:57 [cuda.py:289] Using XFormers backend.
INFO 05-12 11:01:58 [utils.py:1055] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=100643) INFO 05-12 11:01:58 [utils.py:1055] Found nccl from library libnccl.so.2
INFO 05-12 11:01:58 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=100643) INFO 05-12 11:01:58 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=100643) INFO 05-12 11:01:58 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /home/zhzx/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
INFO 05-12 11:01:58 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /home/zhzx/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
INFO 05-12 11:01:58 [shm_broadcast.py:266] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_91501777'), local_subscribe_addr='ipc:///tmp/c7c55d66-4bbe-451b-a9f3-3133e26fdb9b', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorkerProcess pid=100643) INFO 05-12 11:01:58 [parallel_state.py:1004] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1
INFO 05-12 11:01:58 [parallel_state.py:1004] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 05-12 11:01:58 [model_runner.py:1108] Starting to load model /media/zhzx/ssd2/Qwen3-32B-AWQ...
(VllmWorkerProcess pid=100643) INFO 05-12 11:01:58 [model_runner.py:1108] Starting to load model /media/zhzx/ssd2/Qwen3-32B-AWQ...
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:01<00:04, 1.51s/it]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:02<00:02, 1.36s/it]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:04<00:01, 1.52s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:05<00:00, 1.41s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:05<00:00, 1.43s/it]
(VllmWorkerProcess pid=100643) INFO 05-12 11:02:04 [loader.py:458] Loading weights took 5.72 seconds
INFO 05-12 11:02:04 [loader.py:458] Loading weights took 5.79 seconds
INFO 05-12 11:02:05 [model_runner.py:1140] Model loading took 9.0568 GiB and 5.991388 seconds
(VllmWorkerProcess pid=100643) INFO 05-12 11:02:05 [model_runner.py:1140] Model loading took 9.0568 GiB and 5.932729 seconds
(VllmWorkerProcess pid=100643) INFO 05-12 11:02:10 [worker.py:287] Memory profiling takes 5.32 seconds
(VllmWorkerProcess pid=100643) INFO 05-12 11:02:10 [worker.py:287] the current vLLM instance can use total_gpu_memory (23.64GiB) x gpu_memory_utilization (0.90) = 21.27GiB
(VllmWorkerProcess pid=100643) INFO 05-12 11:02:10 [worker.py:287] model weights take 9.06GiB; non_torch_memory takes 0.13GiB; PyTorch activation peak memory takes 0.41GiB; the rest of the memory reserved for KV Cache is 11.68GiB.
INFO 05-12 11:02:10 [worker.py:287] Memory profiling takes 5.46 seconds
INFO 05-12 11:02:10 [worker.py:287] the current vLLM instance can use total_gpu_memory (23.64GiB) x gpu_memory_utilization (0.90) = 21.27GiB
INFO 05-12 11:02:10 [worker.py:287] model weights take 9.06GiB; non_torch_memory takes 0.13GiB; PyTorch activation peak memory takes 1.40GiB; the rest of the memory reserved for KV Cache is 10.68GiB.
INFO 05-12 11:02:11 [executor_base.py:112] # cuda blocks: 5469, # CPU blocks: 2048
INFO 05-12 11:02:11 [executor_base.py:117] Maximum concurrency for 40960 tokens per request: 2.14x
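(For reference, the 2.14x figure follows from the block count above: assuming vLLM's default block_size of 16, 5469 GPU blocks × 16 tokens/block ≈ 87,504 cacheable KV tokens, and 87,504 / 40,960 ≈ 2.14 concurrent full-length requests.)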
INFO 05-12 11:02:13 [model_runner.py:1450] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes:   0%|          | 0/35 [00:00<?, ?it/s]
(VllmWorkerProcess pid=100643) INFO 05-12 11:02:13 [model_runner.py:1450] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:29<00:00, 1.18it/s]
INFO 05-12 11:02:43 [custom_all_reduce.py:195] Registering 4515 cuda graph addresses
(VllmWorkerProcess pid=100643) INFO 05-12 11:02:43 [custom_all_reduce.py:195] Registering 4515 cuda graph addresses
(VllmWorkerProcess pid=100643) INFO 05-12 11:02:43 [model_runner.py:1592] Graph capturing finished in 30 secs, took 0.97 GiB
INFO 05-12 11:02:43 [model_runner.py:1592] Graph capturing finished in 30 secs, took 0.97 GiB
INFO 05-12 11:02:43 [llm_engine.py:437] init engine (profile, create kv cache, warmup model) took 38.52 seconds
INFO 05-12 11:02:43 [launcher.py:28] Available routes are:
INFO 05-12 11:02:43 [launcher.py:36] Route: /openapi.json, Methods: GET, HEAD
INFO 05-12 11:02:43 [launcher.py:36] Route: /docs, Methods: GET, HEAD
INFO 05-12 11:02:43 [launcher.py:36] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 05-12 11:02:43 [launcher.py:36] Route: /redoc, Methods: GET, HEAD
INFO 05-12 11:02:43 [launcher.py:36] Route: /health, Methods: GET
INFO 05-12 11:02:43 [launcher.py:36] Route: /generate, Methods: POST
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/zhzx/miniconda3/envs/vllm/lib/python3.12/site-packages/starlette/datastructures.py", line 668, in __getattr__
[rank0]: return self._state[key]
[rank0]: ~~~~~~~~~~~^^^^^
[rank0]: KeyError: 'engine_client'
[rank0]: During handling of the above exception, another exception occurred:
[rank0]: Traceback (most recent call last):
[rank0]: File "<frozen runpy>", line 198, in _run_module_as_main
[rank0]: File "<frozen runpy>", line 88, in _run_code
[rank0]: File "/home/zhzx/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/entrypoints/api_server.py", line 177, in <module>
[rank0]: asyncio.run(run_server(args))
[rank0]: File "/home/zhzx/miniconda3/envs/vllm/lib/python3.12/asyncio/runners.py", line 195, in run
[rank0]: return runner.run(main)
[rank0]: ^^^^^^^^^^^^^^^^
[rank0]: File "/home/zhzx/miniconda3/envs/vllm/lib/python3.12/asyncio/runners.py", line 118, in run
[rank0]: return self._loop.run_until_complete(task)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/zhzx/miniconda3/envs/vllm/lib/python3.12/asyncio/base_events.py", line 691, in run_until_complete
[rank0]: return future.result()
[rank0]: ^^^^^^^^^^^^^^^
[rank0]: File "/home/zhzx/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/entrypoints/api_server.py", line 129, in run_server
[rank0]: shutdown_task = await serve_http(
[rank0]: ^^^^^^^^^^^^^^^^^
[rank0]: File "/home/zhzx/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/entrypoints/launcher.py", line 46, in serve_http
[rank0]: watchdog_loop(server, app.state.engine_client))
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/zhzx/miniconda3/envs/vllm/lib/python3.12/site-packages/starlette/datastructures.py", line 671, in __getattr__
[rank0]: raise AttributeError(message.format(self.__class__.__name__, key))
[rank0]: AttributeError: 'State' object has no attribute 'engine_client'
[rank0]:[W512 11:02:45.924296727 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
/home/zhzx/miniconda3/envs/vllm/lib/python3.12/multiprocessing/resource_tracker.py:255: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
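The AttributeError above is why the process exits right after printing its routes: in this build (v0.8.5.post1) the HTTP launcher's watchdog reads app.state.engine_client, and the legacy demo entrypoint vllm.entrypoints.api_server evidently does not set it (hence the underlying KeyError). A common workaround is to launch the OpenAI-compatible server instead. A minimal sketch, assuming the same model path and flags (port 8000 is the default already shown in the args above):

# workaround sketch: OpenAI-compatible entrypoint instead of the demo api_server
CUDA_VISIBLE_DEVICES=0,1 python -m vllm.entrypoints.openai.api_server \
    --model /media/zhzx/ssd2/Qwen3-32B-AWQ \
    --tensor-parallel-size 2 \
    --quantization awq \
    --trust-remote-code \
    --port 8000

Once that server is up, a quick smoke test against the standard chat endpoint (the "model" field must match the served model name, which per the config line above defaults to the model path):

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "/media/zhzx/ssd2/Qwen3-32B-AWQ", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 64}'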