Running `CUDA_VISIBLE_DEVICES=0,2 dbgpt start webserver --config /app/configs/dbgpt-local-vllm.toml` fails with the error below:
=========================== VLLMDeployModelParameters ===========================
name: DeepSeek-R1-Distill-Qwen-32B
provider: vllm
verbose: False
concurrency: 100
backend: None
prompt_template: None
context_length: None
reasoning_model: None
path: models/DeepSeek-R1-Distill-Qwen-32B
device: auto
trust_remote_code: True
download_dir: None
load_format: auto
config_format: auto
dtype: auto
kv_cache_dtype: auto
seed: 0
max_model_len: None
distributed_executor_backend: None
pipeline_parallel_size: 1
tensor_parallel_size: 1
max_parallel_loading_workers: None
block_size: None
enable_prefix_caching: None
swap_space: 4.0
cpu_offload_gb: 0.0
gpu_memory_utilization: 0.9
max_num_batched_tokens: None
max_num_seqs: 2
max_logprobs: 20
revision: None
code_revision: None
tokenizer_revision: None
tokenizer_mode: auto
quantization: fp8
max_seq_len_to_capture: 8192
worker_cls: auto
extras: None
======================================================================
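For reference, the parameters dumped above come from /app/configs/dbgpt-local-vllm.toml. A minimal sketch of the relevant model block, assuming the usual DB-GPT `[[models.llms]]` layout (the key names mirror the VLLMDeployModelParameters dump; the actual file contents are not shown in this report):

```toml
[[models.llms]]
name = "DeepSeek-R1-Distill-Qwen-32B"
provider = "vllm"
path = "models/DeepSeek-R1-Distill-Qwen-32B"
trust_remote_code = true
quantization = "fp8"
max_num_seqs = 2
gpu_memory_utilization = 0.9
# tensor_parallel_size is 1 (the default), so vLLM loads the whole 32B model
# onto a single GPU even though CUDA_VISIBLE_DEVICES exposes two devices.
tensor_parallel_size = 1
# context_length is unset (None above), so vLLM falls back to the model's
# native max_model_len of 131072 tokens, as the engine log below confirms.
```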
2025-08-05 07:43:32 1249bbe41ac7 dbgpt.util.code.server[4248] INFO Code server is ready
INFO 08-05 07:43:36 config.py:520] This model supports multiple tasks: {'score', 'classify', 'generate', 'reward', 'embed'}. Defaulting to 'generate'.
WARNING 08-05 07:43:36 arg_utils.py:1107] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
INFO 08-05 07:43:36 config.py:1483] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 08-05 07:43:37 llm_engine.py:232] Initializing an LLM engine (v0.7.0) with config: model='/app/models/DeepSeek-R1-Distill-Qwen-32B', speculative_config=None, tokenizer='/app/models/DeepSeek-R1-Distill-Qwen-32B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/app/models/DeepSeek-R1-Distill-Qwen-32B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[2,1],"max_capture_size":2}, use_cached_outputs=False,
INFO 08-05 07:43:37 cuda.py:225] Using Flash Attention backend.
INFO 08-05 07:43:37 model_runner.py:1110] Starting to load model /app/models/DeepSeek-R1-Distill-Qwen-32B...
/app/packages/dbgpt-core/src/dbgpt/util/model_utils.py:27: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()'
if (hasattr(backends, "mps") and backends.mps.is_built()) or torch.has_mps:
2025-08-05 07:43:38 1249bbe41ac7 dbgpt.util.model_utils[4248] INFO Clear torch cache of device: cuda:0
2025-08-05 07:43:38 1249bbe41ac7 dbgpt.util.model_utils[4248] INFO Clear torch cache of device: cuda:1
2025-08-05 07:43:38 1249bbe41ac7 dbgpt.model.cluster.worker.embedding_worker[4248] INFO Load embeddings model: bge-large-zh-v1.5
2025-08-05 07:43:38 1249bbe41ac7 datasets[4248] INFO PyTorch version 2.5.1+cu121 available.
2025-08-05 07:43:38 1249bbe41ac7 datasets[4248] INFO Duckdb version 1.2.0 available.
2025-08-05 07:43:38 1249bbe41ac7 sentence_transformers.SentenceTransformer[4248] INFO Use pytorch device_name: cuda
2025-08-05 07:43:38 1249bbe41ac7 sentence_transformers.SentenceTransformer[4248] INFO Load pretrained SentenceTransformer: /app/models/bge-large-zh-v1.5
INFO: 127.0.0.1:42180 - "POST /api/controller/models HTTP/1.1" 200 OK
2025-08-05 07:43:39 1249bbe41ac7 dbgpt.model.cluster.worker.manager[4248] ERROR Error starting worker manager: model DeepSeek-R1-Distill-Qwen-32B@vllm(172.17.0.2:8002) start failed, Traceback (most recent call last):
File "/app/packages/dbgpt-core/src/dbgpt/model/cluster/worker/manager.py", line 631, in _start_worker
await self.run_blocking_func(
File "/app/packages/dbgpt-core/src/dbgpt/model/cluster/worker/manager.py", line 146, in run_blocking_func
return await loop.run_in_executor(self.executor, func, *args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/packages/dbgpt-core/src/dbgpt/model/cluster/worker/default_worker.py", line 122, in start
self.model, self.tokenizer = self.ml.loader_with_params(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/packages/dbgpt-core/src/dbgpt/model/adapter/loader.py", line 70, in loader_with_params
return llm_adapter.load_from_params(model_params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/packages/dbgpt-core/src/dbgpt/model/adapter/vllm_adapter.py", line 488, in load_from_params
engine = AsyncLLMEngine.from_engine_args(engine_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 642, in from_engine_args
engine = cls(
^^^^
File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 592, in __init__
self.engine = self._engine_class(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 265, in __init__
super().__init__(*args, **kwargs)
File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 271, in __init__
self.model_executor = executor_class(vllm_config=vllm_config, )
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 49, in __init__
self._init_executor()
File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/executor/uniproc_executor.py", line 40, in _init_executor
self.collective_rpc("load_model")
File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/executor/uniproc_executor.py", line 49, in collective_rpc
answer = run_method(self.driver_worker, method, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/utils.py", line 2208, in run_method
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/worker/worker.py", line 182, in load_model
self.model_runner.load_model()
File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1112, in load_model
self.model = get_model(vllm_config=self.vllm_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/model_executor/model_loader/__init__.py", line 12, in get_model
return loader.load_model(vllm_config=vllm_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/model_executor/model_loader/loader.py", line 376, in load_model
model = _initialize_model(vllm_config=vllm_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/model_executor/model_loader/loader.py", line 118, in _initialize_model
return model_class(vllm_config=vllm_config, prefix=prefix)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 451, in __init__
self.model = Qwen2Model(vllm_config=vllm_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/compilation/decorators.py", line 149, in __init__
old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 305, in __init__
self.start_layer, self.end_layer, self.layers = make_layers(
^^^^^^^^^^^^
File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/model_executor/models/utils.py", line 555, in make_layers
[PPMissingLayer() for _ in range(start_layer)] + [
^
File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/model_executor/models/utils.py", line 556, in <listcomp>
maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 307, in <lambda>
lambda prefix: Qwen2DecoderLayer(config=config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 206, in __init__
self.self_attn = Qwen2Attention(
^^^^^^^^^^^^^^^
File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 134, in __init__
self.qkv_proj = QKVParallelLinear(
^^^^^^^^^^^^^^^^^^
File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 728, in __init__
super().__init__(input_size=input_size,
File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 311, in __init__
self.quant_method.create_weights(
File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/fp8.py", line 199, in create_weights
weight = ModelWeightParameter(data=torch.empty(
^^^^^^^^^^^^
File "//opt/.uv.venv/lib/python3.11/site-packages/torch/utils/_device.py", line 106, in __torch_function__
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 70.00 MiB. GPU 0 has a total capacity of 44.53 GiB of which 15.94 MiB is free. Process 1334940 has 44.51 GiB memory in use. Of the allocated memory 44.17 GiB is allocated by PyTorch, and 14.30 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
;model bge-large-zh-v1.5@hf(172.17.0.2:8002) start successfully
INFO: Shutting down
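For what it's worth, the OOM itself is explicit in the log: GPU 0 has 44.53 GiB total and process 1334940 already holds 44.51 GiB, so vLLM cannot allocate even the 70 MiB fp8 weight tensor. A quick way to confirm what is occupying the device, plus a retry with the allocator hint the error message itself suggests (a sketch; note that expandable segments only mitigates fragmentation, not a GPU already filled by another process):

```bash
# See which processes hold memory on each GPU (process 1334940 should show up)
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv

# Retry with the fragmentation workaround quoted in the OOM message;
# if GPU 0 is genuinely occupied, point CUDA_VISIBLE_DEVICES at idle GPUs instead
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
CUDA_VISIBLE_DEVICES=0,2 \
dbgpt start webserver --config /app/configs/dbgpt-local-vllm.toml
```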