Using the quantized RoboBrain2.0-7B model RoboBrain2.0-7B-Q8_0.gguf


My original plan was to deploy the native RoboBrain2.0-7B model locally with vLLM. It was my first time touching vLLM, though, and I did not really understand the parameters, so the launch failed with the error below:

(ame) dora@dora:~/adora_ws$ python -m vllm.entrypoints.openai.api_server \
    --model ./RoboBrain2.0-7B \
    --gpu-memory-utilization 0.88 \
    --max-model-len 8192 \
    --max-num-seqs 128 \
    --port 4567 \
    --trust-remote-code \
    --enable-chunked-prefill \
    --pipeline-parallel-size 1
INFO 08-02 16:50:03 [__init__.py:235] Automatically detected platform cuda.
INFO 08-02 16:50:04 [api_server.py:1758] vLLM API server version 0.10.0rc3.dev15+g5a19a6c67.d20250801
INFO 08-02 16:50:04 [cli_args.py:264] non-default args: {'port': 4567, 'model': './RoboBrain2.0-7B', 'trust_remote_code': True, 'max_model_len': 8192, 'gpu_memory_utilization': 0.88, 'max_num_seqs': 128, 'enable_chunked_prefill': True}
INFO 08-02 16:50:07 [config.py:1605] Using max model len 8192
INFO 08-02 16:50:07 [config.py:2416] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 08-02 16:50:09 [__init__.py:235] Automatically detected platform cuda.
INFO 08-02 16:50:10 [core.py:574] Waiting for init message from front-end.
INFO 08-02 16:50:10 [core.py:72] Initializing a V1 LLM engine (v0.10.0rc3.dev15+g5a19a6c67.d20250801) with config: model='./RoboBrain2.0-7B', speculative_config=None, tokenizer='./RoboBrain2.0-7B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=./RoboBrain2.0-7B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":256,"local_cache_dir":null}
[W802 16:50:11.690758880 ProcessGroupNCCL.cpp:915] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
INFO 08-02 16:50:11 [parallel_state.py:1102] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
WARNING 08-02 16:50:12 [profiling.py:276] The sequence length (8192) is smaller than the pre-defined worst-case total number of multimodal tokens (32768). This may cause certain multi-modal inputs to fail during inference. To avoid this, you should increase `max_model_len` or reduce `mm_counts`.
WARNING 08-02 16:50:12 [topk_topp_sampler.py:59] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
INFO 08-02 16:50:12 [gpu_model_runner.py:1843] Starting to load model ./RoboBrain2.0-7B...
INFO 08-02 16:50:12 [gpu_model_runner.py:1875] Loading model from scratch...
WARNING 08-02 16:50:12 [vision.py:91] Current `vllm-flash-attn` has a bug inside vision module, so we use xformers backend instead. You can run `pip install flash-attn` to use flash-attention backend.
INFO 08-02 16:50:12 [cuda.py:290] Using Flash Attention backend on V1 engine.
ERROR 08-02 16:50:12 [core.py:634] EngineCore failed to start.
ERROR 08-02 16:50:12 [core.py:634] Traceback (most recent call last):
ERROR 08-02 16:50:12 [core.py:634]   File "/home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/vllm-0.10.0rc3.dev15+g5a19a6c67.d20250801-py3.12-linux-x86_64.egg/vllm/v1/engine/core.py", line 625, in run_engine_core
ERROR 08-02 16:50:12 [core.py:634]     engine_core = EngineCoreProc(*args, **kwargs)
ERROR 08-02 16:50:12 [core.py:634]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-02 16:50:12 [core.py:634]   File "/home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/vllm-0.10.0rc3.dev15+g5a19a6c67.d20250801-py3.12-linux-x86_64.egg/vllm/v1/engine/core.py", line 443, in __init__
ERROR 08-02 16:50:12 [core.py:634]     super().__init__(vllm_config, executor_class, log_stats,
ERROR 08-02 16:50:12 [core.py:634]   File "/home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/vllm-0.10.0rc3.dev15+g5a19a6c67.d20250801-py3.12-linux-x86_64.egg/vllm/v1/engine/core.py", line 78, in __init__
ERROR 08-02 16:50:12 [core.py:634]     self.model_executor = executor_class(vllm_config)
ERROR 08-02 16:50:12 [core.py:634]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-02 16:50:12 [core.py:634]   File "/home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/vllm-0.10.0rc3.dev15+g5a19a6c67.d20250801-py3.12-linux-x86_64.egg/vllm/executor/executor_base.py", line 53, in __init__
ERROR 08-02 16:50:12 [core.py:634]     self._init_executor()
ERROR 08-02 16:50:12 [core.py:634]   File "/home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/vllm-0.10.0rc3.dev15+g5a19a6c67.d20250801-py3.12-linux-x86_64.egg/vllm/executor/uniproc_executor.py", line 49, in _init_executor
ERROR 08-02 16:50:12 [core.py:634]     self.collective_rpc("load_model")
ERROR 08-02 16:50:12 [core.py:634]   File "/home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/vllm-0.10.0rc3.dev15+g5a19a6c67.d20250801-py3.12-linux-x86_64.egg/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
ERROR 08-02 16:50:12 [core.py:634]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 08-02 16:50:12 [core.py:634]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-02 16:50:12 [core.py:634]   File "/home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/vllm-0.10.0rc3.dev15+g5a19a6c67.d20250801-py3.12-linux-x86_64.egg/vllm/utils/__init__.py", line 2986, in run_method
ERROR 08-02 16:50:12 [core.py:634]     return func(*args, **kwargs)
ERROR 08-02 16:50:12 [core.py:634]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 08-02 16:50:12 [core.py:634]   File "/home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/vllm-0.10.0rc3.dev15+g5a19a6c67.d20250801-py3.12-linux-x86_64.egg/vllm/v1/worker/gpu_worker.py", line 201, in load_model
ERROR 08-02 16:50:12 [core.py:634]     self.model_runner.load_model(eep_scale_up=eep_scale_up)
ERROR 08-02 16:50:12 [core.py:634]   File "/home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/vllm-0.10.0rc3.dev15+g5a19a6c67.d20250801-py3.12-linux-x86_64.egg/vllm/v1/worker/gpu_model_runner.py", line 1876, in load_model
ERROR 08-02 16:50:12 [core.py:634]     self.model = model_loader.load_model(
ERROR 08-02 16:50:12 [core.py:634]                  ^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-02 16:50:12 [core.py:634]   File "/home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/vllm-0.10.0rc3.dev15+g5a19a6c67.d20250801-py3.12-linux-x86_64.egg/vllm/model_executor/model_loader/base_loader.py", line 44, in load_model
ERROR 08-02 16:50:12 [core.py:634]     model = initialize_model(vllm_config=vllm_config,
ERROR 08-02 16:50:12 [core.py:634]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-02 16:50:12 [core.py:634]   File "/home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/vllm-0.10.0rc3.dev15+g5a19a6c67.d20250801-py3.12-linux-x86_64.egg/vllm/model_executor/model_loader/utils.py", line 67, in initialize_model
ERROR 08-02 16:50:12 [core.py:634]     return model_class(vllm_config=vllm_config, prefix=prefix)
ERROR 08-02 16:50:12 [core.py:634]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-02 16:50:12 [core.py:634]   File "/home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/vllm-0.10.0rc3.dev15+g5a19a6c67.d20250801-py3.12-linux-x86_64.egg/vllm/model_executor/models/qwen2_5_vl.py", line 862, in __init__
ERROR 08-02 16:50:12 [core.py:634]     self.language_model = init_vllm_registered_model(
ERROR 08-02 16:50:12 [core.py:634]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-02 16:50:12 [core.py:634]   File "/home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/vllm-0.10.0rc3.dev15+g5a19a6c67.d20250801-py3.12-linux-x86_64.egg/vllm/model_executor/models/utils.py", line 316, in init_vllm_registered_model
ERROR 08-02 16:50:12 [core.py:634]     return initialize_model(vllm_config=vllm_config, prefix=prefix)
ERROR 08-02 16:50:12 [core.py:634]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-02 16:50:12 [core.py:634]   File "/home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/vllm-0.10.0rc3.dev15+g5a19a6c67.d20250801-py3.12-linux-x86_64.egg/vllm/model_executor/model_loader/utils.py", line 67, in initialize_model
ERROR 08-02 16:50:12 [core.py:634]     return model_class(vllm_config=vllm_config, prefix=prefix)
ERROR 08-02 16:50:12 [core.py:634]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-02 16:50:12 [core.py:634]   File "/home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/vllm-0.10.0rc3.dev15+g5a19a6c67.d20250801-py3.12-linux-x86_64.egg/vllm/model_executor/models/qwen2.py", line 454, in __init__
ERROR 08-02 16:50:12 [core.py:634]     self.lm_head = ParallelLMHead(config.vocab_size,
ERROR 08-02 16:50:12 [core.py:634]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-02 16:50:12 [core.py:634]   File "/home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/vllm-0.10.0rc3.dev15+g5a19a6c67.d20250801-py3.12-linux-x86_64.egg/vllm/model_executor/layers/vocab_parallel_embedding.py", line 449, in __init__
ERROR 08-02 16:50:12 [core.py:634]     super().__init__(num_embeddings, embedding_dim, params_dtype,
ERROR 08-02 16:50:12 [core.py:634]   File "/home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/vllm-0.10.0rc3.dev15+g5a19a6c67.d20250801-py3.12-linux-x86_64.egg/vllm/model_executor/layers/vocab_parallel_embedding.py", line 266, in __init__
ERROR 08-02 16:50:12 [core.py:634]     self.quant_method.create_weights(self,
ERROR 08-02 16:50:12 [core.py:634]   File "/home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/vllm-0.10.0rc3.dev15+g5a19a6c67.d20250801-py3.12-linux-x86_64.egg/vllm/model_executor/layers/vocab_parallel_embedding.py", line 34, in create_weights
ERROR 08-02 16:50:12 [core.py:634]     weight = Parameter(torch.empty(sum(output_partition_sizes),
ERROR 08-02 16:50:12 [core.py:634]                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-02 16:50:12 [core.py:634]   File "/home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/torch/utils/_device.py", line 103, in __torch_function__
ERROR 08-02 16:50:12 [core.py:634]     return func(*args, **kwargs)
ERROR 08-02 16:50:12 [core.py:634]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 08-02 16:50:12 [core.py:634] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.02 GiB. GPU 0 has a total capacity of 15.48 GiB of which 251.69 MiB is free. Process 3061 has 195.38 MiB memory in use. Including non-PyTorch memory, this process has 15.01 GiB memory in use. Of the allocated memory 14.63 GiB is allocated by PyTorch, and 161.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Process EngineCore_0:
Traceback (most recent call last):
  File "/home/dora/miniconda3/envs/ame/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/dora/miniconda3/envs/ame/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/vllm-0.10.0rc3.dev15+g5a19a6c67.d20250801-py3.12-linux-x86_64.egg/vllm/v1/engine/core.py", line 638, in run_engine_core
    raise e
  File "/home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/vllm-0.10.0rc3.dev15+g5a19a6c67.d20250801-py3.12-linux-x86_64.egg/vllm/v1/engine/core.py", line 625, in run_engine_core
    engine_core = EngineCoreProc(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/vllm-0.10.0rc3.dev15+g5a19a6c67.d20250801-py3.12-linux-x86_64.egg/vllm/v1/engine/core.py", line 443, in __init__
    super().__init__(vllm_config, executor_class, log_stats,
  File "/home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/vllm-0.10.0rc3.dev15+g5a19a6c67.d20250801-py3.12-linux-x86_64.egg/vllm/v1/engine/core.py", line 78, in __init__
    self.model_executor = executor_class(vllm_config)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/vllm-0.10.0rc3.dev15+g5a19a6c67.d20250801-py3.12-linux-x86_64.egg/vllm/executor/executor_base.py", line 53, in __init__
    self._init_executor()
  File "/home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/vllm-0.10.0rc3.dev15+g5a19a6c67.d20250801-py3.12-linux-x86_64.egg/vllm/executor/uniproc_executor.py", line 49, in _init_executor
    self.collective_rpc("load_model")
  File "/home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/vllm-0.10.0rc3.dev15+g5a19a6c67.d20250801-py3.12-linux-x86_64.egg/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
    answer = run_method(self.driver_worker, method, args, kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/vllm-0.10.0rc3.dev15+g5a19a6c67.d20250801-py3.12-linux-x86_64.egg/vllm/utils/__init__.py", line 2986, in run_method
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/vllm-0.10.0rc3.dev15+g5a19a6c67.d20250801-py3.12-linux-x86_64.egg/vllm/v1/worker/gpu_worker.py", line 201, in load_model
    self.model_runner.load_model(eep_scale_up=eep_scale_up)
  File "/home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/vllm-0.10.0rc3.dev15+g5a19a6c67.d20250801-py3.12-linux-x86_64.egg/vllm/v1/worker/gpu_model_runner.py", line 1876, in load_model
    self.model = model_loader.load_model(
                 ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/vllm-0.10.0rc3.dev15+g5a19a6c67.d20250801-py3.12-linux-x86_64.egg/vllm/model_executor/model_loader/base_loader.py", line 44, in load_model
    model = initialize_model(vllm_config=vllm_config,
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/vllm-0.10.0rc3.dev15+g5a19a6c67.d20250801-py3.12-linux-x86_64.egg/vllm/model_executor/model_loader/utils.py", line 67, in initialize_model
    return model_class(vllm_config=vllm_config, prefix=prefix)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/vllm-0.10.0rc3.dev15+g5a19a6c67.d20250801-py3.12-linux-x86_64.egg/vllm/model_executor/models/qwen2_5_vl.py", line 862, in __init__
    self.language_model = init_vllm_registered_model(
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/vllm-0.10.0rc3.dev15+g5a19a6c67.d20250801-py3.12-linux-x86_64.egg/vllm/model_executor/models/utils.py", line 316, in init_vllm_registered_model
    return initialize_model(vllm_config=vllm_config, prefix=prefix)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/vllm-0.10.0rc3.dev15+g5a19a6c67.d20250801-py3.12-linux-x86_64.egg/vllm/model_executor/model_loader/utils.py", line 67, in initialize_model
    return model_class(vllm_config=vllm_config, prefix=prefix)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/vllm-0.10.0rc3.dev15+g5a19a6c67.d20250801-py3.12-linux-x86_64.egg/vllm/model_executor/models/qwen2.py", line 454, in __init__
    self.lm_head = ParallelLMHead(config.vocab_size,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/vllm-0.10.0rc3.dev15+g5a19a6c67.d20250801-py3.12-linux-x86_64.egg/vllm/model_executor/layers/vocab_parallel_embedding.py", line 449, in __init__
    super().__init__(num_embeddings, embedding_dim, params_dtype,
  File "/home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/vllm-0.10.0rc3.dev15+g5a19a6c67.d20250801-py3.12-linux-x86_64.egg/vllm/model_executor/layers/vocab_parallel_embedding.py", line 266, in __init__
    self.quant_method.create_weights(self,
  File "/home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/vllm-0.10.0rc3.dev15+g5a19a6c67.d20250801-py3.12-linux-x86_64.egg/vllm/model_executor/layers/vocab_parallel_embedding.py", line 34, in create_weights
    weight = Parameter(torch.empty(sum(output_partition_sizes),
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/torch/utils/_device.py", line 103, in __torch_function__
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.02 GiB. GPU 0 has a total capacity of 15.48 GiB of which 251.69 MiB is free. Process 3061 has 195.38 MiB memory in use. Including non-PyTorch memory, this process has 15.01 GiB memory in use. Of the allocated memory 14.63 GiB is allocated by PyTorch, and 161.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank0]:[W802 16:50:13.071911493 ProcessGroupNCCL.cpp:1521] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/vllm-0.10.0rc3.dev15+g5a19a6c67.d20250801-py3.12-linux-x86_64.egg/vllm/entrypoints/openai/api_server.py", line 1859, in <module>
    uvloop.run(run_server(args))
  File "/home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/uvloop-0.21.0-py3.12-linux-x86_64.egg/uvloop/__init__.py", line 109, in run
    return __asyncio.run(
           ^^^^^^^^^^^^^^
  File "/home/dora/miniconda3/envs/ame/lib/python3.12/asyncio/runners.py", line 195, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/home/dora/miniconda3/envs/ame/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/uvloop-0.21.0-py3.12-linux-x86_64.egg/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/vllm-0.10.0rc3.dev15+g5a19a6c67.d20250801-py3.12-linux-x86_64.egg/vllm/entrypoints/openai/api_server.py", line 1794, in run_server
    await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
  File "/home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/vllm-0.10.0rc3.dev15+g5a19a6c67.d20250801-py3.12-linux-x86_64.egg/vllm/entrypoints/openai/api_server.py", line 1814, in run_server_worker
    async with build_async_engine_client(args, client_config) as engine_client:
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dora/miniconda3/envs/ame/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/vllm-0.10.0rc3.dev15+g5a19a6c67.d20250801-py3.12-linux-x86_64.egg/vllm/entrypoints/openai/api_server.py", line 159, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dora/miniconda3/envs/ame/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/vllm-0.10.0rc3.dev15+g5a19a6c67.d20250801-py3.12-linux-x86_64.egg/vllm/entrypoints/openai/api_server.py", line 195, in build_async_engine_client_from_engine_args
    async_llm = AsyncLLM.from_vllm_config(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/vllm-0.10.0rc3.dev15+g5a19a6c67.d20250801-py3.12-linux-x86_64.egg/vllm/v1/engine/async_llm.py", line 163, in from_vllm_config
    return cls(
           ^^^^
  File "/home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/vllm-0.10.0rc3.dev15+g5a19a6c67.d20250801-py3.12-linux-x86_64.egg/vllm/v1/engine/async_llm.py", line 117, in __init__
    self.engine_core = EngineCoreClient.make_async_mp_client(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/vllm-0.10.0rc3.dev15+g5a19a6c67.d20250801-py3.12-linux-x86_64.egg/vllm/v1/engine/core_client.py", line 98, in make_async_mp_client
    return AsyncMPClient(*client_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/vllm-0.10.0rc3.dev15+g5a19a6c67.d20250801-py3.12-linux-x86_64.egg/vllm/v1/engine/core_client.py", line 677, in __init__
    super().__init__(
  File "/home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/vllm-0.10.0rc3.dev15+g5a19a6c67.d20250801-py3.12-linux-x86_64.egg/vllm/v1/engine/core_client.py", line 408, in __init__
    with launch_core_engines(vllm_config, executor_class,
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dora/miniconda3/envs/ame/lib/python3.12/contextlib.py", line 144, in __exit__
    next(self.gen)
  File "/home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/vllm-0.10.0rc3.dev15+g5a19a6c67.d20250801-py3.12-linux-x86_64.egg/vllm/v1/engine/utils.py", line 697, in launch_core_engines
    wait_for_engine_startup(
  File "/home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/vllm-0.10.0rc3.dev15+g5a19a6c67.d20250801-py3.12-linux-x86_64.egg/vllm/v1/engine/utils.py", line 750, in wait_for_engine_startup
    raise RuntimeError("Engine core initialization failed. "
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

At first I assumed it was an environment/dependency problem. After searching around, two suggestions kept coming up:

1. Upgrade the transformers version — suggested in a CSDN post on deploying a vLLM service with CUDA 12.8 on RTX 50-series cards (5080/5090/5090D), and in this vLLM issue:

https://github.com/vllm-project/vllm/issues/17618

2. Others said flash-attn was simply not installed, and installing it would fix things: https://github.com/vllm-project/vllm/issues/10791

Neither helped. Reading the error carefully, the real cause is simply insufficient GPU memory — the message says so explicitly:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.02 GiB. GPU 0 has a total capacity of 15.48 GiB of which 251.69 MiB is free. Process 3061 has 195.38 MiB memory in use. Including non-PyTorch memory, this process has 15.01 GiB memory in use. Of the allocated memory 14.63 GiB is allocated by PyTorch, and 161.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

After some analysis, my initial understanding of the cause is this.

A 7B (7-billion-parameter) model does not mean it only needs 7 GB of memory:

1) Model weights: the baseline memory cost. Parameters are usually loaded as 16-bit floats (float16 or bfloat16), i.e. 2 bytes per parameter.

Calculation: 7 billion parameters × 2 bytes/parameter ≈ 14 GB

The weights alone already come close to the 16 GB limit of my card.

2) KV cache: during inference (text generation) the model stores intermediate key/value states to avoid recomputation — the KV cache. Its footprint is large and grows with both sequence length and batch size.

In my launch command I set --max-model-len 8192 and --max-num-seqs 128 — a very long context plus a high concurrency limit, so the KV cache would claim an enormous amount of memory.

3) Framework overhead: PyTorch, CUDA and vLLM itself also need memory at runtime. The log line "Of the allocated memory 14.63 GiB is allocated by PyTorch" shows that most of the memory was already gone while the model was still loading.

4) Activations: intermediate results of the forward pass also take memory (inference needs no gradients, but activations still add up).

In short: 14 GB (weights) + a huge KV cache (driven by max-model-len and max-num-seqs) + framework overhead > 16 GB. That is why my 16 GB card runs out of memory — a rough estimate is sketched just below. What a pain :(
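
To make the arithmetic concrete, here is a quick back-of-envelope estimate in Python. This is only a sketch: the layer/head counts are assumed Qwen2.5-7B-style values, not numbers read from the actual RoboBrain2.0-7B config, and vLLM does not literally pre-allocate the worst-case KV cache — it just illustrates the scale.

# Rough VRAM estimate for the failed vLLM launch.
# NOTE: 28 layers / 4 KV heads / head_dim 128 are assumed Qwen2.5-7B-style values.
GiB = 1024 ** 3

n_params = 7e9
weights = n_params * 2                      # bf16 weights, 2 bytes per parameter (~14 GB)

n_layers, n_kv_heads, head_dim, dtype_bytes = 28, 4, 128, 2
kv_per_token = n_layers * 2 * n_kv_heads * head_dim * dtype_bytes   # K and V per layer, ~56 KiB

# Worst case implied by --max-num-seqs 128 and --max-model-len 8192
kv_worst_case = kv_per_token * 128 * 8192

print(f"weights       : {weights / GiB:5.1f} GiB")
print(f"KV per token  : {kv_per_token / 1024:5.1f} KiB")
print(f"KV worst case : {kv_worst_case / GiB:5.1f} GiB")    # far beyond a 16 GiB card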

Knowing this, the native 7B model is simply too big for my card, so I needed a quantized version. I found one on Hugging Face and decided to try deploying it:

https://huggingface.co/bartowski/BAAI_RoboBrain2.0-7B-GGUF

So now we need a tool that can run GGUF models.

Install `llama-cpp-python` and make sure it is built with NVIDIA GPU support:

CMAKE_ARGS="-DGGML_CUDA=on" FORCE_CMAKE=1 pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir

pip starts downloading and compiling the llama.cpp sources, but the build fails at the very end:
 

[177/177] : && /usr/bin/g++  -pthread -B /home/dora/miniconda3/envs/ame/compiler_compat -O3 -DNDEBUG  vendor/llama.cpp/tools/mtmd/CMakeFiles/llama-mtmd-cli.dir/mtmd-cli.cpp.o -o vendor/llama.cpp/tools/mtmd/llama-mtmd-cli  -Wl,-rpath,/tmp/tmpicd2up21/build/vendor/llama.cpp/tools/mtmd:/tmp/tmpicd2up21/build/bin:  vendor/llama.cpp/common/libcommon.a  vendor/llama.cpp/tools/mtmd/libmtmd.so  bin/libllama.so  bin/libggml.so  bin/libggml-cpu.so  bin/libggml-cuda.so  bin/libggml-base.so  /usr/local/cuda-12.8/targets/x86_64-linux/lib/stubs/libcuda.so && :
      FAILED: vendor/llama.cpp/tools/mtmd/llama-mtmd-cli
      : && /usr/bin/g++  -pthread -B /home/dora/miniconda3/envs/ame/compiler_compat -O3 -DNDEBUG  vendor/llama.cpp/tools/mtmd/CMakeFiles/llama-mtmd-cli.dir/mtmd-cli.cpp.o -o vendor/llama.cpp/tools/mtmd/llama-mtmd-cli  -Wl,-rpath,/tmp/tmpicd2up21/build/vendor/llama.cpp/tools/mtmd:/tmp/tmpicd2up21/build/bin:  vendor/llama.cpp/common/libcommon.a  vendor/llama.cpp/tools/mtmd/libmtmd.so  bin/libllama.so  bin/libggml.so  bin/libggml-cpu.so  bin/libggml-cuda.so  bin/libggml-base.so  /usr/local/cuda-12.8/targets/x86_64-linux/lib/stubs/libcuda.so && :
      /home/dora/miniconda3/envs/ame/compiler_compat/ld: warning: libgomp.so.1, needed by bin/libggml-cpu.so, not found (try using -rpath or -rpath-link)
      /home/dora/miniconda3/envs/ame/compiler_compat/ld: warning: libdl.so.2, needed by /usr/local/cuda-12.8/lib64/libcudart.so.12, not found (try using -rpath or -rpath-link)
      /home/dora/miniconda3/envs/ame/compiler_compat/ld: warning: libpthread.so.0, needed by /usr/local/cuda-12.8/lib64/libcudart.so.12, not found (try using -rpath or -rpath-link)
      /home/dora/miniconda3/envs/ame/compiler_compat/ld: warning: librt.so.1, needed by /usr/local/cuda-12.8/lib64/libcudart.so.12, not found (try using -rpath or -rpath-link)
      /home/dora/miniconda3/envs/ame/compiler_compat/ld: bin/libggml-cpu.so: undefined reference to `GOMP_barrier@GOMP_1.0'
      /home/dora/miniconda3/envs/ame/compiler_compat/ld: bin/libggml-cpu.so: undefined reference to `GOMP_parallel@GOMP_4.0'
      /home/dora/miniconda3/envs/ame/compiler_compat/ld: bin/libggml-cpu.so: undefined reference to `omp_get_thread_num@OMP_1.0'
      /home/dora/miniconda3/envs/ame/compiler_compat/ld: bin/libggml-cpu.so: undefined reference to `GOMP_single_start@GOMP_1.0'
      /home/dora/miniconda3/envs/ame/compiler_compat/ld: bin/libggml-cpu.so: undefined reference to `omp_get_num_threads@OMP_1.0'
      collect2: error: ld returned 1 exit status
      ninja: build stopped: subcommand failed.
      
      
      *** CMake build failed
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for llama-cpp-python
Failed to build llama-cpp-python
ERROR: Failed to build installable wheels for some pyproject.toml based projects (llama-cpp-python)

The key lines in the error output are these:

/home/dora/miniconda3/envs/ame/compiler_compat/ld: warning: libgomp.so.1, ... not found
ld: bin/libggml-cpu.so: undefined reference to 'GOMP_barrier@GOMP_1.0'
ld: bin/libggml-cpu.so: undefined reference to 'GOMP_parallel@GOMP_4.0'

It looks like this library is missing, so I tried installing it:

conda install -c conda-forge libgomp

It turned out the package was already installed, so the problem must be happening at link time. Let's look for the library on the system:

Search command:
find /usr -name libgomp.so.1
Result:
/usr/lib/x86_64-linux-gnu/libgomp.so.1

So the library does exist — the build simply failed to find it at link time. We therefore point the build at it explicitly and compile again. Remember to replace the path below with the one you found above; it is usually under /usr/lib/x86_64-linux-gnu.

Then, before running the install command again, set LD_LIBRARY_PATH:
export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH

This time the build completes. But running the model still fails, with the following error:

(ame) dora@dora:~/adora_ws$ python -m llama_cpp.server \
    --model ./BAAI_RoboBrain2.0-7B-Q8_0.gguf \
    --n_gpu_layers -1 \
    --n_ctx 4096 \
    --port 4567
Traceback (most recent call last):
  File "/home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/llama_cpp/_ctypes_extensions.py", line 67, in load_shared_library
    return ctypes.CDLL(str(lib_path), **cdll_args)  # type: ignore
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dora/miniconda3/envs/ame/lib/python3.12/ctypes/__init__.py", line 379, in __init__
    self._handle = _dlopen(self._name, mode)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: /home/dora/miniconda3/envs/ame/bin/../lib/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/llama_cpp/lib/libggml-cuda.so)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<frozen runpy>", line 189, in _run_module_as_main
  File "<frozen runpy>", line 112, in _get_module_details
  File "/home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/llama_cpp/__init__.py", line 1, in <module>
    from .llama_cpp import *
  File "/home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/llama_cpp/llama_cpp.py", line 38, in <module>
    _lib = load_shared_library(_lib_base_name, _base_path)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/llama_cpp/_ctypes_extensions.py", line 69, in load_shared_library
    raise RuntimeError(f"Failed to load shared library '{lib_path}': {e}")
RuntimeError: Failed to load shared library '/home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/llama_cpp/lib/libllama.so': /home/dora/miniconda3/envs/ame/bin/../lib/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /home/dora/miniconda3/envs/ame/lib/python3.12/site-packages/llama_cpp/lib/libggml-cuda.so)
(ame) dora@dora:~/adora_ws$ 

The error message is explicit: the libstdc++.so.6 shipped in the conda environment is too old and lacks the GLIBCXX_3.4.30 symbol version required by libggml-cuda.so. This library is an old acquaintance — it is the GNU C++ standard library, and the same problem came up when I cross-compiled dora-rs for aarch64. Back then the fix was to install a newer gcc and add a symlink to get through the build, but here we don't need multiple gcc versions for cross-compilation, so simply upgrading gcc (and the libstdc++ that ships with it) inside the conda environment is enough:

conda install -c conda-forge gcc gxx -y

Next, clean out the previously built llama-cpp-python, which may still be linked against the old library:

pip uninstall llama-cpp-python -y

Then run the download-and-build command once more:

CMAKE_ARGS="-DGGML_CUDA=on" FORCE_CMAKE=1 pip install llama-cpp-python --no-cache-dir

Running the server after that hit a string of missing-dependency errors — I had forgotten to install the server dependencies. The following command pulls in all of them:

pip install "llama-cpp-python[server]"
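
Before starting the server again, a quick sanity check that the freshly built wheel imports cleanly (i.e. no more GLIBCXX error) and was compiled with GPU offload. A minimal sketch — it assumes your llama-cpp-python version exposes the low-level llama_supports_gpu_offload binding:

# Sanity check: import the new wheel and confirm CUDA offload is compiled in.
import llama_cpp
from llama_cpp.llama_cpp import llama_supports_gpu_offload  # low-level binding (assumed available)

print("llama-cpp-python version:", llama_cpp.__version__)
print("GPU offload supported   :", bool(llama_supports_gpu_offload()))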

Once again (attempt number four, but who's counting), run the model:

python -m llama_cpp.server \
    --model ./BAAI_RoboBrain2.0-7B-Q8_0.gguf \
    --n_gpu_layers -1 \
    --n_ctx 4096 \
    --port 4567

And this time the quantized RoboBrain2.0-7B model runs. Keep this terminal open, open a new one, and send the following request to test it:

curl http://localhost:4567/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "./BAAI_RoboBrain2.0-7B-Q8_0.gguf",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful robot assistant."
      },
      {
        "role": "user",
        "content": "pick up the red block from the table"
      }
    ],
    "temperature": 0.7
  }'

Excellent — it works and the test passes. Here is what the server's log output means:

  1. llama_perf_context_print: load time = 236.81 ms

    • Load time: initializing the model context took 236.81 ms — very fast.
  2. llama_perf_context_print: prompt eval time = 236.72 ms / 28 tokens ( 8.45 ms per token, 118.28 tokens per second)

    • Prompt processing (prefill): handling the input prompt ("You are a helpful robot assistant." + "pick up the red block from the table"), 28 tokens in total, took 236.72 ms.
    • Speed: 118.28 tokens/s. This stage is computed in parallel, so it is very fast.
  3. llama_perf_context_print: eval time = 680.98 ms / 43 runs ( 15.84 ms per token, 63.14 tokens per second)

    • Generation (decoding): the model produced a 43-token reply in 680.98 ms.
    • Speed: 63.14 tokens/s. This is the key metric for LLM inference performance on a given piece of hardware; for a 7B model on a consumer GPU it is an excellent figure, easily enough for smooth real-time interaction.
  4. llama_perf_context_print: total time = 991.89 ms / 71 tokens

    • Total time: from receiving the request to finishing the reply took just under one second.
  5. INFO: 127.0.0.1:44610 - "POST /v1/chat/completions HTTP/1.1" 200 OK

    • Final confirmation: the server handled the POST request and returned 200 OK, which means the model's answer was printed in the terminal where the curl command was run.

OK! We now have a correctly configured, stable local LLM inference service with solid performance. The whole journey went from an out-of-memory crash in vLLM to over 60 tokens/s of decoding speed on llama.cpp, and the full deployment pipeline is working end to end. Next step: hook it up to the Adora robot and test it there!
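
As a first step toward that integration, the same request can be sent from Python instead of curl. Below is a minimal client sketch; it assumes the requests package is installed and that the server is still listening on port 4567 as above.

# Call the OpenAI-compatible endpoint served by llama_cpp.server (port 4567 assumed).
import requests

payload = {
    "model": "./BAAI_RoboBrain2.0-7B-Q8_0.gguf",
    "messages": [
        {"role": "system", "content": "You are a helpful robot assistant."},
        {"role": "user", "content": "pick up the red block from the table"},
    ],
    "temperature": 0.7,
}

resp = requests.post("http://localhost:4567/v1/chat/completions", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])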
