Paper Reading: Reward-Guided Speculative Decoding for Efficient LLM Reasoning

Limitations of Speculative Decoding

Speculative decoding accelerates LLM inference, but it demands strict unbiasedness: the output distribution must match the target model's exactly, so in theory generation quality is never degraded. This unbiasedness condition is, however, overly demanding. Draft-model tokens that are high quality and even correct may still be rejected simply because of distribution mismatch or a slightly lower probability under the target model, which incurs extra target-model calls. The problem is especially pronounced in complex reasoning.
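
For contrast, the unbiased acceptance rule of standard speculative decoding can be sketched as follows (a minimal NumPy sketch under my own naming, not any particular library's API; `p_target` and `q_draft` stand for the two models' next-token distributions over the vocabulary). It makes the failure mode concrete: even a correct draft token is rejected with probability $1 - \min(1, p_M(y)/p_m(y))$ whenever the target assigns it lower probability than the draft does.

```python
import numpy as np

def standard_sd_accept(p_target: np.ndarray, q_draft: np.ndarray, y: int,
                       rng: np.random.Generator) -> int:
    """Vanilla speculative decoding acceptance for a single draft token.

    y was sampled from the draft distribution q_draft. Keep it with
    probability min(1, p_target[y] / q_draft[y]); otherwise resample from
    the residual max(0, p_target - q_draft), renormalized. This keeps the
    overall output distribution exactly equal to p_target (unbiased).
    """
    if rng.random() < min(1.0, p_target[y] / q_draft[y]):
        return y
    residual = np.clip(p_target - q_draft, 0.0, None)
    residual /= residual.sum()
    return int(rng.choice(len(residual), p=residual))
```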

RSD Overview

The core idea of RSD: relax the acceptance condition for draft-model outputs and allow a biased acceptance strategy.
Concretely, a process reward model (PRM) is introduced, and its reward serves as the accept/reject criterion instead of strict probability-distribution matching.
The benefit is clear: a looser acceptance condition means more draft tokens are accepted, which reduces the number of target-model calls, while supervision from the process reward model keeps generation quality under control.
[Figure: standard speculative decoding (top) vs. RSD (bottom)]

This figure reads in two halves: the top half is conventional speculative decoding, the bottom half is the proposed RSD.
In the top half, every draft produced by the draft model must be verified by a call to the target model, whereas in the bottom half RSD uses the reward to decide whether to invoke the target model at all.

Details

RSD Algorithm Flow

[Figure: the RSD decoding procedure (Algorithm 1 in the paper)]
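
Based only on the description above, the per-step loop can be sketched roughly as follows (a minimal Python sketch, not the paper's code; `draft_model`, `target_model`, `prm`, and `omega` are assumed interfaces, and the chunk-level/batched details of the real algorithm are omitted):

```python
import random

def rsd_step(z, draft_model, target_model, prm, omega, rng=random):
    """One RSD decoding step at context z (a hedged sketch, not the paper's code).

    The draft model proposes a candidate; the PRM scores it; the weight
    omega(reward) in [0, 1] is used as the probability of keeping the draft.
    Only on rejection is the (expensive) target model invoked.
    """
    y_draft = draft_model.sample(z)          # cheap proposal
    reward = prm.score(z, y_draft)           # process reward for the candidate step
    if rng.random() < omega(reward):         # reward-gated (biased) acceptance
        return y_draft                       # no target-model call needed
    return target_model.sample(z)            # fall back to the target model

def rsd_generate(prompt, draft_model, target_model, prm, omega, max_steps, is_done):
    """Run RSD until a stopping condition or the step budget is reached."""
    z = list(prompt)
    for _ in range(max_steps):
        z.append(rsd_step(z, draft_model, target_model, prm, omega))
        if is_done(z):
            break
    return z
```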

Token Acceptance

Token acceptance is governed by a reward threshold: a weighting function $\omega(\cdot)$ maps the reward into $[0,1]$, which can then be used as a probability in a sampling-based acceptance step:

$$w(y_i \mid z_i) = \omega_r(y_i \mid z_i) = \omega(r(y_i \mid z_i))$$
[Figure: Algorithm 2 from the paper]
In Algorithm 2, $\omega(\cdot)$ admits multiple implementations, such as a step function or smoother alternatives.
The authors discuss the benefits of the different implementations, and Table 1 of the paper lists the candidate weighting functions.

  • The step function is theoretically optimal.
  • The other functions are smoother.

[Table 1: candidate weighting functions $\omega(\cdot)$]
Among these, the binary step function is optimal; a few candidate forms are sketched below.
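
A hedged sketch of a few weighting functions in the spirit of Table 1 (the threshold `delta`, the temperature, and the exact functional forms are illustrative assumptions, not copied from the paper):

```python
import math

def omega_step(reward: float, delta: float = 0.7) -> float:
    """Binary step: accept the draft token iff its reward clears a threshold."""
    return 1.0 if reward >= delta else 0.0

def omega_sigmoid(reward: float, delta: float = 0.7, temperature: float = 0.1) -> float:
    """Smoother alternative: acceptance probability rises gradually around delta."""
    return 1.0 / (1.0 + math.exp(-(reward - delta) / temperature))

def omega_clip(reward: float) -> float:
    """Identity clipped to [0, 1]: use the reward itself as the acceptance probability."""
    return min(1.0, max(0.0, reward))
```

The binary step corresponds to the theoretically optimal choice noted above; the other two are examples of the smoother alternatives.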

Theoretical Analysis of the RSD Distribution

Notation:

| Symbol | Meaning |
| --- | --- |
| $x \in \mathbb{R}^{l \times d}$ | prompt |
| $y \in \mathbb{R}^{L \times d}$ | response |
| $y_{1:n}$ | $[y_1, \dots, y_n]$ |
| $z_i$ | $[x, y_{1:i-1}]$ |
| $m$ | draft model |
| $M$ | target model |
| $P_m(y_i \mid z_i)$ | distribution the draft model samples from |
| $P_M(y_i \mid z_i)$ | distribution the target model samples from |
| $r(y_i \mid z_i) = r(y_i \mid x, y_{1:i-1})$ | reward function |

A higher reward $r(y_i \mid z_i)$ indicates that, given the input $x$ and the previously generated steps $y_{1:i-1}$, the output is more likely to align with the desired response.
The expected reward of the target model $M$ should therefore be at least as large as that of the draft model $m$:

$$\mathbb{E}_{y_i \sim \mathbf{P}_M}[r(y_i \mid z_i)] \;\geq\; \mathbb{E}_{y_i \sim \mathbf{P}_m}[r(y_i \mid z_i)], \quad (1)$$

The paper then analyzes the distribution $\mathbf{P}_{\mathrm{RSD}}$ induced by RSD, a combination of $\mathbf{P}_m$ and $\mathbf{P}_M$:

$$\mathbf{P}_{\mathrm{RSD}}(y_i \mid z_i) = w(y_i \mid z_i)\,\mathbf{P}_m(y_i \mid z_i) + v(y_i \mid z_i)\,\mathbf{P}_M(y_i \mid z_i)$$

Here $w(y_i \mid z_i)$ is the weighting function, adjusted dynamically according to the reward of the draft model's output, while $v(y_i \mid z_i)$ is a constant with respect to $y_i$. This guarantees that the target model always contributes part of the probability mass, so generation never depends entirely on the draft model; in other words, the target model is always there as a fallback.
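
As a sanity check on why $v$ can indeed be a per-step constant, the mixture falls out of the sampling procedure itself (draw a candidate from $\mathbf{P}_m$, keep it with probability $w$, otherwise fall back to $\mathbf{P}_M$); the derivation below is my reconstruction from the description above rather than a quote from the paper:

```latex
\begin{aligned}
\mathbf{P}_{\mathrm{RSD}}(y_i \mid z_i)
  &= \underbrace{w(y_i \mid z_i)\,\mathbf{P}_m(y_i \mid z_i)}_{\text{draft token kept}}
   \;+\; \underbrace{\Big(\sum_{y'} \big(1 - w(y' \mid z_i)\big)\,\mathbf{P}_m(y' \mid z_i)\Big)}_{\text{draft rejected}}\;\mathbf{P}_M(y_i \mid z_i) \\
  &= w(y_i \mid z_i)\,\mathbf{P}_m(y_i \mid z_i)
   \;+\; \big(1 - \mathbb{E}_{y' \sim \mathbf{P}_m}[\,w(y' \mid z_i)\,]\big)\,\mathbf{P}_M(y_i \mid z_i)
\end{aligned}
```

So $v(y_i \mid z_i) = 1 - \mathbb{E}_{y' \sim \mathbf{P}_m}[w(y' \mid z_i)]$ depends only on $z_i$, not on $y_i$, and $\sum_{y_i} \mathbf{P}_{\mathrm{RSD}}(y_i \mid z_i) = 1$.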

Other Remarks

The gains from this method are only pronounced when the target model is significantly larger than the PRM: although target-model verification steps are reduced, a new cost is introduced, namely running the PRM to obtain rewards. A rough cost comparison is sketched below.
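
To make that trade-off concrete, here is a back-of-envelope per-token comparison against decoding with the target model alone (my own simplification, not the paper's cost model; $c_m$, $c_M$, $c_r$ are per-token costs of the draft model, target model, and PRM, and $\alpha$ is the fraction of draft tokens accepted):

```latex
\mathrm{cost}_{\mathrm{RSD}} \;\approx\; c_m + c_r + (1-\alpha)\,c_M
\qquad\text{vs.}\qquad
\mathrm{cost}_{\mathrm{target\ only}} \;\approx\; c_M
```

RSD pays off only when $c_m + c_r < \alpha\, c_M$, i.e. when the PRM (and draft model) are much cheaper than the target model and the acceptance rate is reasonably high; this ignores the batched-verification savings of standard speculative decoding, so treat it only as an order-of-magnitude argument.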
