[DeepSeek] No OOM at gpu_memory_utilization=0.7, but OOM at gpu_memory_utilization=0.8?

This is a very good question, and it exposes the complexity of how the gpu_memory_utilization parameter behaves. In vLLM, hitting OOM (Out Of Memory) when going from 0.7 to 0.8 is entirely possible and even reasonable. The parameter has not failed; the behavior follows from its underlying design and how it works.

The core point is that gpu_memory_utilization is not a precise memory limiter but a target, or budget. vLLM tries to operate within that budget, but actual memory demand can exceed this soft limit because of other factors, which leads to OOM.

Several key factors contribute to this behavior:
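
For reference, this is where the parameter enters the picture. A minimal sketch using vLLM's offline API (the model name is only an example; 0.7 is the value from the question):

```python
from vllm import LLM

# gpu_memory_utilization tells vLLM what fraction of the GPU's total memory
# it may budget for weights + KV cache + runtime overhead. It is a planning
# target, not a hard allocator limit.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model
    gpu_memory_utilization=0.7,
)
```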


1. Primary cause: block-based memory management of the KV cache

This is the most likely cause. vLLM uses PagedAttention, which partitions the KV cache into fixed-size blocks.

  • At 0.7: vLLM derives from this budget a number of blocks it considers safe. The system runs stably at that block count because total memory usage (model weights + KV cache block memory + other overhead) stays below the available physical GPU memory.
  • When you raise it to 0.8: vLLM recomputes and allocates more blocks. The extra blocks exist to serve more concurrent requests (max_num_seqs) or longer sequences.
  • The problem: the extra memory needed for these new blocks can push total usage past the physical limit of the card. The budget only grew by 10 percentage points, but the new block allocations, plus fragmentation and other incidental overhead, can drive the instantaneous peak memory demand beyond what the GPU actually has.

An analogy: your room (GPU memory) is 100 square meters. You plan to use 70 of them (0.7) to store boxes (the model weights plus the KV cache blocks). Bump the plan to 80 and the extra boxes must still share the room with everything that was already there (framework overhead, fragmentation, other processes), and it is that combined peak that no longer fits.
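
To make the arithmetic concrete, here is a rough back-of-the-envelope sketch of how the budget translates into KV cache blocks. This is not vLLM's actual profiling code, and every size below (card capacity, weight size, runtime peak, model shape) is an illustrative assumption:

```python
GiB = 1024 ** 3

# --- Illustrative assumptions, not measured values ---
total_gpu_mem = 24 * GiB   # physical memory on the card
weights       = 15 * GiB   # fp16 weights of an ~8B model
runtime_peak  = 1 * GiB    # activations, CUDA graphs, allocator overhead

# Rough per-block KV cache size: 2 (K and V) * layers * kv_heads * head_dim
# * bytes_per_elem * block_size tokens (shapes typical of an 8B Llama-style model).
layers, kv_heads, head_dim, bytes_per_elem, block_size = 32, 8, 128, 2, 16
bytes_per_block = 2 * layers * kv_heads * head_dim * bytes_per_elem * block_size  # ~2 MiB

for util in (0.7, 0.8):
    budget = util * total_gpu_mem                  # what vLLM plans around
    kv_cache_mem = budget - weights - runtime_peak
    num_blocks = int(kv_cache_mem // bytes_per_block)
    headroom = total_gpu_mem - budget              # slack left for everything else
    # The budget is a plan, not a cap: fragmentation, non-torch allocations, and
    # other processes live in `headroom`. Raising util shrinks that slack, so a
    # transient peak that fit at 0.7 can exceed physical memory at 0.8.
    print(f"util={util}: kv_blocks={num_blocks}, headroom={headroom / GiB:.1f} GiB")
```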

```bash
#!/bin/bash
# This script aims to tune the best server parameter combinations to maximize throughput for given requirement.
# See details in README (benchmarks/auto_tune/README.md).

TAG=$(date +"%Y_%m_%d_%H_%M")
BASE=""
MODEL="meta-llama/Llama-3.1-8B-Instruct"
SYSTEM="TPU"
TP=1
DOWNLOAD_DIR=""
INPUT_LEN=4000
OUTPUT_LEN=16
MIN_CACHE_HIT_PCT=0
MAX_LATENCY_ALLOWED_MS=100000000000
NUM_SEQS_LIST="128 256"
NUM_BATCHED_TOKENS_LIST="512 1024 2048 4096"

LOG_FOLDER="$BASE/auto-benchmark/$TAG"
RESULT="$LOG_FOLDER/result.txt"
PROFILE_PATH="$LOG_FOLDER/profile"

echo "result file: $RESULT"
echo "model: $MODEL"

rm -rf $LOG_FOLDER
rm -rf $PROFILE_PATH
mkdir -p $LOG_FOLDER
mkdir -p $PROFILE_PATH

cd "$BASE/vllm"

pip install -q datasets

current_hash=$(git rev-parse HEAD)
echo "hash:$current_hash" >> "$RESULT"
echo "current_hash: $current_hash"

best_throughput=0
best_max_num_seqs=0
best_num_batched_tokens=0
best_goodput=0

start_server() {
    local gpu_memory_utilization=$1
    local max_num_seqs=$2
    local max_num_batched_tokens=$3
    local vllm_log=$4
    local profile_dir=$5

    pkill -f vllm

    VLLM_USE_V1=1 VLLM_SERVER_DEV_MODE=1 VLLM_TORCH_PROFILER_DIR=$profile_dir vllm serve $MODEL \
        --disable-log-requests \
        --port 8004 \
        --gpu-memory-utilization $gpu_memory_utilization \
        --max-num-seqs $max_num_seqs \
        --max-num-batched-tokens $max_num_batched_tokens \
        --tensor-parallel-size $TP \
        --enable-prefix-caching \
        --load-format dummy \
        --download-dir "$DOWNLOAD_DIR" \
        --max-model-len $(( INPUT_LEN+OUTPUT_LEN )) > "$vllm_log" 2>&1 &

    # wait for 10 minutes...
    server_started=0
    for i in {1..60}; do
        RESPONSE=$(curl -s -X GET "http://0.0.0.0:8004/health" -w "%{http_code}" -o /dev/stdout)
        STATUS_CODE=$(echo "$RESPONSE" | tail -n 1)
        if [[ "$STATUS_CODE" -eq 200 ]]; then
            server_started=1
            break
        else
            sleep 10
        fi
    done
    if (( ! server_started )); then
        echo "server did not start within 10 minutes. Please check server log at $vllm_log".
        return 1
    else
        return 0
    fi
}

update_best_profile() {
    local profile_dir=$1
    local profile_index=$2
    sorted_paths=($(find "$profile_dir" -maxdepth 1 -not -path "$profile_dir" | sort))
    selected_profile_file=
    if [[ "$SYSTEM" == "TPU" ]]; then
        selected_profile_file="${sorted_paths[$profile_index]}/*.xplane.pb"
    fi
    if [[ "$SYSTEM" == "GPU" ]]; then
        selected_profile_file="${sorted_paths[$profile_index]}"
    fi
    rm -f $PROFILE_PATH/*
    cp $selected_profile_file $PROFILE_PATH
}

run_benchmark() {
    local max_num_seqs=$1
    local max_num_batched_tokens=$2
    local gpu_memory_utilization=$3
    echo "max_num_seq: $max_num_seqs, max_num_batched_tokens: $max_num_batched_tokens"
    local vllm_log="$LOG_FOLDER/vllm_log_${max_num_seqs}_${max_num_batched_tokens}.txt"
    local profile_dir="$LOG_FOLDER/profile_${max_num_seqs}_${max_num_batched_tokens}"
    echo "vllm_log: $vllm_log"
    echo
    rm -f $vllm_log
    mkdir -p $profile_dir
    pkill -f vllm
    local profile_index=0

    echo "starting server..."
    start_server $gpu_memory_utilization $max_num_seqs $max_num_batched_tokens $vllm_log $profile_dir
    result=$?
    if [[ "$result" -eq 1 ]]; then
        echo "server failed to start. gpu_memory_utilization:$gpu_memory_utilization, max_num_seqs:$max_num_seqs, max_num_batched_tokens: $max_num_batched_tokens"
    else
        echo "server started."
    fi
    echo

    echo "run benchmark test..."
    meet_latency_requirement=0
    # get a basic qps by using request-rate inf
    bm_log="$LOG_FOLDER/bm_log_${max_num_seqs}_${max_num_batched_tokens}_requestrate_inf.txt"
    prefix_len=$(( INPUT_LEN * MIN_CACHE_HIT_PCT / 100 ))
    adjusted_input_len=$(( INPUT_LEN - prefix_len ))
    python3 benchmarks/benchmark_serving.py \
        --backend vllm \
        --model $MODEL \
        --dataset-name random \
        --random-input-len $adjusted_input_len \
        --random-output-len $OUTPUT_LEN \
        --ignore-eos \
        --disable-tqdm \
        --request-rate inf \
        --percentile-metrics ttft,tpot,itl,e2el \
        --goodput e2el:$MAX_LATENCY_ALLOWED_MS \
        --num-prompts 1000 \
        --random-prefix-len $prefix_len \
        --port 8004 \
        --profile &> "$bm_log"
    throughput=$(grep "Request throughput (req/s):" "$bm_log" | sed 's/[^0-9.]//g')
    e2el=$(grep "P99 E2EL (ms):" "$bm_log" | awk '{print $NF}')
    goodput=$(grep "Request goodput (req/s):" "$bm_log" | sed 's/[^0-9.]//g')

    if (( $(echo "$e2el <= $MAX_LATENCY_ALLOWED_MS" | bc -l) )); then
        meet_latency_requirement=1
        request_rate=inf
    fi

    if (( ! meet_latency_requirement )); then
        # start from request-rate as int(throughput) + 1
        request_rate=$((${throughput%.*} + 1))
        while ((request_rate > 0)); do
            profile_index=$((profile_index+1))
            # clear prefix cache
            curl -X POST http://0.0.0.0:8004/reset_prefix_cache
            sleep 5
            bm_log="$LOG_FOLDER/bm_log_${max_num_seqs}_${max_num_batched_tokens}_requestrate_${request_rate}.txt"
            python3 benchmarks/benchmark_serving.py \
                --backend vllm \
                --model $MODEL \
                --dataset-name random \
                --random-input-len $adjusted_input_len \
                --random-output-len $OUTPUT_LEN \
                --ignore-eos \
                --disable-tqdm \
                --request-rate $request_rate \
                --percentile-metrics ttft,tpot,itl,e2el \
                --goodput e2el:$MAX_LATENCY_ALLOWED_MS \
                --num-prompts 100 \
                --random-prefix-len $prefix_len \
                --port 8004 &> "$bm_log"
            throughput=$(grep "Request throughput (req/s):" "$bm_log" | sed 's/[^0-9.]//g')
            e2el=$(grep "P99 E2EL (ms):" "$bm_log" | awk '{print $NF}')
            goodput=$(grep "Request goodput (req/s):" "$bm_log" | sed 's/[^0-9.]//g')
            if (( $(echo "$e2el <= $MAX_LATENCY_ALLOWED_MS" | bc -l) )); then
                meet_latency_requirement=1
                break
            fi
            request_rate=$((request_rate-1))
        done
    fi

    # write the results and update the best result.
    if ((meet_latency_requirement)); then
        echo "max_num_seqs: $max_num_seqs, max_num_batched_tokens: $max_num_batched_tokens, request_rate: $request_rate, e2el: $e2el, throughput: $throughput, goodput: $goodput"
        echo "max_num_seqs: $max_num_seqs, max_num_batched_tokens: $max_num_batched_tokens, request_rate: $request_rate, e2el: $e2el, throughput: $throughput, goodput: $goodput" >> "$RESULT"
        if (( $(echo "$throughput > $best_throughput" | bc -l) )); then
            best_throughput=$throughput
            best_max_num_seqs=$max_num_seqs
            best_num_batched_tokens=$max_num_batched_tokens
            best_goodput=$goodput
            if [[ "$SYSTEM" == "TPU" ]]; then
                update_best_profile "$profile_dir/plugins/profile" $profile_index
            fi
            if [[ "$SYSTEM" == "GPU" ]]; then
                update_best_profile "$profile_dir" $profile_index
            fi
        fi
    else
        echo "max_num_seqs: $max_num_seqs, max_num_batched_tokens: $max_num_batched_tokens does not meet latency requirement ${MAX_LATENCY_ALLOWED_MS}"
        echo "max_num_seqs: $max_num_seqs, max_num_batched_tokens: $max_num_batched_tokens does not meet latency requirement ${MAX_LATENCY_ALLOWED_MS}" >> "$RESULT"
    fi
    echo "best_max_num_seqs: $best_max_num_seqs, best_num_batched_tokens: $best_num_batched_tokens, best_throughput: $best_throughput"

    pkill vllm
    sleep 10
    printf '=%.0s' $(seq 1 20)
    return 0
}

read -r -a num_seqs_list <<< "$NUM_SEQS_LIST"
read -r -a num_batched_tokens_list <<< "$NUM_BATCHED_TOKENS_LIST"

# first find out the max gpu-memory-utilization without HBM OOM.
gpu_memory_utilization=0.98
find_gpu_memory_utilization=0
while (( $(echo "$gpu_memory_utilization >= 0.9" | bc -l) )); do
    start_server $gpu_memory_utilization "${num_seqs_list[-1]}" "${num_batched_tokens_list[-1]}" "$LOG_FOLDER/vllm_log_gpu_memory_utilization_$gpu_memory_utilization.log"
    result=$?
    if [[ "$result" -eq 0 ]]; then
        find_gpu_memory_utilization=1
        break
    else
        gpu_memory_utilization=$(echo "$gpu_memory_utilization - 0.01" | bc)
    fi
done

if [[ "$find_gpu_memory_utilization" -eq 1 ]]; then
    echo "Using gpu_memory_utilization=$gpu_memory_utilization to serve model."
else
    echo "Cannot find a proper gpu_memory_utilization over 0.9 to serve the model, please check logs in $LOG_FOLDER."
    exit 1
fi

for num_seqs in "${num_seqs_list[@]}"; do
    for num_batched_tokens in "${num_batched_tokens_list[@]}"; do
        run_benchmark $num_seqs $num_batched_tokens $gpu_memory_utilization
    done
done
echo "finish permutations"

echo "best_max_num_seqs: $best_max_num_seqs, best_num_batched_tokens: $best_num_batched_tokens, best_throughput: $best_throughput, profile saved in: $PROFILE_PATH"
echo "best_max_num_seqs: $best_max_num_seqs, best_num_batched_tokens: $best_num_batched_tokens, best_throughput: $best_throughput, profile saved in: $PROFILE_PATH" >> "$RESULT"
```

What is this code? Please analyze and explain it in detail.