1. Purchase an Alibaba Cloud Server
Purchase an instance with the specifications shown above. This walkthrough uses a single-GPU instance with one NVIDIA A10 (24 GB).
2. Install vLLM
sudo apt-get update
sudo apt-get install python3-venv
Create a virtual environment:
python3 -m venv vllm  # vllm is the environment name; use any name you like
Activate the virtual environment:
source vllm/bin/activate
Install vLLM inside the virtual environment:
pip install --upgrade pip  # upgrade pip (optional)
pip install vllm
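A quick way to confirm the install worked is to import the package and print its version (the output will vary with the release you installed):
python3 -c "import vllm; print(vllm.__version__)"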
3. Run DeepSeek-R1-Distill-Qwen-1.5B
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --tensor-parallel-size 1 --max-model-len 4096 --enforce-eager
Running this directly fails with an error saying the GPU cannot be found. Fix it as follows:
# Remove any existing NVIDIA driver
sudo apt-get purge 'nvidia*'  # quote the pattern so the shell does not expand it
# Add the graphics-drivers PPA
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-get update
# List the driver versions available for this GPU
sudo ubuntu-drivers devices
driver : nvidia-driver-560 - third-party non-free recommended
# Install the recommended version (a reboot may be required before the new driver loads)
sudo apt-get install nvidia-driver-560 -y
# Verify the installation
(vllm) root@iZbp13dby3bc92091yih5zZ:~/vllm# nvidia-smi
Fri Mar  7 16:18:26 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A10                    Off  |   00000000:00:07.0 Off |                    0 |
|  0%   43C    P0             56W /  150W |       1MiB /  23028MiB |     12%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
(vllm) root@iZbp13dby3bc92091yih5zZ:~/vllm# python3 -c "import torch; print(torch.cuda.is_available())"
True
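Optionally, also confirm that PyTorch sees the card by name:
python3 -c "import torch; print(torch.cuda.get_device_name(0))"  # expect: NVIDIA A10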
Run the serve command again:
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --tensor-parallel-size 1 --max-model-len 32768 --enforce-eager
It still fails to start, this time because the model weights cannot be downloaded from Hugging Face. Use a mirror and download the model locally:
# Point the Hugging Face tooling at a mirror
export HF_ENDPOINT=https://hf-mirror.com
# Create a local directory for the model
mkdir -p /root/vllm/DeepSeek-R1-Distill-Qwen-1.5B
# Start the model download
huggingface-cli download deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --local-dir /root/vllm/DeepSeek-R1-Distill-Qwen-1.5B/
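To persist the mirror setting across shells and to sanity-check the download, something like this works (paths are the ones used in this guide; the exact file list depends on the model revision):
echo 'export HF_ENDPOINT=https://hf-mirror.com' >> ~/.bashrc  # optional: make the mirror permanent
ls /root/vllm/DeepSeek-R1-Distill-Qwen-1.5B/  # expect config.json, tokenizer files, and *.safetensors weights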
Once the download completes, serve the model from the local path:
vllm serve "/root/vllm/DeepSeek-R1-Distill-Qwen-1.5B" --tensor-parallel-size 1 --max-model-len 4096 --enforce-eager --dtype bfloat16 --gpu-memory-utilization 0.9
What each part means:
"/root/vllm/DeepSeek-R1-Distill-Qwen-1.5B"  # local model directory, also used as the served model name
--tensor-parallel-size 1       # number of GPUs for tensor parallelism (2 would require two GPUs)
--max-model-len 4096           # maximum supported context length (longer contexts need more GPU memory)
--enforce-eager                # disable CUDA Graphs (compatibility mode)
--dtype bfloat16               # weight precision; use auto to follow the checkpoint, or float16 on GPUs without bfloat16 support
--gpu-memory-utilization 0.9   # fraction of GPU memory vLLM may claim (default 0.9)
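As a back-of-the-envelope check on those numbers (my estimate, not an official vLLM figure): --gpu-memory-utilization 0.9 lets vLLM claim about 0.9 × 23028 ≈ 20725 MiB of this A10, while the 1.5B-parameter model at 2 bytes per bfloat16 weight needs only about 3 GB, so most of the budget goes to the KV cache.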
To keep the server running after your SSH session ends, start it inside tmux:
tmux new -s vllm_session  # create a session
vllm serve "/root/vllm/DeepSeek-R1-Distill-Qwen-1.5B" \
--tensor-parallel-size 1 \
--max-model-len 4096 \
--dtype auto
--gpu-memory-utilization 0.9
Detach from the session: press Ctrl + B, then D.
Reattach:
tmux attach -t vllm_session
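If you prefer not to use tmux, running the server in the background with nohup works as well (a minimal sketch; the log path is arbitrary):
nohup vllm serve "/root/vllm/DeepSeek-R1-Distill-Qwen-1.5B" \
    --tensor-parallel-size 1 \
    --max-model-len 4096 \
    --dtype auto \
    --gpu-memory-utilization 0.9 > /root/vllm/vllm.log 2>&1 &
tail -f /root/vllm/vllm.log  # watch the startup logs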
4. Test
# Call the vLLM server using curl:
curl -X POST http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    --data '{
        "model": "/root/vllm/DeepSeek-R1-Distill-Qwen-1.5B",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of China?"
            }
        ]
    }'
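The "model" field must match the name the server registered, which here is the local path. If the request fails with a model-not-found error, list the models the server is actually exposing:
curl http://localhost:8000/v1/models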
5. Explore the API Exposed by vLLM
https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#chat-template
You can also open http://localhost:8000/docs in a browser to view the interactive API documentation (from your own machine, replace localhost with the server's public IP and open port 8000 in the instance's security group).
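For scripting, the server also exposes a simple liveness endpoint that returns HTTP 200 once the model is loaded:
curl -i http://localhost:8000/health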