Command to launch the DeepSeek-R1-Distill-Qwen-32B inference service with vLLM serve (two GPUs, listening on port 9902):
CUDA_VISIBLE_DEVICES=8,9 VLLM_WORKER_MULTIPROC_METHOD=spawn vllm serve /model/DeepSeek-R1-Distill-Qwen-32B --tensor-parallel-size 2 --max-model-len 32768 --port 9902 --host 127.0.0.1
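
Before sending chat requests, you can confirm the server came up by listing the models it serves. A minimal sketch, assuming the server above is running, that queries vLLM's OpenAI-compatible /v1/models endpoint:

import requests

# Query the OpenAI-compatible model list on the port chosen above.
resp = requests.get("http://127.0.0.1:9902/v1/models")
resp.raise_for_status()
for model in resp.json()["data"]:
    print(model["id"])  # should print /model/DeepSeek-R1-Distill-Qwen-32B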
Calling the DeepSeek-R1-Distill-Qwen-32B inference service with curl:
curl http://127.0.0.1:9902/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "/model/DeepSeek-R1-Distill-Qwen-32B",
    "messages": [
        {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
        {"role": "user", "content": "Tell me something about large language models."}
    ],
    "temperature": 0.7,
    "top_p": 0.8,
    "repetition_penalty": 1.05,
    "max_tokens": 512
}'
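
Since the vLLM server exposes an OpenAI-compatible API, the same call also works through the official openai Python client. A minimal sketch, assuming the openai package is installed; the api_key is a placeholder, since vLLM does not check it by default:

from openai import OpenAI

# Point the client at the local vLLM server instead of api.openai.com.
client = OpenAI(base_url="http://127.0.0.1:9902/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="/model/DeepSeek-R1-Distill-Qwen-32B",
    messages=[
        {"role": "user", "content": "Tell me something about large language models."},
    ],
    temperature=0.7,
    max_tokens=512,
)
print(completion.choices[0].message.content)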
Calling the DeepSeek-R1-Distill-Qwen-32B inference service with Python requests:
import requests

host = "localhost"
port = 9902
# Build the OpenAI-compatible chat completions endpoint URL.
api_url = f"http://{host}:{port}/v1/chat/completions"

# Example user prompt (same question as the curl example above).
prompt = "Tell me something about large language models."

payload = {
    "model": "/model/DeepSeek-R1-Distill-Qwen-32B",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt},
    ],
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 50,
    "max_tokens": 4096,
    "presence_penalty": 0.0,
    "frequency_penalty": 0.0,
    "stop": ["<|im_end|>", "<|endoftext|>"],
}
print("############ payload:\n", payload, "\n\n")

response = requests.post(api_url, json=payload)
if response.status_code == 200:
    result = response.json()
    print("##################### Result: ", result)
    print("##################### Generated text:", result["choices"][0]["message"]["content"])
else:
    print("Error:", response.status_code, response.text)