Command to launch the DeepSeek-R1-Distill-Qwen-32B inference service with vLLM serve (two GPUs, listening on port 9902):
CUDA_VISIBLE_DEVICES=8,9 VLLM_WORKER_MULTIPROC_METHOD=spawn vllm serve /model/DeepSeek-R1-Distill-Qwen-32B --tensor-parallel-size 2 --max-model-len 32768 --port 9902 --host 127.0.0.1
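
Before sending chat requests, you can confirm the server came up by listing the models it serves. A minimal sketch, assuming the server above is running, that queries vLLM's OpenAI-compatible /v1/models endpoint:

import requests

# Query the OpenAI-compatible model list on the port chosen above.
resp = requests.get("http://127.0.0.1:9902/v1/models")
resp.raise_for_status()
for model in resp.json()["data"]:
    print(model["id"])  # should print /model/DeepSeek-R1-Distill-Qwen-32B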
Calling the DeepSeek-R1-Distill-Qwen-32B inference service with curl:
curl http://127.0.0.1:9902/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "/model/DeepSeek-R1-Distill-Qwen-32B",
    "messages": [
        {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
        {"role": "user", "content": "Tell me something about large language models."}
    ],
    "temperature": 0.7,
    "top_p": 0.8,
    "repetition_penalty": 1.05,
    "max_tokens": 512
}'
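
Since the vLLM server exposes an OpenAI-compatible API, the same call also works through the official openai Python client. A minimal sketch, assuming the openai package is installed; the api_key is a placeholder, since vLLM does not check it by default:

from openai import OpenAI

# Point the client at the local vLLM server instead of api.openai.com.
client = OpenAI(base_url="http://127.0.0.1:9902/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="/model/DeepSeek-R1-Distill-Qwen-32B",
    messages=[
        {"role": "user", "content": "Tell me something about large language models."},
    ],
    temperature=0.7,
    max_tokens=512,
)
print(completion.choices[0].message.content)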
Calling the DeepSeek-R1-Distill-Qwen-32B inference service with Python requests:
import requests

host = "localhost"
port = 9902
# Build the OpenAI-compatible chat completions endpoint URL.
api_url = f"http://{host}:{port}/v1/chat/completions"

# Example user prompt (same question as the curl example above).
prompt = "Tell me something about large language models."

payload = {
    "model": "/model/DeepSeek-R1-Distill-Qwen-32B",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt},
    ],
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 50,
    "max_tokens": 4096,
    "presence_penalty": 0.0,
    "frequency_penalty": 0.0,
    "stop": ["<|im_end|>", "<|endoftext|>"],
}
print("############ payload:\n", payload, "\n\n")

response = requests.post(api_url, json=payload)
if response.status_code == 200:
    result = response.json()
    print("##################### Result: ", result)
    print("##################### Generated text:", result["choices"][0]["message"]["content"])
else:
    print("Error:", response.status_code, response.text)