Serving DeepSeek-R1-Distill-Qwen-32B with vLLM and Calling the Service

Command to launch the DeepSeek-R1-Distill-Qwen-32B inference service with vllm serve, using two GPUs with tensor parallelism and listening on port 9902:

CUDA_VISIBLE_DEVICES=8,9 VLLM_WORKER_MULTIPROC_METHOD=spawn vllm serve /model/DeepSeek-R1-Distill-Qwen-32B --tensor-parallel-size 2 --max-model-len 32768 --port 9902 --host 127.0.0.1
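
Once the server is up, a quick sanity check is to list the models exposed by the OpenAI-compatible endpoint; the served model name defaults to the path passed to vllm serve:

curl http://127.0.0.1:9902/v1/models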

Calling the DeepSeek-R1-Distill-Qwen-32B inference service with curl:

curl http://127.0.0.1:9902/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "/model/DeepSeek-R1-Distill-Qwen-32B",
  "messages": [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": "Tell me something about large language models."}
  ],
  "temperature": 0.7,
  "top_p": 0.8,
  "repetition_penalty": 1.05,
  "max_tokens": 512
}'
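
The endpoint also supports streaming: setting "stream": true in the request body returns the reply incrementally as server-sent events instead of one final JSON object. A minimal variant of the request above:

curl http://127.0.0.1:9902/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "/model/DeepSeek-R1-Distill-Qwen-32B",
  "messages": [{"role": "user", "content": "Tell me something about large language models."}],
  "max_tokens": 512,
  "stream": true
}'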

Calling the DeepSeek-R1-Distill-Qwen-32B inference service with Python requests:

import requests

host = "localhost"
port = 9902
api_url = f"http://{host}:{port}/v1/chat/completions"

prompt = "Tell me something about large language models."

payload = {
    "model": "/model/DeepSeek-R1-Distill-Qwen-32B",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt},
    ],
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 50,  # vLLM extension to the OpenAI sampling parameters
    "max_tokens": 4096,
    "presence_penalty": 0.0,
    "frequency_penalty": 0.0,
    "stop": ["<|im_end|>", "<|endoftext|>"],
}

print("############ payload:\n", payload, "\n\n")
response = requests.post(api_url, json=payload)

if response.status_code == 200:
    result = response.json()
    print("##################### Result: ", result)
    print("##################### Generated text:", result["choices"][0]["message"]["content"])
else:
    print("Error:", response.status_code, response.text)