LLM
First, use vim to create a shell script, for example vim start_model.shell, adjust the command below for your own model and paste it in, then grant execute permission with chmod 777 start_model.shell and run the script. The specific commands are as follows:
vim start_model.shell #edit the shell script
chmod 777 start_model.shell #grant execute permission
./start_model.shell #run the shell script
curl 'http://127.0.0.1:9997/v1/models' \
-H 'Accept: */*' \
-H 'Accept-Language: zh-CN,zh;q=0.9,en;q=0.8' \
-H 'Connection: keep-alive' \
-H 'Content-Type: application/json' \
-H 'Cookie: token=no_auth' \
-H 'Origin: http://127.0.0.1:9997' \
-H 'Referer: http://127.0.0.1:9997/ui/' \
-H 'Sec-Fetch-Dest: empty' \
-H 'Sec-Fetch-Mode: cors' \
-H 'Sec-Fetch-Site: same-origin' \
-H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36' \
-H 'sec-ch-ua: "Chromium";v="124", "Google Chrome";v="124", "Not-A.Brand";v="99"' \
-H 'sec-ch-ua-mobile: ?0' \
-H 'sec-ch-ua-platform: "Linux"' \
--data-raw '{"model_uid":null,"model_name":"qwen2-instruct","model_type":"LLM","model_engine":"Vllm","model_format":"pytorch","model_size_in_billions":7,"quantization":"none","n_gpu":"auto","replica":1,"request_limits":null,"worker_ip":null,"gpu_idx":null}'
Explanation of each part
URL:
curl 'http://127.0.0.1:9997/v1/models'
Request headers:
-H 'Accept: */*'
-H 'Accept-Language: zh-CN,zh;q=0.9,en;q=0.8'
-H 'Connection: keep-alive'
-H 'Content-Type: application/json'
-H 'Cookie: token=no_auth'
-H 'Origin: http://127.0.0.1:9997'
-H 'Referer: http://127.0.0.1:9997/ui/'
-H 'Sec-Fetch-Dest: empty'
-H 'Sec-Fetch-Mode: cors'
-H 'Sec-Fetch-Site: same-origin'
-H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36'
-H 'sec-ch-ua: "Chromium";v="124", "Google Chrome";v="124", "Not-A.Brand";v="99"'
-H 'sec-ch-ua-mobile: ?0'
-H 'sec-ch-ua-platform: "Linux"'
Accept: content types the client can accept.
Accept-Language: languages the client can accept.
Connection: how the connection should be managed.
Content-Type: content type of the request body, here application/json.
Cookie: cookies sent along with the request.
Origin: origin of the request.
Referer: page the request came from.
Sec-Fetch-Dest, Sec-Fetch-Mode, Sec-Fetch-Site: security-related headers describing the request context.
User-Agent: the client's user-agent string.
sec-ch-ua, sec-ch-ua-mobile, sec-ch-ua-platform: client-hint headers describing the client's hardware and software environment.
Request body:
--data-raw '{"model_uid":null,"model_name":"qwen2-instruct","model_type":"LLM","model_engine":"Vllm,"model_format":"pytorch","model_size_in_billions":7,"quantization":"none","n_gpu":"auto","replica":1,"request_limits":null,"worker_ip":null,"gpu_idx":null}'
model_uid: unique identifier of the model.
model_name: name of the model.
model_type: type of the model (e.g. LLM for a large language model).
model_engine: inference engine used by the model (e.g. Transformers, Vllm, llama.cpp; choose according to your model).
model_format: format of the model (e.g. pytorch, ggufv2).
model_size_in_billions: size of the model in billions of parameters (use your model's parameter count; for example, I use qwen2-7b-instruct, so this is 7).
quantization: quantization scheme of the model (e.g. none, 8-bit, 4-bit).
n_gpu: number of GPUs to use (in my testing, only auto works).
replica: number of replicas of the model.
request_limits: request limits (null here).
worker_ip: IP address of the worker node (null here).
gpu_idx: GPU index (null here).
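After the launch request returns successfully, you can list the models that are currently running with a plain GET on the same endpoint (a quick sanity check, no request body needed):
# List the models currently running on the Xinference server
curl 'http://127.0.0.1:9997/v1/models'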
To choose model_name, go into the cache directory under the xinference directory, where you can see the available model names.
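For example, assuming the default XINFERENCE_HOME of ~/.xinference (adjust the path if you have changed it), the downloaded models can be listed with:
# Downloaded models are cached here; the directory names correspond to the model names
ls ~/.xinference/cache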

emb_model
curl 'http://127.0.0.1:9997/v1/models' \
-H 'Accept: */*' \
-H 'Accept-Language: zh-CN,zh;q=0.9,en;q=0.8' \
-H 'Connection: keep-alive' \
-H 'Content-Type: application/json' \
-H 'Cookie: token=no_auth' \
-H 'Origin: http://127.0.0.1:9997' \
-H 'Referer: http://127.0.0.1:9997/ui/' \
-H 'Sec-Fetch-Dest: empty' \
-H 'Sec-Fetch-Mode: cors' \
-H 'Sec-Fetch-Site: same-origin' \
-H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36' \
-H 'sec-ch-ua: "Chromium";v="124", "Google Chrome";v="124", "Not-A.Brand";v="99"' \
-H 'sec-ch-ua-mobile: ?0' \
-H 'sec-ch-ua-platform: "Linux"' \
--data-raw '{"model_uid":"bge-large-zh-v1.5","model_name":"bge-large-zh-v1.5","model_type":"embedding","replica":1,"n_gpu":"auto","worker_ip":null,"gpu_idx":null}'
Commands to run the script
vim start_model_emb.shell #edit the shell script
chmod 777 start_model_emb.shell #grant execute permission
./start_model_emb.shell #run the script
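Once the embedding model is running, you can sanity-check it through the OpenAI-compatible embeddings endpoint. This is just a sketch; the model field must match the model_uid used in the launch request above:
# Ask the launched embedding model to embed a test sentence
curl 'http://127.0.0.1:9997/v1/embeddings' \
-H 'Content-Type: application/json' \
--data-raw '{"model":"bge-large-zh-v1.5","input":"hello world"}'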
Note: if you only have one GPU, it is recommended to launch the embedding model first. If you launch the LLM first, you may get an error telling you to launch the embedding model first. I am on a 4090 with 24 GB of VRAM and have verified that this order works.
