LLM
First, use vim to create a shell script, for example vim start_model.shell, adjust the command below for your own model and paste it in, then grant execute permission with chmod 777 start_model.shell and run the script. The specific commands are as follows:
vim start_model.shell #edit the shell script
chmod 777 start_model.shell #grant execute permission
./start_model.shell #run the shell script
curl 'http://127.0.0.1:9997/v1/models' \
-H 'Accept: */*' \
-H 'Accept-Language: zh-CN,zh;q=0.9,en;q=0.8' \
-H 'Connection: keep-alive' \
-H 'Content-Type: application/json' \
-H 'Cookie: token=no_auth' \
-H 'Origin: http://127.0.0.1:9997' \
-H 'Referer: http://127.0.0.1:9997/ui/' \
-H 'Sec-Fetch-Dest: empty' \
-H 'Sec-Fetch-Mode: cors' \
-H 'Sec-Fetch-Site: same-origin' \
-H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36' \
-H 'sec-ch-ua: "Chromium";v="124", "Google Chrome";v="124", "Not-A.Brand";v="99"' \
-H 'sec-ch-ua-mobile: ?0' \
-H 'sec-ch-ua-platform: "Linux"' \
--data-raw '{"model_uid":null,"model_name":"qwen2-instruct","model_type":"LLM","model_engine":"Vllm","model_format":"pytorch","model_size_in_billions":7,"quantization":"none","n_gpu":"auto","replica":1,"request_limits":null,"worker_ip":null,"gpu_idx":null}'
Explanation of each part
URL:
curl 'http://127.0.0.1:9997/v1/models'
Request headers:
-H 'Accept: */*'
-H 'Accept-Language: zh-CN,zh;q=0.9,en;q=0.8'
-H 'Connection: keep-alive'
-H 'Content-Type: application/json'
-H 'Cookie: token=no_auth'
-H 'Origin: http://127.0.0.1:9997'
-H 'Referer: http://127.0.0.1:9997/ui/'
-H 'Sec-Fetch-Dest: empty'
-H 'Sec-Fetch-Mode: cors'
-H 'Sec-Fetch-Site: same-origin'
-H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36'
-H 'sec-ch-ua: "Chromium";v="124", "Google Chrome";v="124", "Not-A.Brand";v="99"'
-H 'sec-ch-ua-mobile: ?0'
-H 'sec-ch-ua-platform: "Linux"'
Accept: content types the client can accept.
Accept-Language: languages the client can accept.
Connection: how the connection should be managed.
Content-Type: content type of the request body, here application/json.
Cookie: cookies sent along with the request.
Origin: origin of the request.
Referer: page the request came from.
Sec-Fetch-Dest, Sec-Fetch-Mode, Sec-Fetch-Site: security-related headers describing the request context.
User-Agent: the client's user-agent string.
sec-ch-ua, sec-ch-ua-mobile, sec-ch-ua-platform: client-hint headers describing the client's hardware and software environment.
Request body:
--data-raw '{"model_uid":null,"model_name":"qwen2-instruct","model_type":"LLM","model_engine":"Vllm,"model_format":"pytorch","model_size_in_billions":7,"quantization":"none","n_gpu":"auto","replica":1,"request_limits":null,"worker_ip":null,"gpu_idx":null}'
model_uid: unique identifier of the model.
model_name: name of the model.
model_type: type of the model (e.g. LLM for a large language model).
model_engine: inference engine used by the model (e.g. Transformers, Vllm, llama.cpp; choose according to your model).
model_format: format of the model (e.g. pytorch, ggufv2).
model_size_in_billions: size of the model in billions of parameters (use your model's parameter count; for example, I use qwen2-7b-instruct, so this is 7).
quantization: quantization scheme of the model (e.g. none, 8-bit, 4-bit).
n_gpu: number of GPUs to use (in my testing, only auto works).
replica: number of replicas of the model.
request_limits: request limits (null here).
worker_ip: IP address of the worker node (null here).
gpu_idx: GPU index (null here).
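After the launch request returns successfully, you can list the models that are currently running with a plain GET on the same endpoint (a quick sanity check, no request body needed):
# List the models currently running on the Xinference server
curl 'http://127.0.0.1:9997/v1/models'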
To choose model_name, go into the cache directory under the xinference directory, where you can see the available model names.
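For example, assuming the default XINFERENCE_HOME of ~/.xinference (adjust the path if you have changed it), the downloaded models can be listed with:
# Downloaded models are cached here; the directory names correspond to the model names
ls ~/.xinference/cache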

emb_model
curl 'http://127.0.0.1:9997/v1/models' \
-H 'Accept: */*' \
-H 'Accept-Language: zh-CN,zh;q=0.9,en;q=0.8' \
-H 'Connection: keep-alive' \
-H 'Content-Type: application/json' \
-H 'Cookie: token=no_auth' \
-H 'Origin: http://127.0.0.1:9997' \
-H 'Referer: http://127.0.0.1:9997/ui/' \
-H 'Sec-Fetch-Dest: empty' \
-H 'Sec-Fetch-Mode: cors' \
-H 'Sec-Fetch-Site: same-origin' \
-H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36' \
-H 'sec-ch-ua: "Chromium";v="124", "Google Chrome";v="124", "Not-A.Brand";v="99"' \
-H 'sec-ch-ua-mobile: ?0' \
-H 'sec-ch-ua-platform: "Linux"' \
--data-raw '{"model_uid":"bge-large-zh-v1.5","model_name":"bge-large-zh-v1.5","model_type":"embedding","replica":1,"n_gpu":"auto","worker_ip":null,"gpu_idx":null}'
Commands to run the script
vim start_model_emb.shell #edit the shell script
chmod 777 start_model_emb.shell #grant execute permission
./start_model_emb.shell #run the script
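Once the embedding model is running, you can sanity-check it through the OpenAI-compatible embeddings endpoint. This is just a sketch; the model field must match the model_uid used in the launch request above:
# Ask the launched embedding model to embed a test sentence
curl 'http://127.0.0.1:9997/v1/embeddings' \
-H 'Content-Type: application/json' \
--data-raw '{"model":"bge-large-zh-v1.5","input":"hello world"}'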
Note: if you only have one GPU, it is recommended to launch the embedding model first. If you launch the LLM first, you may get an error telling you to launch the embedding model first. I am on a 4090 with 24 GB of VRAM and have verified that this order works.
