Notes on deploying DeepSeek-R1 locally with ktransformer

starvapour

License: CC 4.0 BY-SA

Article link: https://blog.youkuaiyun.com/starvapour/article/details/146034905

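These notes cover two setups: one based on uv (a venv-style virtual environment) and one based on conda. Pick either. For automated startup scripts, the uv-based setup is probably easier to drive from an sh script.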

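Final test result: the current ktransformer version fails on multiple GPUs with an error I have not yet worked out how to fix. It runs on a single GPU, but one conversation turn can take several minutes. That is fine for personal tinkering, but the response time is too long for building any real functionality.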

1. Full Docker server-side installation based on venv

Reference: DeepSeek-R1 671B on ktransformers with open-webui

1.1 Create the Docker container

docker run --name llm_server --runtime=nvidia --gpus all -p 3000:8080 -v /hdd/llm:/home -it nvidia/cuda:12.1.0-cudnn8-devel-ubuntu20.04 bash

1.2 Basic environment setup

apt-get update
apt-get install -y sudo curl screen systemctl vim git language-pack-zh-hans build-essential cmake
locale-gen zh_CN.UTF-8
echo "export LC_ALL=zh_CN.UTF-8">> ~/.bashrc
source ~/.bashrc

Configure screen to start bash by default

vi ~/.screenrc

Add the following to the config file:

defshell -bash

Save and exit.
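
Since the uv-based setup is meant to be scriptable, the interactive vi step can also be done non-interactively. The following is a minimal sketch, not part of screen itself: `ensure_line` is a hypothetical helper name that appends a line only when the file does not already contain it, so re-running a setup script does not duplicate the entry.

```shell
# Append a line to a file only if that exact line is not already present.
# ensure_line is a hypothetical helper, not a screen or system command.
ensure_line() {
    local file="$1" line="$2"
    touch "$file"
    # -q quiet, -x whole-line match, -F fixed string; "--" protects lines starting with "-"
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Usage:
#   ensure_line ~/.screenrc 'defshell -bash'
```

Calling it twice with the same arguments leaves a single `defshell -bash` line in the file.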

1.3 Install the uv toolchain

curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
which uv
which uvx

1.4 Create the Python virtual environment

uv venv /usr/local/llm_server --python 3.11 --python-preference=only-managed

Python version: 3.11.11
Virtual environment path: /usr/local/llm_server
Activation command: source /usr/local/llm_server/bin/activate

1.5 Install open-webui

source /usr/local/llm_server/bin/activate
uv pip install open-webui -i https://mirrors.aliyun.com/pypi/simple

1.6 Install the libraries ktransformers depends on

git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers
git checkout 7a19f3b
git rev-parse --short HEAD
git submodule init
git submodule update

source /usr/local/llm_server/bin/activate
uv pip install -r requirements-local_chat.txt -i https://mirrors.aliyun.com/pypi/simple
uv pip install setuptools wheel packaging -i https://mirrors.aliyun.com/pypi/simple

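1.7 Check the hardware configuration

Check the number of sockets (with 2 sockets, use export USE_NUMA=1 when compiling)

lscpu | grep 'Socket(s):'

Check the number of CPU cores (128 here)

lscpu | grep '^CPU(s):'

1.8 Configure the core count

Set these to the system's physical CPU core count minus a few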

export MAX_JOBS=64
export CMAKE_BUILD_PARALLEL_LEVEL=64
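
The two checks in 1.7 and the "physical cores minus a few" rule in 1.8 can be combined in a script. This is a sketch under stated assumptions: `parse_lscpu` and `build_jobs` are hypothetical helper names, physical cores are taken as Socket(s) times Core(s) per socket, and the reserve of 4 cores is an arbitrary illustrative default rather than a value from these notes.

```shell
# Read `lscpu` output on stdin and print "<sockets> <physical cores>".
# Physical cores = Socket(s) x Core(s) per socket.
# Leading whitespace is allowed because newer lscpu versions indent these rows.
parse_lscpu() {
    awk -F: '
        /^[ \t]*Socket\(s\)/          { gsub(/[ \t]/, "", $2); sockets = $2 }
        /^[ \t]*Core\(s\) per socket/ { gsub(/[ \t]/, "", $2); cores = $2 }
        END { print sockets, sockets * cores }
    '
}

# Print a job count: physical cores minus a reserve, never below 1.
# The default reserve of 4 is an assumption for illustration.
build_jobs() {
    local cores="$1" reserve="${2:-4}"
    local jobs=$((cores - reserve))
    if [ "$jobs" -lt 1 ]; then
        jobs=1
    fi
    echo "$jobs"
}

# Usage on the real machine (LC_ALL=C keeps the lscpu labels in English):
#   set -- $(LC_ALL=C lscpu | parse_lscpu)
#   if [ "$1" -ge 2 ]; then export USE_NUMA=1; fi
#   export MAX_JOBS="$(build_jobs "$2")"
#   export CMAKE_BUILD_PARALLEL_LEVEL="$MAX_JOBS"
```

Forcing `LC_ALL=C` matters here because the environment above sets a zh_CN locale, and the sketch matches the English row labels.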

1.9 Install flash_attn

uv pip install flash_attn --no-build-isolation -i https://mirrors.aliyun.com/pypi/simple
export UV_LINK_MODE=copy
uv pip install flash_attn --no-build-isolation -i https://mirrors.aliyun.com/pypi/simple

1.10 Compile and install ktransformers

export USE_NUMA=1
USE_NUMA=1 KTRANSFORMERS_FORCE_BUILD=TRUE uv pip install . --no-build-isolation -i https://mirrors.aliyun.com/pypi/simple

uv pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 -i https://mirrors.aliyun.com/pypi/simple
uv pip install ktransformers-0.2.2rc2+cu121torch23avx2-cp311-cp311-linux_x86_64.whl -i https://mirrors.aliyun.com/pypi/simple

1.11 Install flashinfer

git clone https://github.com/flashinfer-ai/flashinfer.git --recursive
cd flashinfer
FLASHINFER_ENABLE_AOT=1 uv pip install -e . -v -i https://mirrors.aliyun.com/pypi/simple

1.12 Start ktransformers

source /usr/local/llm_server/bin/activate
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
python3 /home/projects/ktransformers/ktransformers/server/main.py     \
--gguf_path /home/models/DeepSeek-R1-UD-Q2_K_XL     \
--model_path /home/models/DeepSeek-R1     \
--model_name unsloth/DeepSeek-R1-UD-Q2_K_XL     \
--cpu_infer 96     \
--max_new_tokens 1024    \
--cache_q4 true     \
--temperature 0.6     \
--top_p 0.95     \
--optimize_config_path /home/projects/ktransformers/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml    \
--force_think     \
--use_cuda_graph     \
--host 0.0.0.0     \
--port 10002

1.13 Test that the API service is running properly:

curl http://0.0.0.0:10002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "DeepSeek-R1",
    "messages": [{"role": "user", "content": "Who are you?"}]
  }'
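
The request above can only succeed once the server is accepting connections. As a minimal sketch, assuming the server needs some time after launch before it answers, a startup script can poll before sending the first request. `wait_until` is a hypothetical helper name, and the retry count and delay in the usage line are arbitrary.

```shell
# Retry a command until it succeeds or the attempts run out.
# wait_until is a hypothetical helper; arguments: <tries> <delay seconds> <command...>
wait_until() {
    local tries="$1" delay="$2"
    shift 2
    while [ "$tries" -gt 0 ]; do
        if "$@"; then
            return 0
        fi
        tries=$((tries - 1))
        # Sleep only if another attempt follows.
        if [ "$tries" -gt 0 ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Usage: poll the ktransformers port for up to 60 attempts, 10 seconds apart.
#   wait_until 60 10 curl -s -o /dev/null http://0.0.0.0:10002/ && echo "API is up"
```

Without `-f`, curl exits with 0 as soon as it receives any HTTP response, so the probe only checks that something is listening on the port, not that the model answers correctly.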

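1.14 Install ollama

If the machine you are installing on has a working proxy for reaching the wider internet, you can use the official one-line install script:

curl -fsSL https://ollama.com/install.sh | sh

If not, you have to go through the manual steps one by one, because the one-line script fails with a network timeout.

First, download the ollama archive manually and put it on a disk visible inside the Docker container:
https://ollama.com/download/ollama-linux-amd64.tgz
Extract the archive into the system install path:

sudo tar -C /usr -xzf ollama-linux-amd64.tgz

Add it as an auto-start service. Create the ollama user and group first: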

sudo useradd -r -s /bin/false -U -m -d /usr/share/ollama ollama
sudo usermod -a -G ollama $(whoami)

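Create the service file:

vi /etc/systemd/system/ollama.service

Add the following to the file and save (press i to enter insert mode, right-click to paste, then press Esc, type :wq, and press Enter to exit):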

[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=$PATH"
# Make sure this folder has sufficient permissions! (systemd does not allow a comment at the end of a line)
Environment="OLLAMA_MODELS=/home/models/ollama_model_cache"

[Install]
WantedBy=default.target

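If the service fails to start, check that the folder for storing models has sufficient permissions!
Reload systemd and enable the service: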

sudo systemctl daemon-reload
sudo systemctl enable ollama
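
For a scripted install, the service file can be written without opening vi. A minimal sketch, assuming the same unit content as above: `write_ollama_unit` is a hypothetical helper name that takes the target path and the model folder as arguments, so it can be tried against a scratch file before touching /etc.

```shell
# Write an ollama systemd unit to the given path.
# write_ollama_unit is a hypothetical helper; arguments: <unit path> <models dir>
write_ollama_unit() {
    local unit="$1" models="$2"
    # The heredoc is unquoted on purpose: ${PATH} and ${models} expand when the
    # file is written, so PATH becomes the PATH of the shell running this function.
    cat > "$unit" <<EOF
[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=${PATH}"
Environment="OLLAMA_MODELS=${models}"

[Install]
WantedBy=default.target
EOF
}

# Usage (run as root):
#   write_ollama_unit /etc/systemd/system/ollama.service /home/models/ollama_model_cache
```

Keeping the path as a parameter is only a convenience for testing; on the real system it would be the same /etc/systemd/system/ollama.service file edited above.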

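1.15 Start ollama

Start ollama (if a later run reports that ollama is not running, start it with this command)

sudo systemctl start ollama

Check ollama's running status

sudo systemctl status ollama

An sh script that monitors whether the service is dead and restarts it automatically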

#!/bin/bash

SERVICE="ollama"

while true; do
    if systemctl is-active --quiet $SERVICE; then
        echo "$(date): $SERVICE is running."
    else
        echo "$(date): $SERVICE is not running. Starting $SERVICE..."
        systemctl start $SERVICE
    fi
    sleep 60
done

1.16 Configure the open-webui startup script

export ENABLE_OLLAMA_API=True
export ENABLE_OPENAI_API=True
export OPENAI_API_KEY="dont_change_this_cuz_openai_is_the_mcdonalds_of_ai"
export OPENAI_API_BASE_URL="http://0.0.0.0:10002/v1" # <--- must match the API configuration of ktransformers/llama.cpp
#export DEFAULT_MODELS="openai/foo/bar" # <--- keep this commented out; the parameter is for `litellm` integration
export WEBUI_AUTH=False
export DEFAULT_USER_ROLE="admin"
export HOST=0.0.0.0
export PORT=8080 # <--- open-webui web service port

open-webui serve \
  --host $HOST \
  --port $PORT
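
The base URL in this script has to match the host and port ktransformers was started with. One way to keep them in sync, sketched here with hypothetical names (`KT_HOST`, `KT_PORT`, and `api_base_url` do not appear in open-webui or ktransformers), is to derive the URL from a single pair of variables instead of typing it twice.

```shell
# KT_HOST / KT_PORT are hypothetical variable names for the address
# ktransformers listens on; the defaults mirror the launch command in these notes.
KT_HOST="${KT_HOST:-0.0.0.0}"
KT_PORT="${KT_PORT:-10002}"

# Build the OpenAI-compatible base URL from a host and a port.
api_base_url() {
    printf 'http://%s:%s/v1\n' "$1" "$2"
}

export OPENAI_API_BASE_URL="$(api_base_url "$KT_HOST" "$KT_PORT")"
```

The same two variables could then be passed as `--host "$KT_HOST" --port "$KT_PORT"` when launching the server.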

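1.17 open-webui interface settings

When using a resource-heavy model, go to the settings page and turn off automatic generation of titles and conversation tags; otherwise extra conversations are issued that hurt performance.
Under Settings > General > Advanced Parameters, turn on streaming of chat responses.

2. Docker deployment of ktransformer and webui based on conda

2.1 Create the Docker container

Port 11434 was mapped earlier for ollama compatibility; leaving it out is fine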

docker run --name ktransform_server --runtime=nvidia --gpus all -p 11434:11434 -p 3000:8080 -v /data_hdd1/shared_data/ollama:/home -it nvidia/cuda:12.0.0-devel-ubuntu22.04 bash

2.2 Install ktransformer

Install Anaconda

export SHELL=/bin/bash
./Anaconda3-2024.10-1-Linux-x86_64.sh
source ~/.bashrc

Install the basic dependencies

apt-get update
apt-get install -y sudo curl screen systemctl vim git language-pack-zh-hans
locale-gen zh_CN.UTF-8
echo "export LC_ALL=zh_CN.UTF-8">> ~/.bashrc
source ~/.bashrc

Configure screen to start bash

vi ~/.screenrc

Add the following to the config file:

defshell -bash

Create the Anaconda virtual environment

conda create -n ktransformer python==3.11
conda activate ktransformer

Clone the ktransformer project

git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers
git checkout 7a19f3b
git rev-parse --short HEAD # 7a19f3b
# update submodules after the checkout so they match the pinned commit
git submodule init
git submodule update

Install the dependencies; the whl files need to be downloaded from the corresponding GitHub pages

# -i is the short form of --index-url, so passing both leaves only the last one in effect; only the mirror is kept
pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 -i https://mirrors.aliyun.com/pypi/simple
# Download from: https://github.com/ubergarm/ktransformers/releases
pip install ktransformers-0.2.2rc2+cu121torch23avx2-cp311-cp311-linux_x86_64.whl -i https://mirrors.aliyun.com/pypi/simple
# Download from: https://github.com/mjun0812/flash-attention-prebuild-wheels/releases
pip install flash_attn-2.7.4.post1+cu12torch2.3cxx11abiFALSE-cp311-cp311-linux_x86_64.whl -i https://mirrors.aliyun.com/pypi/simple

Adjust the dependency library version

sudo apt update
sudo apt install software-properties-common
sudo add-apt-repository ppa:ubuntu-toolchain-r/test
sudo apt-get update
sudo apt-get install --only-upgrade libstdc++6
cp /usr/lib/x86_64-linux-gnu/libstdc++.so.6 /root/anaconda3/envs/ktransformer/lib/libstdc++.so.6
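
The notes do not say why the system libstdc++ is copied into the conda environment. One plausible reason, stated here as an assumption, is that the environment's own copy lacks a GLIBCXX symbol version that a prebuilt wheel needs. A minimal sketch for comparing the two libraries: `max_glibcxx` is a hypothetical helper that prints the highest GLIBCXX version found in a list of symbol names. It relies on GNU `sort -V`, and the usage lines assume `strings` from binutils is installed.

```shell
# Read text on stdin and print the highest GLIBCXX_x.y.z version found in it.
# max_glibcxx is a hypothetical helper; sort -V (version sort) is a GNU extension.
max_glibcxx() {
    grep -o 'GLIBCXX_[0-9][0-9.]*' | sort -u -V | tail -n 1
}

# Usage (assumes binutils provides `strings`):
#   strings /usr/lib/x86_64-linux-gnu/libstdc++.so.6 | max_glibcxx
#   strings /root/anaconda3/envs/ktransformer/lib/libstdc++.so.6 | max_glibcxx
```

If the two commands print the same version after the copy, the environment is using the upgraded library.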

Install flashinfer (this step seems to be optional; its effect on performance is unclear)

git clone https://github.com/flashinfer-ai/flashinfer.git --recursive
cd flashinfer
FLASHINFER_ENABLE_AOT=1 pip install -e . -v -i https://mirrors.aliyun.com/pypi/simple

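Create a /home/DeepSeek-R1 folder to hold the model configuration files. Download all the small files from DeepSeek-R1 on Hugging Face, such as config.json, into it, then download tokenizer_config.json and tokenizer.json from DeepSeek-V3 on Hugging Face into the same folder.
When starting the ktransformer API service, point model_path at this folder.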
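
Before launching the server it can help to confirm that the folder really contains the files named above. A minimal sketch: `check_model_dir` is a hypothetical helper, and it checks only the three files these notes mention by name (config.json, tokenizer_config.json, tokenizer.json), not the other small files from the repository.

```shell
# Report which of the expected config files are missing from a model folder.
# check_model_dir is a hypothetical helper; it prints one line per missing
# file and returns non-zero if any file is absent.
check_model_dir() {
    local dir="$1" missing=0 f
    for f in config.json tokenizer_config.json tokenizer.json; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f"
            missing=1
        fi
    done
    return "$missing"
}

# Usage:
#   check_model_dir /home/DeepSeek-R1 && echo "model_path looks complete"
```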

Download a quantized GGUF model and point gguf_path at its folder when running.
Download link 1: huggingface
Download link 2: modelscope

Start the API server

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
python3 /home/ktransformers/ktransformers/server/main.py \
  --gguf_path /home/DeepSeek-R1-UD-Q2_K_XL \
  --model_path /home/DeepSeek-R1 \
  --model_name unsloth/DeepSeek-R1-UD-Q2_K_XL \
  --cpu_infer 16 \
  --max_new_tokens 8192 \
  --cache_lens 32768 \
  --total_context 32768 \
  --cache_q4 true \
  --temperature 0.6 \
  --top_p 0.95 \
  --optimize_config_path /home/ktransformers/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml \
  --force_think \
  --use_cuda_graph \
  --host 0.0.0.0 \
  --port 10002

Use curl to quickly test that the ktransformer API service is running properly:

curl http://0.0.0.0:10002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "DeepSeek-R1",
    "messages": [{"role": "user", "content": "Introduce yourself"}]
  }'

2.3 Install open-webui

Create a virtual environment with Python >= 3.11

conda create -n open-webui python==3.12

Activate the virtual environment

conda activate open-webui

Install open-webui (this takes a long time)

pip install open-webui -i https://mirrors.aliyun.com/pypi/simple/

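Add environment variables to ~/.bashrc

export WEBUI_AUTH=False # turn off user registration
export ENABLE_OPENAI_API=0 # turn off the OpenAI connection (otherwise the page may stay blank for a long time)

Configure the .sh startup script for open-webui

# conda activate needs conda's shell hook when run from a script; /root/anaconda3 is the install path used above
source /root/anaconda3/etc/profile.d/conda.sh
conda activate open-webui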

export ENABLE_OLLAMA_API=False
export ENABLE_OPENAI_API=True
export OPENAI_API_KEY="dont_change_this_cuz_openai_is_the_mcdonalds_of_ai"
export OPENAI_API_BASE_URL="http://0.0.0.0:10002/v1" # <--- must match the API configuration of ktransformers/llama.cpp
#export DEFAULT_MODELS="openai/foo/bar" # <--- keep this commented out; the parameter is for `litellm` integration
export WEBUI_AUTH=False
export DEFAULT_USER_ROLE="admin"
export HOST=0.0.0.0
export PORT=8080 # <--- open-webui web service port

open-webui serve \
  --host $HOST \
  --port $PORT