Notes on deploying DeepSeek-R1 locally with ktransformers


There are two approaches: one based on uv and one based on conda; either works. Considering how convenient it is to script the startup, the uv-based approach is probably easier to wrap in a shell script.

Final test result: the current version of ktransformers hits errors on multi-GPU setups that I have not yet figured out how to fix. It does run on a single GPU, but one round of conversation can take several minutes. That is fine for personal tinkering, but the response time is too long to build anything practical on.

1. Full Docker server installation based on uv (venv)

Reference: DeepSeek-R1 671B + open-webui on ktransformers

1.1 Create the Docker container

docker run --name llm_server --runtime=nvidia --gpus all -p 3000:8080 -v /hdd/llm:/home -it nvidia/cuda:12.1.0-cudnn8-devel-ubuntu20.04 bash

1.2 Basic environment setup

apt-get update
apt-get install -y sudo curl screen systemctl vim git language-pack-zh-hans build-essential cmake
locale-gen zh_CN.UTF-8
echo "export LC_ALL=zh_CN.UTF-8">> ~/.bashrc
source ~/.bashrc

Configure screen to start bash by default:

vi ~/.screenrc

Add the following to the config file:

defshell -bash

Save and exit.

1.3 Install the uv toolchain

curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
which uv
which uvx

1.4 Create the Python virtual environment

uv venv /usr/local/llm_server --python 3.11 --python-preference=only-managed

Python version: 3.11.11
Virtual environment path: /usr/local/llm_server
Activation command: source /usr/local/llm_server/bin/activate

1.5 Install open-webui

source /usr/local/llm_server/bin/activate
uv pip install open-webui -i https://mirrors.aliyun.com/pypi/simple

1.6 Install the libraries ktransformers depends on

git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers
git checkout 7a19f3b
git rev-parse --short HEAD
git submodule init
git submodule update

source /usr/local/llm_server/bin/activate
uv pip install -r requirements-local_chat.txt -i https://mirrors.aliyun.com/pypi/simple
uv pip install setuptools wheel packaging -i https://mirrors.aliyun.com/pypi/simple

1.7 Check the hardware configuration

Check the number of CPU sockets (if there are 2 sockets, use export USE_NUMA=1 when compiling):

lscpu | grep 'Socket(s):'

Check the number of CPU cores (128 on this machine):

lscpu | grep '^CPU(s):'

1.8 Configure build parallelism

Set this to somewhat fewer than the number of physical CPU cores in the system:

export MAX_JOBS=64
export CMAKE_BUILD_PARALLEL_LEVEL=64
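
For scripting, the two exports above (and the USE_NUMA flag from sections 1.7/1.10) can also be derived automatically. A minimal sketch, assuming lscpu is available; halving the core count is just one reasonable choice, not something the original guide prescribes:

SOCKETS=$(LC_ALL=C lscpu | awk '/^Socket\(s\):/ {print $2}')   # LC_ALL=C keeps field names in English
CORES=$(LC_ALL=C lscpu | awk '/^CPU\(s\):/ {print $2}')

export MAX_JOBS=$((CORES / 2))                 # leave headroom for the rest of the system
export CMAKE_BUILD_PARALLEL_LEVEL=$MAX_JOBS

if [ "$SOCKETS" -ge 2 ]; then                  # NUMA-aware build on dual-socket machines
    export USE_NUMA=1
fi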

1.9 Install flash_attn

uv pip install flash_attn --no-build-isolation -i https://mirrors.aliyun.com/pypi/simple

If uv warns that it failed to hardlink files across filesystems, set UV_LINK_MODE=copy and run the install again:

export UV_LINK_MODE=copy
uv pip install flash_attn --no-build-isolation -i https://mirrors.aliyun.com/pypi/simple

1.10 Build and install ktransformers

export USE_NUMA=1
USE_NUMA=1 KTRANSFORMERS_FORCE_BUILD=TRUE uv pip install . --no-build-isolation -i https://mirrors.aliyun.com/pypi/simple

Alternatively, install PyTorch 2.3.0 and the prebuilt wheel (download links are given in section 2.2):

uv pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 -i https://mirrors.aliyun.com/pypi/simple
uv pip install ktransformers-0.2.2rc2+cu121torch23avx2-cp311-cp311-linux_x86_64.whl -i https://mirrors.aliyun.com/pypi/simple

1.11 Install flashinfer

git clone https://github.com/flashinfer-ai/flashinfer.git --recursive
cd flashinfer
FLASHINFER_ENABLE_AOT=1 uv pip install -e . -v -i https://mirrors.aliyun.com/pypi/simple

1.12 Start ktransformers

source /usr/local/llm_server/bin/activate
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
python3 /home/projects/ktransformers/ktransformers/server/main.py     \
--gguf_path /home/models/DeepSeek-R1-UD-Q2_K_XL     \
--model_path /home/models/DeepSeek-R1     \
--model_name unsloth/DeepSeek-R1-UD-Q2_K_XL     \
--cpu_infer 96     \
--max_new_tokens 1024    \
--cache_q4 true     \
--temperature 0.6     \
--top_p 0.95     \
--optimize_config_path /home/projects/ktransformers/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml    \
--force_think     \
--use_cuda_graph     \
--host 0.0.0.0     \
--port 10002

1.13 Test whether the API service is running properly:

curl http://0.0.0.0:10002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "DeepSeek-R1",
    "messages": [{"role": "user", "content": "你是谁!"}]
  }'
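
The endpoint is OpenAI-compatible, so a streamed response can also be requested (assuming the server supports streaming, which the open-webui setting in section 1.17 relies on):

curl http://0.0.0.0:10002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "DeepSeek-R1",
    "stream": true,
    "messages": [{"role": "user", "content": "Who are you?"}]
  }'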

1.14 Install ollama

If the machine doing the install has a way to reach the internet without restrictions, you can use the official one-line install script:

curl -fsSL https://ollama.com/install.sh | sh

If not, you will have to go through the manual steps one by one, because the one-line script will fail with network timeouts.

First, manually download the ollama tarball and copy it onto a disk inside the target Docker container:
https://ollama.com/download/ollama-linux-amd64.tgz
Extract the tarball into the system install path:

sudo tar -C /usr -xzf ollama-linux-amd64.tgz
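
A quick way (not part of the original steps) to confirm the binary landed on the PATH is to print its version:

ollama -v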

Add it as a service that starts automatically:

sudo useradd -r -s /bin/false -U -m -d /usr/share/ollama ollama
sudo usermod -a -G ollama $(whoami)

Create the service unit file:

vi /etc/systemd/system/ollama.service

Add the following content to the file and save (press i to enter insert mode, right-click to paste, then press Esc, type :wq and press Enter to quit):

[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=$PATH"
Environment="OLLAMA_MODELS=/home/models/ollama_model_cache" # 确保该文件夹有足够权限操作!

[Install]
WantedBy=default.target

If the service fails to start, check whether the model directory has sufficient permissions!
Reload systemd and enable the service:

sudo systemctl daemon-reload
sudo systemctl enable ollama

1.15 Start ollama

Start ollama (if a later run complains that ollama is not running, start it with this command):

sudo systemctl start ollama

Check ollama's running status:

sudo systemctl status ollama

Shell script that monitors whether the service is dead and restarts it automatically:

#!/bin/bash

SERVICE="ollama"

while true; do
    if systemctl is-active --quiet $SERVICE; then
        echo "$(date): $SERVICE is running."
    else
        echo "$(date): $SERVICE is not running. Starting $SERVICE..."
        systemctl start $SERVICE
    fi
    sleep 60
done
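
One way to keep this watchdog alive is to save it to a file (the path below is just an example) and launch it in a detached screen session:

chmod +x /root/ollama_watchdog.sh
screen -dmS ollama_watchdog /root/ollama_watchdog.sh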

1.16 Configure the open-webui startup script

export ENABLE_OLLAMA_API=True
export ENABLE_OPENAI_API=True
export OPENAI_API_KEY="dont_change_this_cuz_openai_is_the_mcdonalds_of_ai"
export OPENAI_API_BASE_URL="http://0.0.0.0:10002/v1" # <--- must match the ktransformers/llama.cpp API configuration
#export DEFAULT_MODELS="openai/foo/bar" # <--- keep commented out; this parameter is for `litellm` integration
export WEBUI_AUTH=False
export DEFAULT_USER_ROLE="admin"
export HOST=0.0.0.0
export PORT=8080 # <--- port of the open-webui web page

open-webui serve \
  --host $HOST \
  --port $PORT

1.17 open-webui UI settings

When using a model this resource-hungry, go to the settings page and disable automatic title generation and conversation tag generation; otherwise extra model calls are made and performance suffers.
Under Settings - General - Advanced Parameters, set "Stream Chat Response" to on.

2. Docker deployment of ktransformers and open-webui based on conda

2.1 Create the Docker container

Port 11434 was mapped earlier for ollama compatibility; it is fine to leave it out.

docker run --name ktransform_server --runtime=nvidia --gpus all -p 11434:11434 -p 3000:8080 -v /data_hdd1/shared_data/ollama:/home -it nvidia/cuda:12.0.0-devel-ubuntu22.04 bash

2.2 Install ktransformers

Install Anaconda (the Anaconda3-2024.10-1-Linux-x86_64.sh installer must already be downloaded):

export SHELL=/bin/bash
./Anaconda3-2024.10-1-Linux-x86_64.sh
source ~/.bashrc

Install basic dependencies:

apt-get update
apt-get install -y sudo curl screen systemctl vim git language-pack-zh-hans
locale-gen zh_CN.UTF-8
echo "export LC_ALL=zh_CN.UTF-8">> ~/.bashrc
source ~/.bashrc

Configure screen to start bash:

vi ~/.screenrc

Add the following to the config file:

defshell -bash

Create the Anaconda virtual environment:

conda create -n ktransformer python==3.11
conda activate ktransformer

Clone the ktransformers project:

git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers
git submodule init
git submodule update
git checkout 7a19f3b
git rev-parse --short HEAD # 7a19f3b

Install the dependencies; the .whl files need to be downloaded from the corresponding GitHub releases (links in the comments below):

pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu121 -i https://mirrors.aliyun.com/pypi/simple
# download from: https://github.com/ubergarm/ktransformers/releases
pip install ktransformers-0.2.2rc2+cu121torch23avx2-cp311-cp311-linux_x86_64.whl -i https://mirrors.aliyun.com/pypi/simple
# download from: https://github.com/mjun0812/flash-attention-prebuild-wheels/releases
pip install flash_attn-2.7.4.post1+cu12torch2.3cxx11abiFALSE-cp311-cp311-linux_x86_64.whl -i https://mirrors.aliyun.com/pypi/simple

Adjust dependent library versions (upgrade libstdc++ for the conda environment):

sudo apt update
sudo apt install software-properties-common
sudo add-apt-repository ppa:ubuntu-toolchain-r/test
sudo apt-get update
sudo apt-get install --only-upgrade libstdc++6
cp /usr/lib/x86_64-linux-gnu/libstdc++.so.6 /root/anaconda3/envs/ktransformer/lib/libstdc++.so.6
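
As a sanity check (not part of the original steps, and assuming binutils provides strings), you can confirm that the upgraded library actually exposes the newer GLIBCXX symbols both system-wide and inside the conda environment:

strings /usr/lib/x86_64-linux-gnu/libstdc++.so.6 | grep GLIBCXX | tail -n 5
strings /root/anaconda3/envs/ktransformer/lib/libstdc++.so.6 | grep GLIBCXX | tail -n 5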

Install flashinfer (this step seems to be optional; its impact on performance is unclear):

git clone https://github.com/flashinfer-ai/flashinfer.git --recursive
cd flashinfer
FLASHINFER_ENABLE_AOT=1 pip install -e . -v -i https://mirrors.aliyun.com/pypi/simple

Create a /home/DeepSeek-R1 directory to hold the model configuration files. From the DeepSeek-R1 repo on Hugging Face, download all the small files (config.json and so on) into this folder; then from the DeepSeek-V3 repo on Hugging Face, download tokenizer_config.json and tokenizer.json into the same folder.
When starting the ktransformers API service, point model_path at this directory.

Download the quantized GGUF model and point gguf_path at the corresponding directory at run time; a download sketch follows below.
Download source 1: Hugging Face
Download source 2: ModelScope
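
A minimal sketch of these downloads using huggingface-cli. The repo IDs (deepseek-ai/DeepSeek-R1, deepseek-ai/DeepSeek-V3, unsloth/DeepSeek-R1-GGUF) and include patterns are assumptions, so adjust them to whichever mirrors and quantizations you actually use:

pip install -U "huggingface_hub[cli]"

# small config/tokenizer files for --model_path
huggingface-cli download deepseek-ai/DeepSeek-R1 --include "*.json" --local-dir /home/DeepSeek-R1
huggingface-cli download deepseek-ai/DeepSeek-V3 --include "tokenizer.json" "tokenizer_config.json" --local-dir /home/DeepSeek-R1

# quantized GGUF shards for --gguf_path (lands in /home/DeepSeek-R1-UD-Q2_K_XL)
huggingface-cli download unsloth/DeepSeek-R1-GGUF --include "DeepSeek-R1-UD-Q2_K_XL/*" --local-dir /home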

Start the API server:

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
python3 /home/ktransformers/ktransformers/server/main.py \
  --gguf_path /home/DeepSeek-R1-UD-Q2_K_XL \
  --model_path /home/DeepSeek-R1 \
  --model_name unsloth/DeepSeek-R1-UD-Q2_K_XL \
  --cpu_infer 16 \
  --max_new_tokens 8192 \
  --cache_lens 32768 \
  --total_context 32768 \
  --cache_q4 true \
  --temperature 0.6 \
  --top_p 0.95 \
  --optimize_config_path /home/ktransformers/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml \
  --force_think \
  --use_cuda_graph \
  --host 0.0.0.0 \
  --port 10002

Quickly test whether the ktransformers API service is running properly with curl:

curl http://0.0.0.0:10002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "DeepSeek-R1",
    "messages": [{"role": "user", "content": "介绍你自己"}]
  }'

2.3 Install open-webui

Create a virtual environment with Python >= 3.11:

conda create -n open-webui python==3.12

Activate the virtual environment:

conda activate open-webui

Install open-webui (this takes quite a while):

pip install open-webui -i https://mirrors.aliyun.com/pypi/simple/ 

Add environment variables to ~/.bashrc:

export WEBUI_AUTH=False # disable user signup/authentication
export ENABLE_OPENAI_API=0 # disable the OpenAI connection (otherwise the page may stay blank for a long time)

Configure the open-webui .sh startup script:

conda activate open-webui

export ENABLE_OLLAMA_API=False
export ENABLE_OPENAI_API=True
export OPENAI_API_KEY="dont_change_this_cuz_openai_is_the_mcdonalds_of_ai"
export OPENAI_API_BASE_URL="http://0.0.0.0:10002/v1" # <--- must match the ktransformers/llama.cpp API configuration
#export DEFAULT_MODELS="openai/foo/bar" # <--- keep commented out; this parameter is for `litellm` integration
export WEBUI_AUTH=False
export DEFAULT_USER_ROLE="admin"
export HOST=0.0.0.0
export PORT=8080 # <--- port of the open-webui web page

open-webui serve \
  --host $HOST \
  --port $PORT

Run the startup script and you can chat with the model through the ktransformers API service.
