3. Advanced Hands-On: Quantized Deployment with LMDeploy
1. Setting Up the LMDeploy Environment
1.1 Creating an InternStudio Dev Machine and Setting Up the Environment
- How to estimate the model weight size
We want to run InternLM2.5 with 7B parameters. Looking up the config.json of internlm2_5-7b-chat in the InternLM2.5 repository shows that the model weights are stored in bfloat16 format.
...
"rope_theta": 1000000,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.41.0",
"use_cache": true,
"vocab_size": 92544,
"pretraining_tp": 1
}
For a model with 7B (7 billion) parameters, where each parameter is represented as a 16-bit float (2 bytes), the weight size is approximately:
7×10^9 parameters × 2 bytes/parameter = 14 GB
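The same back-of-the-envelope estimate as a small Python sketch (the parameter count 7e9 is the rounded figure used above; the byte widths per dtype are the standard ones):
def weight_size_gb(num_params: float, bytes_per_param: float) -> float:
    # weight memory = number of parameters x bytes per parameter
    return num_params * bytes_per_param / 1e9

print(weight_size_gb(7e9, 2))  # bf16/fp16: 2 bytes per parameter -> 14.0 GB
print(weight_size_gb(7e9, 4))  # fp32:      4 bytes per parameter -> 28.0 GB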
So we need more than 14 GB of GPU memory. Choose 30% of an A100 (24 GB of GPU memory), click Create Now, wait for the status to change to Running, then click into the dev machine and we can start the deployment.
- Create the lmdeploy conda environment
conda create -n lmdeploy python=3.10 -y
conda activate lmdeploy
conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=12.1 -c pytorch -c nvidia -y
pip install timm==1.0.8 openai==1.40.3 lmdeploy[all]==0.5.3
pip install datasets==2.19.2
1.2 Obtaining the Models in the InternStudio Environment
All model weights are placed under the /root/model/ directory.
mkdir /root/model
# Create symlinks to the dev machine's shared model directory
ln -s /root/share/new_models/Shanghai_AI_Laboratory/internlm2_5-7b-chat /root/model
ln -s /root/share/new_models/Shanghai_AI_Laboratory/internlm2_5-1_8b-chat /root/model
ln -s /root/share/new_models/OpenGVLab/InternVL2-26B /root/model
The links are placed under the /root/model directory here; linking into /root/models instead fails with a read-only file system error:
(lmdeploy) (list) root@intern-studio-50014188:~# ln -s /root/share/new_models/Shanghai_AI_Laboratory/internlm2_5-7b-chat /root/models
ln: failed to create symbolic link '/root/models/internlm2_5-7b-chat': Read-only file system
1.3 Verifying the Model Files with LMDeploy
Before the quantization work begins, we need to verify that the model files we obtained work correctly. Enter the conda environment we created and start InternLM2_5-7b-chat:
conda activate lmdeploy
lmdeploy chat /root/model/internlm2_5-7b-chat
Once it starts, you can chat with InternLM2.5 freely in the CLI.
double enter to end input >>> 请你给我 讲 一些冷笑话
<|im_start|>user
请讲一些冷笑话<|im_end|>
<|im_start|>assistant
好的,来一个简单的冷笑话:
为什么电脑不喜欢喝水?
因为它总是“忙”,没有时间。
希望这个笑话能让你会心一笑!
double enter to end input >>>
A note on GPU memory usage:
As noted above, InternLM2.5 7B is stored in bf16, so running it at bf16 precision with LMDeploy requires 14 GB of GPU memory for the weights alone. By default, LMDeploy sets cache-max-entry-count to 0.8, i.e. the kv cache takes 80% of the remaining GPU memory.
For a 24 GB card (30% of an A100): the weights take 14 GB, leaving 24 - 14 = 10 GB, so the kv cache takes 10 GB × 0.8 = 8 GB; together with the weights, the total is 14 + 8 = 22 GB.
For a 40 GB card (50% of an A100): the weights take 14 GB, leaving 40 - 14 = 26 GB, so the kv cache takes 26 GB × 0.8 = 20.8 GB; together with the weights, the total is 34.8 GB.
After the model is actually loaded, other items also consume some GPU memory, so the remaining memory is lower than the theoretical value and actual usage ends up slightly above 22 GB and 34.8 GB.
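The same bookkeeping as a short Python sketch (14 GB of bf16 weights, default cache ratio 0.8; the "other items" overhead is ignored, so real usage is slightly higher):
def estimated_usage_gb(total_gb: float, weights_gb: float = 14.0, cache_ratio: float = 0.8) -> float:
    # the kv cache takes cache_ratio of whatever memory is left after loading the weights
    kv_cache_gb = (total_gb - weights_gb) * cache_ratio
    return weights_gb + kv_cache_gb

print(estimated_usage_gb(24))  # 30% A100, 24 GB card -> 22.0 GB
print(estimated_usage_gb(40))  # 50% A100, 40 GB card -> 34.8 GB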
A note on monitoring GPU memory with nvidia-smi and studio-smi:
The lab environment uses virtualized GPU memory. nvidia-smi is part of the NVIDIA GPU driver and reports the state of the physical NVIDIA GPU, so in this environment it can only show the memory usage of the full 80 GB A100 and cannot observe the virtualized 30% or 50% A100 slices. For this reason, the lab provides the studio-smi command-line tool, which can observe the memory usage after virtualization.
2. LMDeploy and InternLM2.5
2.1 Deploying InternLM2.5 as an API with LMDeploy
In the previous section we deployed InternLM2.5 locally. In real applications, however, we often wrap a large model as an API service for clients to access.
2.1.1 Starting the API Server
First, enter the conda environment we created and start the API server with the following command to deploy the InternLM2.5 model:
conda activate lmdeploy
lmdeploy serve api_server \
/root/model/internlm2_5-7b-chat \
--model-format hf \
--quant-policy 0 \
--server-name 0.0.0.0 \
--server-port 23333 \
--tp 1
Command explanation:
- lmdeploy serve api_server: starts the API server.
- /root/model/internlm2_5-7b-chat: the path to the model.
- --model-format hf: the model format; hf stands for the "Hugging Face" format.
- --quant-policy 0: the quantization policy; 0 means no kv cache quantization.
- --server-name 0.0.0.0: the server address; 0.0.0.0 is a special address that binds to all network interfaces.
- --server-port 23333: the port the server will listen on, here 23333.
- --tp 1: the tensor parallelism degree (number of GPUs).
After configuring port forwarding, open http://127.0.0.1:23333 on your local machine; if you see the page shown below, the deployment succeeded.
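Port forwarding can be set up, for example, with an SSH tunnel from your local machine. The gateway address ssh.intern-ai.org.cn and the per-machine SSH port below are assumptions; use the connection details shown for your dev machine in the InternStudio console:
ssh -CNg -L 23333:127.0.0.1:23333 root@ssh.intern-ai.org.cn -p <your SSH port>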
2.1.2 Connecting to the API Server from the Command Line
Run the following commands to activate the conda environment and start the command-line client.
conda activate lmdeploy
lmdeploy serve api_client http://localhost:23333
Once it starts, you can chat in the CLI:
(lmdeploy) (list) root@intern-studio-50014188:~/agent_camp4# lmdeploy serve api_client http://localhost:23333
double enter to end input >>> 你好
你好!有什么我可以帮助你的吗?
double enter to end input >>> 介绍一下北京的5A级 景点
北京作为中国的首都,拥有众多5A级旅游景点,以下是其中一些著名的景点:
1. 故宫博物院:故宫是中国明清两代的皇宫,也是世界上最大的古代宫殿建筑群之一。它位于北京市中心,占地面积72万平方米,拥有9000多间房屋。
2. 长城:长城是中国古代的军事防御工程,是世界上最伟大的建筑之一。北京的八达岭长城是其中最著名的段落之一,吸引了大量国内外游客。
3. 颐和园:颐和园是中国古代皇家园林,位于北京市西郊,占地面积290公顷。它以其精美的园林设计和丰富的文化内涵而闻名。
4. 天坛:天坛是中国古代帝王祭天、祈谷的场所,位于北京市南部。它以其独特的建筑风格和宏伟的规模而闻名,是北京的重要旅游景点之一。
5. 鸟巢和水立方:鸟巢和水立方是2008年北京奥运会的主要场馆,分别用于开幕式和游泳比赛。它们以其独特的建筑设计而闻名,是北京的新地标。
除了以上景点,北京还有许多其他值得一游的地方,如颐和园、圆明园、南锣鼓巷等。希望这些信息能帮助你更好地了解北京的旅游资源。
double enter to end input >>>
2.1.3 Connecting to the API Server via a Gradio Web UI
Run the following command to start a web UI with Gradio as the frontend.
lmdeploy serve gradio http://localhost:23333 \
--server-name 0.0.0.0 \
--server-port 6006
After configuring port forwarding, you can access it from your local browser:
2.2 LMDeploy Lite – Task 1
As models keep getting larger, we need model-compression techniques to lower deployment cost and improve inference performance. LMDeploy provides two strategies: weight quantization and kv cache management.
2.2.1 Setting the Maximum KV Cache Size
The kv cache is a caching technique that stores key-value pairs so that computed results can be reused, improving performance and reducing memory consumption. In large-scale training and inference, the kv cache significantly reduces redundant computation and thus speeds up inference. Ideally, the kv cache lives entirely in GPU memory for fast access.
**At runtime, the GPU memory a model occupies can be roughly divided into three parts: the memory taken by the model parameters themselves, the memory taken by the kv cache, and the memory taken by intermediate computation results.** LMDeploy's kv cache manager controls the maximum fraction of the remaining GPU memory that the kv cache may occupy via the --cache-max-entry-count parameter, which defaults to 0.8.
Start the model with the cache-max-entry-count parameter set:
lmdeploy chat /root/model/internlm2_5-7b-chat --cache-max-entry-count 0.4
Resource usage:
Observing GPU memory usage: compared with the 23 GB used when InternLM2.5 runs normally, setting the kv cache ratio to 0.4 brings usage down to 19 GB, a reduction of 4 GB.
Let's work out where this 4 GB reduction comes from:
- Memory usage when starting the model directly (23 GB):
  - At BF16 precision, the 7B model weights take 14 GB: 7×10^9 parameters × 2 bytes/parameter = 14 GB
  - The kv cache takes 8 GB: the remaining memory is 24 - 14 = 10 GB, and the kv cache defaults to 80% of it, i.e. 10 × 0.8 = 8 GB
  - Other items take 1 GB
So 23 GB = 14 GB (weights) + 8 GB (kv cache) + 1 GB (other items).
- Memory usage after changing the kv cache ratio (19 GB):
  - As above, at BF16 precision the 7B model weights take 14 GB
  - The kv cache takes 4 GB: the remaining memory is 24 - 14 = 10 GB, and the kv cache is now set to 40% of it, i.e. 10 × 0.4 = 4 GB
  - Other items take 1 GB
So 19 GB = 14 GB (weights) + 4 GB (kv cache) + 1 GB (other items). The 4 GB saved is exactly 10 GB × 0.8 - 10 GB × 0.4 = 4 GB.
2.2.2 Enabling Online KV Cache INT4/INT8 Quantization
Since v0.4.0, LMDeploy has supported online kv cache int4/int8 quantization, using per-head, per-token asymmetric quantization. Applying kv quantization with LMDeploy is simple: just set the quant_policy and cache-max-entry-count parameters. Currently, LMDeploy defines quant_policy=4 as kv int4 quantization and quant_policy=8 as kv int8 quantization.
Start the API server:
lmdeploy serve api_server \
/root/model/internlm2_5-7b-chat \
--model-format hf \
--quant-policy 4 \
--cache-max-entry-count 0.4 \
--server-name 0.0.0.0 \
--server-port 23333 \
--tp 1
You can see that GPU memory usage is now about 19 GB, 4 GB less than the 23 GB used when starting the model directly in 1.3. The logic behind this 4 GB reduction is the same as in 2.2.1: it comes from setting the kv cache parameter cache-max-entry-count to 0.4.
So what is the difference between the 19 GB here and the 19 GB in 2.2.1?
Both cases use the internlm2.5 7B model at BF16 precision, so the remaining memory is 10 GB in both, and cache-max-entry-count is 0.4 in both, meaning LMDeploy allocates 40% of the remaining memory, i.e. 10 GB × 0.4 = 4 GB, to the kv cache. But when quant-policy is set to 4, the kv cache is quantized to int4: LMDeploy still pre-allocates the same 4 GB, but each value now needs only 4 bits instead of the 16 bits required by BF16, so the int4 cache can hold four times as many values as a BF16 cache of the same size.
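A minimal sketch of that four-fold difference (the same 4 GB of reserved cache, simply counting how many scalar values fit at each precision; quantization metadata such as scales is ignored):
cache_bytes = 4 * 1024**3          # 4 GiB reserved for the kv cache
bf16_values = cache_bytes / 2      # BF16: 2 bytes (16 bits) per value
int4_values = cache_bytes / 0.5    # int4: 0.5 bytes (4 bits) per value
print(int4_values / bf16_values)   # -> 4.0, i.e. the int4 cache holds 4x as many values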
2.2.3 W4A16 Model Quantization and Deployment
Strictly speaking, model quantization is an optimization technique aimed at reducing a model's size and speeding up its inference. It does so by converting the model's weights and activations from high precision (such as 16-bit floating point) to lower precision (such as 8-bit integers, 4-bit integers, or even binary networks).
So what does the W4A16 in the title mean?
- W4: the weights are quantized to 4-bit integers (int4). The weight parameters are converted from their original floating-point representation (e.g. FP32, BF16, or FP16; InternLM2.5 uses BF16) to a 4-bit integer representation, which significantly reduces the model size.
- A16: the activations (the data that flows through the network, produced after each layer) stay in 16-bit floating point (e.g. FP16 or BF16).
So a W4A16 quantization configuration means:
- the weights are quantized to 4-bit integers;
- the activations remain 16-bit floating point.
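For the 7B model used here, a rough estimate of what W4 does to the weight size (this assumes every weight is stored at 4 bits and ignores the per-group scales and zero-points that AWQ adds, so the real file is somewhat larger):
params = 7e9
bf16_weights_gb = params * 2 / 1e9    # 16-bit weights: ~14 GB
w4_weights_gb = params * 0.5 / 1e9    # 4-bit weights:  ~3.5 GB
print(bf16_weights_gb, w4_weights_gb)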
Back to LMDeploy: recent versions use the AWQ algorithm to perform 4-bit weight quantization. Run the following command to execute the quantization. (Not recommended to actually run: on InternStudio it takes about 8 hours.)
lmdeploy lite auto_awq \
/root/model/internlm2_5-7b-chat \
--calib-dataset 'ptb' \
--calib-samples 128 \
--calib-seqlen 2048 \
--w-bits 4 \
--w-group-size 128 \
--batch-size 1 \
--search-scale False \
--work-dir /root/models/internlm2_5-7b-chat-w4a16-4bit
Command explanation:
- lmdeploy lite auto_awq: lite is LMDeploy's subcommand for quantization, and auto_awq stands for automatic weight quantization (AWQ).
- /root/model/internlm2_5-7b-chat: the path to the model files.
- --calib-dataset 'ptb': the calibration dataset; here 'ptb' (Penn Treebank, a common language-modeling dataset).
- --calib-samples 128: the number of calibration samples (128).
- --calib-seqlen 2048: the sequence length used during calibration (2048).
- --w-bits 4: the weights are quantized to 4 bits.
- --work-dir /root/models/internlm2_5-7b-chat-w4a16-4bit: the working directory where the quantized model and intermediate results are stored.
For the assignment, please quantize the 1.8B model instead (running the following command is recommended):
lmdeploy lite auto_awq \
/root/model/internlm2_5-1_8b-chat \
--calib-dataset 'ptb' \
--calib-samples 128 \
--calib-seqlen 2048 \
--w-bits 4 \
--w-group-size 128 \
--batch-size 1 \
--search-scale False \
--work-dir /root/model/internlm2_5-1_8b-chat-w4a16-4bit
When the terminal output looks like the following, the calibration run is in progress; please wait a moment.
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:09<00:00, 4.85s/it]
Move model.tok_embeddings to GPU.
Move model.layers.0 to CPU.
Move model.layers.1 to CPU.
Move model.layers.2 to CPU.
Move model.layers.3 to CPU.
Move model.layers.4 to CPU.
Move model.layers.5 to CPU.
Move model.layers.6 to CPU.
Move model.layers.7 to CPU.
Move model.layers.8 to CPU.
Move model.layers.9 to CPU.
Move model.layers.10 to CPU.
Move model.layers.11 to CPU.
Move model.layers.12 to CPU.
Move model.layers.13 to CPU.
Move model.layers.14 to CPU.
Move model.layers.15 to CPU.
Move model.layers.16 to CPU.
Move model.layers.17 to CPU.
Move model.layers.18 to CPU.
Move model.layers.19 to CPU.
Move model.layers.20 to CPU.
Move model.layers.21 to CPU.
Move model.layers.22 to CPU.
Move model.layers.23 to CPU.
Move model.norm to GPU.
Move output to CPU.
Loading calibrate dataset ...
/root/list/lib/python3.11/site-packages/datasets/load.py:1491: FutureWarning: The repository for ptb_text_only contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/ptb_text_only
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
warnings.warn(
model.layers.0, samples: 128, max gpu memory: 3.49 GB
model.layers.1, samples: 128, max gpu memory: 4.49 GB
model.layers.2, samples: 128, max gpu memory: 4.49 GB
model.layers.3, samples: 128, max gpu memory: 4.49 GB
model.layers.4, samples: 128, max gpu memory: 4.49 GB
model.layers.5, samples: 128, max gpu memory: 4.49 GB
model.layers.6, samples: 128, max gpu memory: 4.49 GB
model.layers.7, samples: 128, max gpu memory: 4.49 GB
model.layers.8, samples: 128, max gpu memory: 4.49 GB
model.layers.9, samples: 128, max gpu memory: 4.49 GB
model.layers.10, samples: 128, max gpu memory: 4.49 GB
model.layers.11, samples: 128, max gpu memory: 4.49 GB
model.layers.12, samples: 128, max gpu memory: 4.49 GB
model.layers.13, samples: 128, max gpu memory: 4.49 GB
model.layers.14, samples: 128, max gpu memory: 4.49 GB
model.layers.15, samples: 128, max gpu memory: 4.49 GB
model.layers.16, samples: 128, max gpu memory: 4.49 GB
model.layers.17, samples: 128, max gpu memory: 4.49 GB
model.layers.18, samples: 128, max gpu memory: 4.49 GB
model.layers.19, samples: 128, max gpu memory: 4.49 GB
model.layers.20, samples: 128, max gpu memory: 4.49 GB
model.layers.21, samples: 128, max gpu memory: 4.49 GB
model.layers.22, samples: 128, max gpu memory: 4.49 GB
model.layers.23, samples: 128, max gpu memory: 4.49 GB
model.layers.0 smooth weight done.
model.layers.1 smooth weight done.
model.layers.2 smooth weight done.
model.layers.3 smooth weight done.
model.layers.4 smooth weight done.
model.layers.5 smooth weight done.
model.layers.6 smooth weight done.
model.layers.7 smooth weight done.
model.layers.8 smooth weight done.
model.layers.9 smooth weight done.
model.layers.10 smooth weight done.
model.layers.11 smooth weight done.
model.layers.12 smooth weight done.
model.layers.13 smooth weight done.
model.layers.14 smooth weight done.
model.layers.15 smooth weight done.
model.layers.16 smooth weight done.
model.layers.17 smooth weight done.
model.layers.18 smooth weight done.
model.layers.19 smooth weight done.
model.layers.20 smooth weight done.
model.layers.21 smooth weight done.
model.layers.22 smooth weight done.
model.layers.23 smooth weight done.
model.layers.0.attention.wqkv weight packed.
model.layers.0.attention.wo weight packed.
model.layers.0.feed_forward.w1 weight packed.
model.layers.0.feed_forward.w3 weight packed.
model.layers.0.feed_forward.w2 weight packed.
model.layers.1.attention.wqkv weight packed.
model.layers.1.attention.wo weight packed.
model.layers.1.feed_forward.w1 weight packed.
model.layers.1.feed_forward.w3 weight packed.
model.layers.1.feed_forward.w2 weight packed.
model.layers.2.attention.wqkv weight packed.
model.layers.2.attention.wo weight packed.
model.layers.2.feed_forward.w1 weight packed.
model.layers.2.feed_forward.w3 weight packed.
model.layers.2.feed_forward.w2 weight packed.
model.layers.3.attention.wqkv weight packed.
model.layers.3.attention.wo weight packed.
model.layers.3.feed_forward.w1 weight packed.
model.layers.3.feed_forward.w3 weight packed.
model.layers.3.feed_forward.w2 weight packed.
model.layers.4.attention.wqkv weight packed.
model.layers.4.attention.wo weight packed.
model.layers.4.feed_forward.w1 weight packed.
model.layers.4.feed_forward.w3 weight packed.
model.layers.4.feed_forward.w2 weight packed.
model.layers.5.attention.wqkv weight packed.
model.layers.5.attention.wo weight packed.
model.layers.5.feed_forward.w1 weight packed.
model.layers.5.feed_forward.w3 weight packed.
model.layers.5.feed_forward.w2 weight packed.
model.layers.6.attention.wqkv weight packed.
model.layers.6.attention.wo weight packed.
model.layers.6.feed_forward.w1 weight packed.
model.layers.6.feed_forward.w3 weight packed.
model.layers.6.feed_forward.w2 weight packed.
model.layers.7.attention.wqkv weight packed.
model.layers.7.attention.wo weight packed.
model.layers.7.feed_forward.w1 weight packed.
model.layers.7.feed_forward.w3 weight packed.
model.layers.7.feed_forward.w2 weight packed.
model.layers.8.attention.wqkv weight packed.
model.layers.8.attention.wo weight packed.
model.layers.8.feed_forward.w1 weight packed.
model.layers.8.feed_forward.w3 weight packed.
model.layers.8.feed_forward.w2 weight packed.
model.layers.9.attention.wqkv weight packed.
model.layers.9.attention.wo weight packed.
model.layers.9.feed_forward.w1 weight packed.
model.layers.9.feed_forward.w3 weight packed.
model.layers.9.feed_forward.w2 weight packed.
model.layers.10.attention.wqkv weight packed.
model.layers.10.attention.wo weight packed.
model.layers.10.feed_forward.w1 weight packed.
model.layers.10.feed_forward.w3 weight packed.
model.layers.10.feed_forward.w2 weight packed.
model.layers.11.attention.wqkv weight packed.
model.layers.11.attention.wo weight packed.
model.layers.11.feed_forward.w1 weight packed.
model.layers.11.feed_forward.w3 weight packed.
model.layers.11.feed_forward.w2 weight packed.
model.layers.12.attention.wqkv weight packed.
model.layers.12.attention.wo weight packed.
model.layers.12.feed_forward.w1 weight packed.
model.layers.12.feed_forward.w3 weight packed.
model.layers.12.feed_forward.w2 weight packed.
model.layers.13.attention.wqkv weight packed.
model.layers.13.attention.wo weight packed.
model.layers.13.feed_forward.w1 weight packed.
model.layers.13.feed_forward.w3 weight packed.
model.layers.13.feed_forward.w2 weight packed.
model.layers.14.attention.wqkv weight packed.
model.layers.14.attention.wo weight packed.
model.layers.14.feed_forward.w1 weight packed.
model.layers.14.feed_forward.w3 weight packed.
model.layers.14.feed_forward.w2 weight packed.
model.layers.15.attention.wqkv weight packed.
model.layers.15.attention.wo weight packed.
model.layers.15.feed_forward.w1 weight packed.
model.layers.15.feed_forward.w3 weight packed.
model.layers.15.feed_forward.w2 weight packed.
model.layers.16.attention.wqkv weight packed.
model.layers.16.attention.wo weight packed.
model.layers.16.feed_forward.w1 weight packed.
model.layers.16.feed_forward.w3 weight packed.
model.layers.16.feed_forward.w2 weight packed.
model.layers.17.attention.wqkv weight packed.
model.layers.17.attention.wo weight packed.
model.layers.17.feed_forward.w1 weight packed.
model.layers.17.feed_forward.w3 weight packed.
model.layers.17.feed_forward.w2 weight packed.
model.layers.18.attention.wqkv weight packed.
model.layers.18.attention.wo weight packed.
model.layers.18.feed_forward.w1 weight packed.
model.layers.18.feed_forward.w3 weight packed.
model.layers.18.feed_forward.w2 weight packed.
model.layers.19.attention.wqkv weight packed.
model.layers.19.attention.wo weight packed.
model.layers.19.feed_forward.w1 weight packed.
model.layers.19.feed_forward.w3 weight packed.
model.layers.19.feed_forward.w2 weight packed.
model.layers.20.attention.wqkv weight packed.
model.layers.20.attention.wo weight packed.
model.layers.20.feed_forward.w1 weight packed.
model.layers.20.feed_forward.w3 weight packed.
model.layers.20.feed_forward.w2 weight packed.
model.layers.21.attention.wqkv weight packed.
model.layers.21.attention.wo weight packed.
model.layers.21.feed_forward.w1 weight packed.
model.layers.21.feed_forward.w3 weight packed.
model.layers.21.feed_forward.w2 weight packed.
model.layers.22.attention.wqkv weight packed.
model.layers.22.attention.wo weight packed.
model.layers.22.feed_forward.w1 weight packed.
model.layers.22.feed_forward.w3 weight packed.
model.layers.22.feed_forward.w2 weight packed.
model.layers.23.attention.wqkv weight packed.
model.layers.23.attention.wo weight packed.
model.layers.23.feed_forward.w1 weight packed.
model.layers.23.feed_forward.w3 weight packed.
model.layers.23.feed_forward.w2 weight packed.
Note
If you hit the error TypeError: 'NoneType' object is not callable here, the cause is that datasets 3.0 cannot download the calibration dataset; adding pip install datasets==2.19.2 before the command fixes it.
Once the run finishes, you will find the quantized model files in the work directory you specified.
So how does the quantized model differ from the original? The two most obvious differences are the model file size and the GPU memory footprint.
Comparing model file sizes:
# Quantized model
(lmdeploy) (list) root@intern-studio-50014188:~# du -sh /root/model/internlm2_5-1_8b-chat-w4a16-4bit
1.5G /root/model/internlm2_5-1_8b-chat-w4a16-4bit
# Original model
(list) (base) root@intern-studio-50014188:~# du -sh /root/share/new_models/Shanghai_AI_Laboratory/internlm2_5-1_8b-chat
3.6G /root/share/new_models/Shanghai_AI_Laboratory/internlm2_5-1_8b-chat
GPU memory usage:
# Quantized model
(lmdeploy) (list) root@intern-studio-50014188:~# lmdeploy chat /root/model/internlm2_5-1_8b-chat-w4a16-4bit/ --model-format awq
chat_template_config:
ChatTemplateConfig(model_name='internlm2', system=None, meta_instruction=None, eosys=None, user=None, eoh=None, assistant=None, eoa=None, separator=None, capability='chat', stop_words=None)
engine_cfg:
TurbomindEngineConfig(model_name='/root/model/internlm2_5-1_8b-chat-w4a16-4bit/', model_format='awq', tp=1, session_len=32768, max_batch_size=1, cache_max_entry_count=0.8, cache_block_seq_len=64, enable_prefix_caching=False, quant_policy=0, rope_scaling_factor=0.0, use_logn_attn=False, download_dir=None, revision=None, max_prefill_token_num=8192, num_tokens_per_iter=0, max_prefill_iters=1)
[WARNING] gemm_config.in is not found; using default GEMM algo
double enter to end input >>>
# Check GPU memory usage in another terminal
(list) (base) root@intern-studio-50014188:~# studio-smi
Running studio-smi by vgpu-smi
Thu Jan 02 12:12:04 2025
+------------------------------------------------------------------------------+
| VGPU-SMI 1.10.33 Driver Version: 535.54.03 CUDA Version: 12.2 |
+-------------------------------------------+----------------------------------+
| GPU Name Bus-Id | Memory-Usage GPU-Util |
|===========================================+==================================|
| 0 NVIDIA A100-SXM... 00000000:B3:0... | 20196MiB / 24566MiB 0% / 30% |
+-------------------------------------------+----------------------------------+
+------------------------------------------------------------------------------+
| Processes: |
| GPU Memory GPU |
| GPU PID Type Process name Usage Util |
|==============================================================================|
| 0 139782 C /root/list/bin/python 20196MiB 0% |
+------------------------------------------------------------------------------+
# Model before quantization
(lmdeploy) (list) root@intern-studio-50014188:~# lmdeploy chat /root/model/internlm2_5-1_8b-chat
# Check GPU memory usage in another terminal
(list) (base) root@intern-studio-50014188:~# studio-smi
Running studio-smi by vgpu-smi
Thu Jan 02 12:21:28 2025
+------------------------------------------------------------------------------+
| VGPU-SMI 1.10.33 Driver Version: 535.54.03 CUDA Version: 12.2 |
+-------------------------------------------+----------------------------------+
| GPU Name Bus-Id | Memory-Usage GPU-Util |
|===========================================+==================================|
| 0 NVIDIA A100-SXM... 00000000:B3:0... | 20624MiB / 24566MiB 0% / 30% |
+-------------------------------------------+----------------------------------+
+------------------------------------------------------------------------------+
| Processes: |
| GPU Memory GPU |
| GPU PID Type Process name Usage Util |
|==============================================================================|
| 0 149884 C /root/list/bin/python 20624MiB 0% |
+------------------------------------------------------------------------------+
We can see that, compared with the 20624 MiB used before quantization, the W4A16-quantized model (20196 MiB) uses roughly 400+ MiB less GPU memory.
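Why is the saving only about 400 MiB when the weight files themselves shrink by roughly 2.1 GB (3.6 GB -> 1.5 GB)? Because the default cache-max-entry-count of 0.8 lets the kv cache grow into 80% of whatever memory the smaller weights free up, so only the remaining 20% shows up as a saving. A rough sketch:
weight_saving_gb = 3.6 - 1.5                       # smaller weights free up ~2.1 GB
visible_saving_gb = weight_saving_gb * (1 - 0.8)   # the kv cache absorbs 80% of it
print(visible_saving_gb)                           # -> ~0.42 GB, about the 400+ MiB observed above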
2.2.4 W4A16 Quantization + KV Cache + KV Cache Quantization
Run the following command to use the quantized model, set the kv cache ratio, and enable kv cache int4 quantization all at once.
lmdeploy serve api_server \
/root/model/internlm2_5-1_8b-chat-w4a16-4bit/ \
--model-format awq \
--quant-policy 4 \
--cache-max-entry-count 0.4 \
--server-name 0.0.0.0 \
--server-port 23333 \
--tp 1
GPU memory usage at this point:
(list) (base) root@intern-studio-50014188:~# studio-smi
Running studio-smi by vgpu-smi
Thu Jan 02 12:28:20 2025
+------------------------------------------------------------------------------+
| VGPU-SMI 1.10.33 Driver Version: 535.54.03 CUDA Version: 12.2 |
+-------------------------------------------+----------------------------------+
| GPU Name Bus-Id | Memory-Usage GPU-Util |
|===========================================+==================================|
| 0 NVIDIA A100-SXM... 00000000:B3:0... | 11364MiB / 24566MiB 0% / 30% |
+-------------------------------------------+----------------------------------+
+------------------------------------------------------------------------------+
| Processes: |
| GPU Memory GPU |
| GPU PID Type Process name Usage Util |
|==============================================================================|
| 0 156473 C /root/list/bin/python 11364MiB 0% |
+------------------------------------------------------------------------------+
GPU memory usage is now 11364 MiB.
3. LMDeploy and InternVL2
3.1 LMDeploy Lite
InternVL2-26B needs roughly 70+ GB of GPU memory. To run it on 30% of an A100, we have to quantize it first, which is exactly what quantization is for: lowering the cost of model deployment.
3.1.1 W4A16 Model Quantization and Deployment
For the InternVL series, enter the conda environment and run the following command to quantize the model. (This step takes quite a long time; please be patient.)
conda activate lmdeploy
lmdeploy lite auto_awq \
/root/model/InternVL2-26B \
--calib-dataset 'ptb' \
--calib-samples 128 \
--calib-seqlen 2048 \
--w-bits 4 \
--w-group-size 128 \
--batch-size 1 \
--search-scale False \
--work-dir /root/model/InternVL2-26B-w4a16-4bit
3.1.2 W4A16 Quantization + KV Cache + KV Cache Quantization
Run the following command to serve the quantized model.
lmdeploy serve api_server \
/root/model/InternVL2-26B-w4a16-4bit \
--model-format awq \
--quant-policy 4 \
--cache-max-entry-count 0.1 \
--server-name 0.0.0.0 \
--server-port 23333 \
--tp 1
After it starts, observe the GPU memory usage: only about 23.8 GB is needed, so the model can now be deployed on a single 30% A100.
According to the InternVL2 introduction, InternVL2-26B consists of a 6B ViT, a 100M MLP, and a 19.86B InternLM language model.
If you try image inference at this point, it will report that there is not enough GPU memory left, because PyTorch needs extra activation memory when processing images; image inference requires a 50% A100.
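A back-of-the-envelope check of these numbers, under explicit assumptions: that the 19.86B language-model weights end up at roughly 4 bits while the ViT and MLP stay at 16 bits, and that quantization metadata, the kv cache and runtime overhead are ignored:
vit, mlp, llm = 6e9, 0.1e9, 19.86e9               # parameter counts quoted above
print((vit + mlp + llm) * 2 / 1e9)                # ~52 GB of BF16 weights alone -> why 70+ GB is quoted at full precision
print((vit + mlp) * 2 / 1e9 + llm * 0.5 / 1e9)    # ~22 GB after W4A16, in the ballpark of the ~23.8 GB observed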
3.2 Deploying InternVL2 as an API with LMDeploy
Start the API server with the following command to deploy the InternVL2 model:
lmdeploy serve api_server \
/root/model/InternVL2-26B-w4a16-4bit/ \
--model-format awq \
--quant-policy 4 \
--cache-max-entry-count 0.1 \
--server-name 0.0.0.0 \
--server-port 23333 \
--tp 1
For the remaining steps, refer to the rest of section 2.1.
4. LMDeploy: FastAPI and Function Calling – Task 2
4.1 API Development
Enter the conda environment we created and start the API server with the following command.
conda activate lmdeploy
lmdeploy serve api_server \
/root/model/internlm2_5-1_8b-chat-w4a16-4bit \
--model-format awq \
--cache-max-entry-count 0.4 \
--quant-policy 4 \
--server-name 0.0.0.0 \
--server-port 23333 \
--tp 1
Open a new terminal and create a file named internlm2_5.py:
# Import the OpenAI class from the openai package; it is used to talk to an OpenAI-compatible API
from openai import OpenAI

# Create an OpenAI client instance; it needs an API key and the base URL of the API
client = OpenAI(
    api_key='YOUR_API_KEY',
    # Normally your OpenAI API key; since we are calling a local API, no real key is needed and any value will do
    base_url="http://0.0.0.0:23333/v1"
    # The base URL of the API; here it points to the local address and port
)

# Call client.models.list() to get all available models and take the ID of the first one
# models.list() returns a list of models, each with an id attribute
model_name = client.models.list().data[0].id

# Use client.chat.completions.create() to create a chat completion request
# This method takes several parameters describing the request
response = client.chat.completions.create(
    model=model_name,
    # The ID of the model to use
    messages=[
        # The message list; each dict in the list is one message
        {"role": "system", "content": "你是一个友好的小助手,负责解决问题."},
        # System message defining the assistant's behavior
        {"role": "user", "content": "帮我讲述一个关于狐狸和西瓜的小故事"},
        # User message asking for a short story about a fox and a watermelon
    ],
    temperature=0.8,
    # Controls the randomness of the generated text; higher values are more random
    top_p=0.8
    # Controls the diversity of the generated text via nucleus sampling; higher values are more diverse
)
# Print the API response
print(response.choices[0].message.content)
In the new terminal, activate the environment and run the Python script:
conda activate lmdeploy
python /root/internlm2_5.py
Check the output in the terminal:
4.2 Function call
Function calling allows developers to describe functions in detail when calling the model, so that the model can intelligently fill in the arguments and invoke a function based on the user's question. Once the call has been executed, the model uses the function's output as the basis for answering the user.
First, enter the conda environment we created and start the API server.
conda activate lmdeploy
lmdeploy serve api_server \
/root/model/internlm2_5-7b-chat \
--model-format hf \
--quant-policy 0 \
--server-name 0.0.0.0 \
--server-port 23333 \
--tp 1
As of version 0.5.3, LMDeploy supports this for three models, InternLM2, InternLM2.5, and Llama 3.1, so we use InternLM2.5 behind the API.
Create a file internlm2_5_func.py with the following content:
from openai import OpenAI
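# Local implementations of the two tools the model will be allowed to call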
def add(a: int, b: int):
return a + b
def mul(a: int, b: int):
return a * b
tools = [{
'type': 'function',
'function': {
'name': 'add',
'description': 'Compute the sum of two numbers',
'parameters': {
'type': 'object',
'properties': {
'a': {
'type': 'int',
'description': 'A number',
},
'b': {
'type': 'int',
'description': 'A number',
},
},
'required': ['a', 'b'],
},
}
}, {
'type': 'function',
'function': {
'name': 'mul',
'description': 'Calculate the product of two numbers',
'parameters': {
'type': 'object',
'properties': {
'a': {
'type': 'int',
'description': 'A number',
},
'b': {
'type': 'int',
'description': 'A number',
},
},
'required': ['a', 'b'],
},
}
}]
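# Ask the model to compute (3+5)*2; given the tool schemas above, it should split this into an add call followed by a mul call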
messages = [{'role': 'user', 'content': 'Compute (3+5)*2'}]
client = OpenAI(api_key='Bearer123', base_url='http://0.0.0.0:23333/v1')  # the api_key is changed here; the local API does not require a real key
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
model=model_name,
messages=messages,
temperature=0.8,
top_p=0.8,
stream=False,
tools=tools)
print(response)
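# Execute the first tool call (add) with the arguments chosen by the model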
func1_name = response.choices[0].message.tool_calls[0].function.name
func1_args = response.choices[0].message.tool_calls[0].function.arguments
func1_out = eval(f'{func1_name}(**{func1_args})')
print(func1_out)
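# Append the assistant reply and the tool result to the conversation so the model can plan the next step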
messages.append({
'role': 'assistant',
'content': response.choices[0].message.content
})
messages.append({
'role': 'environment',
'content': f'3+5={func1_out}',
'name': 'plugin'
})
response = client.chat.completions.create(
model=model_name,
messages=messages,
temperature=0.8,
top_p=0.8,
stream=False,
tools=tools)
print(response)
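# Execute the second tool call (mul) to obtain the final result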
func2_name = response.choices[0].message.tool_calls[0].function.name
func2_args = response.choices[0].message.tool_calls[0].function.arguments
func2_out = eval(f'{func2_name}(**{func2_args})')
print(func2_out)
Run the Python script with the following command:
(lmdeploy) (list) root@intern-studio-50014188:~# python /root/internlm2_5_func.py
ChatCompletion(id='2', choices=[Choice(finish_reason='tool_calls', index=0, logprobs=None, message=ChatCompletionMessage(content='I will call the API to calculate the result of the given expression.', refusal=None, role='assistant', function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='0', function=Function(arguments='{"a": 3, "b": 5}', name='add'), type='function')]))], created=1735868544, model='/root/model/internlm2_5-7b-chat', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=39, prompt_tokens=263, total_tokens=302))
8
ChatCompletion(id='3', choices=[Choice(finish_reason='tool_calls', index=0, logprobs=None, message=ChatCompletionMessage(content='The result of the first step is 8. Now I will call the API again to calculate the result of the entire expression.', refusal=None, role='assistant', function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='1', function=Function(arguments='{"a": 8, "b": 2}', name='mul'), type='function')]))], created=1735868545, model='/root/model/internlm2_5-7b-chat', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=51, prompt_tokens=296, total_tokens=347))
16
We can see that InternLM2.5 split the input 'Compute (3+5)*2' into an "add" step and a "multiply" step based on the provided functions: it first called the add function to perform the addition, then called the mul function to perform the multiplication, and finally produced the result 16.
If you run into the following problem, switch httpx to version 0.27.2 and retry:
(lmdeploy) (list) root@intern-studio-50014188:~# python /root/internlm2_5_func.py
Traceback (most recent call last):
File "/root/internlm2_5_func.py", line 55, in <module>
client = OpenAI(api_key='', base_url='http://0.0.0.0:23333/v1')
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/list/lib/python3.11/site-packages/openai/_client.py", line 123, in __init__
super().__init__(
File "/root/list/lib/python3.11/site-packages/openai/_base_client.py", line 843, in __init__
self._client = http_client or SyncHttpxClientWrapper(
^^^^^^^^^^^^^^^^^^^^^^^
File "/root/list/lib/python3.11/site-packages/openai/_base_client.py", line 741, in __init__
super().__init__(**kwargs)
TypeError: Client.__init__() got an unexpected keyword argument 'proxies'
(lmdeploy) (list) root@intern-studio-50014188:~#
(lmdeploy) (list) root@intern-studio-50014188:~# pip list | grep httpx
httpx 0.28.1
safehttpx 0.1.6
(lmdeploy) (list) root@intern-studio-50014188:~# pip install httpx==0.27.2