Section 5: Quantized Deployment of LLMs and VLMs with LMDeploy, Practice Notes and Homework

kding123

Last modified: 2024-04-22 16:02:41


Tags: pytorch, artificial intelligence, python

First published: 2024-04-16 18:02:04

Article link: https://blog.youkuaiyun.com/kding123/article/details/137689196


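This article covers key strategies for deploying large language models, such as model pruning and knowledge distillation, and the advantages of LMDeploy's TurboMind engine for inference speed and efficiency. It also shows how to chat with a model using the Transformers library and using LMDeploy. More advanced material on model quantization follows later.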


Notes


Challenges in deploying large language models


Deployment methods for large models

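Model pruning
Pruning removes unnecessary or redundant components of a model, such as parameters, to make it more efficient. Pruning redundant parameters that contribute little keeps the performance drop to a minimum while reducing storage requirements and improving computational efficiency.
Knowledge distillation (KD)
Knowledge distillation is a classic model compression method. Its core idea is to guide a lightweight student model to "imitate" a better-performing, structurally more complex teacher model, which improves the student's performance without changing its architecture.
Model quantization
Quantization converts the floating-point numbers used in the conventional representation into integers or other discrete forms, which reduces the storage and compute burden of deep learning models.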
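
To make the float-to-integer idea concrete, here is a minimal sketch of symmetric per-tensor int8 quantization in plain Python. It is only a generic illustration of the principle, not the algorithm LMDeploy uses; the function names `quantize_int8` and `dequantize` and the sample weights are made up for this example.

```python
def quantize_int8(values):
    """Map floats to integer codes in [-127, 127] that share a single scale."""
    max_abs = max(abs(v) for v in values)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero input.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


# Illustrative weights (hypothetical values, not taken from any real model).
weights = [0.12, -0.5, 0.33, 0.9, -0.07]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:", codes)
print("scale:", scale)
print("max round-trip error:", max_error)
```

Each weight is now stored as one small integer plus a shared scale, and the round-trip error stays within half a quantization step. That is the trade-off quantization makes: less storage and bandwidth in exchange for a bounded loss of precision.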

LMDeploy

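LMDeploy's TurboMind engine delivers strong inference performance: across models of various sizes, its request throughput (requests per second) is 1.36 to 1.85 times that of vLLM. For static inference, TurboMind's 4-bit inference speed (output tokens/s) is far higher than FP16/BF16 inference, reaching up to 2.4 times at small batch sizes.
Newer versions of lmdeploy add support for the multimodal model LLaVA, which can be run conveniently through the pipeline interface.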
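
As a rough sketch of what running LLaVA through the pipeline can look like, the snippet below wraps the `pipeline` and `load_image` calls from lmdeploy in a small helper. The helper name `describe_image`, the default prompt, and the command-line handling are assumptions made for illustration; the model path and image path must be supplied by the reader, and a GPU with enough memory for the chosen model is assumed.

```python
def describe_image(model_path, image_path, prompt="describe this image"):
    """Ask a vision-language model about one image via lmdeploy's pipeline."""
    # Imported lazily so this file can be loaded without lmdeploy installed.
    from lmdeploy import pipeline
    from lmdeploy.vl import load_image

    pipe = pipeline(model_path)       # builds the inference engine for the model
    image = load_image(image_path)    # accepts a local path or an image URL
    return pipe((prompt, image))      # a (prompt, image) tuple is one VLM request


if __name__ == "__main__":
    import sys

    # Usage (hypothetical script name): python llava_demo.py <model_path> <image_path>
    if len(sys.argv) == 3:
        print(describe_image(sys.argv[1], sys.argv[2]))
```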
Supported models: listed in a figure in the original post.

1. Chatting with a model using LMDeploy (chat)

Environment

conda activate lmdeploy

pip install lmdeploy[all]==0.3.0

1.1 Running the model with the Transformers library

Code:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("/root/models/internlm2-chat-1_8b", trust_remote_code=True)

# Set `torch_dtype=torch.float16` to load model in float16, otherwise it will be loaded as float32 and cause OOM Error.
model = AutoModelForCausalLM.from_pretrained("/root/models/internlm2-chat-1_8b", torch_dtype=torch.float16, trust_remote_code=True).cuda()
model = model.eval()

inp = "hello"
print("[INPUT]", inp)
response, history = model.chat(tokenizer, inp, history=[])
print("[OUTPUT]", response)

inp = "please provide three suggestions about time management"
print("[INPUT]", inp)
response, history = model.chat(tokenizer, inp, history=history)
print("[OUTPUT]", response)

Result: shown as a screenshot in the original post.
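
The comment in the code above says that loading in float32 can cause an OOM error. A back-of-the-envelope estimate of the weight memory shows why. The parameter count of 1.8 billion is an assumption inferred from the model name `internlm2-chat-1_8b`, and the estimate covers weights only, not activations or the KV cache.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


NUM_PARAMS = 1.8e9  # assumption: roughly 1.8B parameters, inferred from the model name

for name, bits in [("float32", 32), ("float16", 16), ("4-bit", 4)]:
    print(f"{name}: {weight_memory_gib(NUM_PARAMS, bits):.2f} GiB")
```

Halving the width of each weight halves the memory for weights, which is also why 4-bit quantization is attractive on small GPUs.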

1.2 Chatting with the model using LMDeploy

Command line:

# lmdeploy chat [path to HF-format model / path to TurboMind-format model]
lmdeploy chat /root/models/internlm2-chat-1_8b

The most noticeable difference is that responses come back faster.
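
To turn that impression into a number, one option is to time each backend on the same prompt. The helper below is a minimal sketch: `measure_throughput`, `fake_generate`, and the whitespace-based token count are all hypothetical stand-ins. In practice `generate` would wrap the Transformers `model.chat` call or an LMDeploy call, and the token count would come from the model's tokenizer.

```python
import time


def measure_throughput(generate, prompt, count_tokens):
    """Return (output tokens per second, elapsed seconds) for one generation call."""
    start = time.perf_counter()
    text = generate(prompt)
    elapsed = time.perf_counter() - start
    tokens = count_tokens(text)
    rate = tokens / elapsed if elapsed > 0 else float("inf")
    return rate, elapsed


def fake_generate(prompt):
    """Stand-in for a real model call; sleeps briefly and returns fixed text."""
    time.sleep(0.05)
    return "one two three four five"


# Whitespace splitting is only a placeholder for a real tokenizer.
rate, elapsed = measure_throughput(fake_generate, "hello", lambda t: len(t.split()))
print(f"{rate:.1f} tokens/s over {elapsed:.3f} s")
```

Running the same helper against both backends with identical prompts and generation settings gives a like-for-like comparison. A single short prompt is noisy, so averaging over several runs is advisable.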