
InternLM/lmdeploy: A Guide to the Offline Inference Pipeline for Large Language Models

lmdeploy: LMDeploy is a toolkit for compressing, deploying, and serving LLMs. Project page: https://gitcode.com/gh_mirrors/lm/lmdeploy

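Preface

Deploying large language models (LLMs) for inference is a key step in modern AI applications. The InternLM/lmdeploy project provides an efficient pipeline interface that lets developers run offline LLM inference with little setup. This article walks through how to use pipeline, to help developers get started quickly and resolve common problems.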

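Basic Pipeline Usage

Quick Start Example

Let's start with a minimal "Hello World" example:

from lmdeploy import pipeline

# Initialize the pipeline and load the internlm2_5-7b-chat model
pipe = pipeline('internlm/internlm2_5-7b-chat')

# Run inference on a batch of two prompts
response = pipe(['Hi, pls intro yourself', 'Shanghai is'])
print(response)

This example shows the most basic way to use pipeline. Note that pipeline manages GPU memory allocation automatically, in particular the memory that holds the key/value (k/v) cache produced during inference.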

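GPU Memory Management Strategy

InternLM/lmdeploy has changed its k/v cache allocation strategy across versions:

v0.2.0 to v0.2.1:
- By default, 50% of total GPU memory is allocated to the k/v cache
- For a 7B model, GPUs with less than 40G of memory may run out of memory (OOM)
- The allocation can be tuned with the cache_max_entry_count parameter
Versions after v0.2.1:
- The cache is allocated as a proportion of free GPU memory instead
- The default ratio is raised to 0.8
- Because the ratio applies to free rather than total memory, the cache size adapts to what is actually available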
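
The difference between the two strategies can be made concrete with a little arithmetic. The sketch below is not lmdeploy code: `kv_budget_total`, `kv_budget_free`, the 24 GB card, and the 14 GB weight footprint (roughly 7B parameters at 2 bytes each) are illustrative assumptions, as is the idea that "free" memory means what is left once the weights are loaded.

```python
def kv_budget_total(total_gb: float, ratio: float = 0.5) -> float:
    """v0.2.0-v0.2.1 style: a fraction of TOTAL GPU memory."""
    return total_gb * ratio


def kv_budget_free(total_gb: float, used_gb: float, ratio: float = 0.8) -> float:
    """Later style: a fraction of the memory that is still FREE."""
    return (total_gb - used_gb) * ratio


TOTAL_GB = 24.0    # assumed GPU size
WEIGHTS_GB = 14.0  # assumed footprint of 7B weights in 16-bit precision

old_cache = kv_budget_total(TOTAL_GB)
new_cache = kv_budget_free(TOTAL_GB, WEIGHTS_GB)

# Old rule: weights plus cache can exceed the card, which is the OOM case.
old_fits = WEIGHTS_GB + old_cache <= TOTAL_GB
# New rule: the cache is carved out of what is left over.
new_fits = WEIGHTS_GB + new_cache <= TOTAL_GB

print(f"old rule: cache={old_cache} GB, fits={old_fits}")
print(f"new rule: cache={new_cache} GB, fits={new_fits}")
```

Under these assumed numbers the old rule asks for more memory than the card has, while the new rule stays within it. Activations and other runtime buffers are ignored here, so real headroom is smaller.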

When you hit an OOM error, lower the k/v cache ratio like this:

from lmdeploy import pipeline, TurbomindEngineConfig

# Lower the k/v cache ratio to 20%
backend_config = TurbomindEngineConfig(cache_max_entry_count=0.2)

pipe = pipeline('internlm/internlm2_5-7b-chat',
                backend_config=backend_config)
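
To get a feel for what a given ratio buys, the cache budget can be converted into a token count. This is a rough sketch, assuming a 16-bit cache; the model dimensions (32 layers, 8 k/v heads, head size 128) and the 10 GiB of free memory are placeholders to check against the model's own config and your own GPU, and both helper functions are hypothetical.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    # One key vector and one value vector per layer and per k/v head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def cache_capacity_tokens(free_bytes, ratio, per_token_bytes):
    # Tokens that fit into the share of free memory given to the cache.
    return int(free_bytes * ratio) // per_token_bytes


GIB = 1024 ** 3
FREE_BYTES = 10 * GIB                       # assumed free memory
per_token = kv_bytes_per_token(32, 8, 128)  # assumed model dimensions

for ratio in (0.8, 0.5, 0.2):
    tokens = cache_capacity_tokens(FREE_BYTES, ratio, per_token)
    print(f"ratio={ratio}: room for about {tokens} cached tokens")
```

A smaller ratio leaves more memory for weights and activations, at the cost of fewer tokens that can be cached across all concurrent requests.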

Advanced Configuration

Multi-GPU Parallel Inference

For large models, inference can be spread across multiple GPUs:

from lmdeploy import pipeline, TurbomindEngineConfig

# Use 2 GPUs with tensor parallelism
backend_config = TurbomindEngineConfig(tp=2)
pipe = pipeline('internlm/internlm2_5-7b-chat',
                backend_config=backend_config)
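
Tensor parallelism splits each large weight matrix across the GPUs so that every device stores and multiplies only its own slice. The toy sketch below illustrates the column-split idea with plain Python lists. It is a conceptual illustration with made-up helper names (`matmul`, `split_columns`), not a description of how lmdeploy implements `tp`.

```python
def matmul(x, w):
    """Multiply a row vector x (length n) by an n-by-m matrix w."""
    return [sum(x[i] * w[i][j] for i in range(len(x)))
            for j in range(len(w[0]))]


def split_columns(w, parts):
    """Split the columns of w into `parts` equal shards, one per device."""
    step = len(w[0]) // parts
    return [[row[k * step:(k + 1) * step] for row in w]
            for k in range(parts)]


x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

shards = split_columns(w, parts=2)         # tp=2: each "GPU" holds half the columns
partials = [matmul(x, s) for s in shards]  # each device computes its slice of the output
combined = [v for part in partials for v in part]  # gather the slices in order

print(combined == matmul(x, w))
```

Each shard needs only half of the weight memory, which is why a model that does not fit on one card can fit on two. The gather step stands in for the inter-GPU communication that real tensor parallelism pays for.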

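Sampling Parameters

Control the quality and diversity of the generated text:

from lmdeploy import pipeline, GenerationConfig

pipe = pipeline('internlm/internlm2_5-7b-chat')
gen_config = GenerationConfig(
    top_p=0.8,           # nucleus sampling probability threshold
    top_k=40,            # number of highest-probability tokens to keep
    temperature=0.8,     # temperature, controls randomness
    max_new_tokens=1024  # maximum number of tokens to generate
)
response = pipe(['Hi, pls intro yourself'], gen_config=gen_config)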
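
How temperature, top_k, and top_p interact is easier to see on a toy distribution. The following is a minimal sketch of one common ordering (temperature scaling, then top-k, then top-p); the function `filtered_distribution` and the four example logits are invented for illustration, and the engine's actual sampling kernel may differ in detail.

```python
import math


def filtered_distribution(logits, temperature=0.8, top_k=40, top_p=0.8):
    """Return {token_index: probability} after temperature, top-k and top-p."""
    # Temperature: values below 1 sharpen the distribution, above 1 flatten it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep only the k most likely tokens.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    ranked = ranked[:top_k]

    # Top-p: keep the shortest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalise so the surviving tokens sum to 1.
    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}


toy_logits = [2.0, 1.0, 0.5, -1.0]
warm = filtered_distribution(toy_logits, temperature=1.0, top_k=3, top_p=0.8)
cold = filtered_distribution(toy_logits, temperature=0.5, top_k=3, top_p=0.8)
print(sorted(warm), sorted(cold))
```

With these toy logits, lowering the temperature concentrates enough probability on the top token that top-p keeps fewer candidates, which is the sense in which lower temperature means less randomness.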

OpenAI-Format Prompts

Prompts can also be passed as conversations in the OpenAI API message format:

prompts = [[{
    'role': 'user',
    'content': 'Hi, pls intro yourself'
}]]
response = pipe(prompts)
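
The outer list is a batch and each inner list is one conversation. A small helper can keep that nesting straight when building multi-turn prompts. This is only a sketch: `make_conversation`, `validate`, and the sample dialogue are hypothetical, and the set of roles assumed here (system, user, assistant) follows the usual OpenAI convention.

```python
VALID_ROLES = {'system', 'user', 'assistant'}


def make_conversation(user_text, system_text=None, history=()):
    """Build one conversation as a list of role/content messages."""
    messages = []
    if system_text is not None:
        messages.append({'role': 'system', 'content': system_text})
    for question, answer in history:
        messages.append({'role': 'user', 'content': question})
        messages.append({'role': 'assistant', 'content': answer})
    messages.append({'role': 'user', 'content': user_text})
    return messages


def validate(prompts):
    """Check the batch -> conversation -> message nesting before inference."""
    for conversation in prompts:
        for message in conversation:
            if message['role'] not in VALID_ROLES:
                raise ValueError(f"unknown role: {message['role']}")
            if not isinstance(message['content'], str):
                raise TypeError('content must be a string')
    return prompts


batch = validate([
    make_conversation('Hi, pls intro yourself'),
    make_conversation(
        'Summarize that in one sentence.',
        system_text='You are a concise assistant.',
        history=[('Tell me about Shanghai.', 'Shanghai is a large city in China.')],
    ),
])
print([len(conversation) for conversation in batch])
```

The resulting `batch` has the same shape as `prompts` above and could be passed to the pipeline in its place.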

Advanced Features

Streaming Output

For long generations, use streaming output:

for item in pipe.stream_infer(prompts):
    print(item)  # print each result as soon as it is produced
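
In an application the chunks are usually shown as they arrive and also collected into the final text. The sketch below shows that consumption pattern with a stand-in generator so that it runs without a GPU. `Chunk`, `fake_stream`, and `consume` are hypothetical, and the assumption that each streamed item carries its newly generated piece in a `text` attribute should be checked against the objects your lmdeploy version yields.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class Chunk:
    text: str  # assumed: the newly generated piece of text


def fake_stream(pieces: Iterable[str]) -> Iterator[Chunk]:
    """Stand-in for pipe.stream_infer(prompts); yields one chunk at a time."""
    for piece in pieces:
        yield Chunk(text=piece)


def consume(stream: Iterator[Chunk]) -> str:
    """Print chunks as they arrive and return the concatenated text."""
    parts = []
    for item in stream:
        print(item.text, end='', flush=True)  # show text immediately
        parts.append(item.text)
    print()
    return ''.join(parts)


full_text = consume(fake_stream(['Hello', ', ', 'world', '!']))
```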

Accessing Model Internals

Get the logits of the generated tokens:

gen_config = GenerationConfig(output_logits='generation', max_new_tokens=10)
response = pipe(['Hi'], gen_config=gen_config)
logits = [x.logits for x in response]  # logits for each generated token

Get the last hidden states:

gen_config = GenerationConfig(output_last_hidden_state='generation')
response = pipe(['Hi'], gen_config=gen_config)
hidden_states = [x.last_hidden_state for x in response]

Computing Perplexity (PPL)

Evaluate how well the model fits the input text:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('internlm/internlm2_5-7b-chat', trust_remote_code=True)
messages = [{"role": "user", "content": "Hello, how are you?"}]
input_ids = tokenizer.apply_chat_template(messages)

# Get the logits
logits = pipe.get_logits(input_ids)

# Compute perplexity (the value actually returned is the cross-entropy loss)
ppl = pipe.get_ppl(input_ids)
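
Since the returned value is a cross-entropy loss, perplexity in the usual sense is its exponential. The relationship is sketched below on toy logits in plain Python; `cross_entropy` and `perplexity` are illustrative helpers, not lmdeploy functions, and the mean-over-tokens reduction is an assumption.

```python
import math


def cross_entropy(logits, targets):
    """Mean negative log-likelihood of the targets under softmax(logits)."""
    total = 0.0
    for row, target in zip(logits, targets):
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_z - row[target]
    return total / len(targets)


def perplexity(loss):
    """Perplexity is the exponential of the mean cross-entropy."""
    return math.exp(loss)


# A model that is completely unsure among 4 tokens at every position.
uniform_logits = [[0.0, 0.0, 0.0, 0.0]] * 3
uniform_loss = cross_entropy(uniform_logits, targets=[0, 1, 2])

# A model that is very confident in the correct token.
confident_loss = cross_entropy([[10.0, 0.0, 0.0, 0.0]], targets=[0])

print(perplexity(uniform_loss), perplexity(confident_loss))
```

A uniform guess over N tokens gives a perplexity of N, and a confident correct prediction pushes it toward 1, so lower values mean the model fits the text better.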

Special Scenarios

Using the PyTorch Engine

Some scenarios call for using the PyTorch engine directly:

from lmdeploy import pipeline, PytorchEngineConfig

backend_config = PytorchEngineConfig(session_len=2048)
pipe = pipeline('internlm/internlm2_5-7b-chat',
                backend_config=backend_config)

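Note: install triton>=2.1.0 before use.

LoRA Model Inference

Lightweight model adaptation with LoRA is supported:

from lmdeploy import pipeline, PytorchEngineConfig

# Map an adapter name to the path of its LoRA weights
backend_config = PytorchEngineConfig(
    session_len=2048,
    adapters={'lora_name': 'path/to/lora'}
)
pipe = pipeline('base_model',
                backend_config=backend_config)
# Select the adapter by name at inference time
response = pipe(prompts, adapter_name='lora_name')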

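Releasing Resources

Release the resources held by a pipeline explicitly:

# Option 1: call close() explicitly
pipe.close()

# Option 2: let a with statement manage the lifetime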
with pipeline('internlm/internlm2_5-7b-chat') as pipe:
    response = pipe(['Hi'])
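
The practical advantage of the with form is that cleanup still happens when inference raises. The stand-in class below mimics only the open/close behaviour so the pattern can be run anywhere. `FakePipeline` and the model path are hypothetical and say nothing about how the real pipeline frees GPU memory.

```python
class FakePipeline:
    """Stand-in for a pipeline object; models only open/close behaviour."""

    def __init__(self, model_path):
        self.model_path = model_path
        self.closed = False

    def __call__(self, prompts):
        if self.closed:
            raise RuntimeError('pipeline is closed')
        return [f'reply to: {p}' for p in prompts]

    def close(self):
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # do not swallow the exception


pipe_ref = None
try:
    with FakePipeline('some/model') as fake_pipe:
        pipe_ref = fake_pipe
        raise ValueError('inference failed')  # simulate a failure mid-run
except ValueError:
    pass

print(pipe_ref.closed)
```

With an explicit close() call, the same guarantee requires wrapping the work in try/finally by hand.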

Troubleshooting

Multiprocessing errors:
- When using tp>1 with the PyTorch backend, make sure the script has a main entry point:
```
if __name__ == '__main__':
```
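Custom chat templates:
- Configure them by following the chat_template guide in the project documentation
Chat templates for LoRA weights:
- If the LoRA weights come with their own chat template, register it with lmdeploy first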
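Best Practices

- In production, manage the pipeline lifecycle with a with statement
- For long generations, prefer streaming output to improve the user experience
- On multi-GPU machines, choose tp carefully to balance computation against communication overhead
- Tune the generation parameters to the task to balance output quality against speed

With this guide, developers should have a full picture of the InternLM/lmdeploy pipeline features and be able to configure and optimize them for their own needs.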


Content notice: parts of this article were generated with AI assistance (AIGC) and are provided for reference only.