InternLM/LMDeploy Offline Inference Pipeline: A Technical Guide

lmdeploy: LMDeploy is a toolkit for compressing, deploying, and serving LLMs. Project repository: https://gitcode.com/gh_mirrors/lm/lmdeploy

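Overview

InternLM/LMDeploy is a toolkit for deploying large language models, and its pipeline module provides a convenient interface for offline inference. This article walks through the module's core features and usage so that developers can get started quickly and tune inference performance.

Basic Usage

Quick-Start Example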

from lmdeploy import pipeline

# Create the pipeline; this loads the model
pipe = pipeline('internlm/internlm2_5-7b-chat')

# Run batch inference on two prompts
response = pipe(['Hi, pls intro yourself', 'Shanghai is'])
print(response)

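This short example shows the most basic usage, but there is an important memory-management mechanism behind it. By default, LMDeploy reserves a fraction of GPU memory for the key/value (k/v) cache, and that fraction is controlled by the TurbomindEngineConfig.cache_max_entry_count parameter.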

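Key Techniques

Evolution of the Memory Management Strategy

LMDeploy has changed how it allocates memory for the k/v cache across versions:

v0.2.0 - v0.2.1
- By default, 50% of total GPU memory is allocated to the k/v cache.
- Running a 7B model on a GPU with less than 40G of memory may cause an out-of-memory (OOM) error.
- Fix: lower the cache ratio.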

from lmdeploy import pipeline, TurbomindEngineConfig

# Lower the k/v cache ratio to 20%
backend_config = TurbomindEngineConfig(cache_max_entry_count=0.2)
pipe = pipeline('internlm/internlm2_5-7b-chat',
               backend_config=backend_config)

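Versions later than v0.2.1
- The cache is allocated as a fraction of free GPU memory rather than total memory.
- The default ratio is 0.8.
- If an OOM error still occurs, lowering the ratio helps in the same way.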
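To make the difference between the two policies concrete, here is a minimal arithmetic sketch. The function name kv_cache_budget_gib and the memory figures are illustrative assumptions, not part of LMDeploy, and the sketch assumes that "free" memory means what remains after the model weights are loaded.

```python
def kv_cache_budget_gib(total_gib, used_gib, ratio, free_based):
    """Rough k/v cache budget in GiB under the two policies described above.

    free_based=False: ratio applies to total GPU memory (v0.2.0 - v0.2.1).
    free_based=True:  ratio applies to the memory that is still free
                      (versions later than v0.2.1).
    """
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")
    base = (total_gib - used_gib) if free_based else total_gib
    return base * ratio


# Hypothetical 24 GiB GPU with about 14 GiB already taken by model weights.
old_policy = kv_cache_budget_gib(24, 14, 0.5, free_based=False)
new_policy = kv_cache_budget_gib(24, 14, 0.8, free_based=True)
print(old_policy, new_policy)
```

With these assumed numbers, the older policy asks for 12 GiB of cache on top of 14 GiB of weights, which exceeds the 24 GiB card, while the newer policy asks for 8 GiB out of the 10 GiB that is free. This is why the same model can hit OOM on the old default and fit on the new one.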

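Parallel Computation Settings

For large models, tensor parallelism (TP) shards the model weights across several GPUs. This lets a model that does not fit on a single GPU run at all, and it can also raise throughput: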

from lmdeploy import pipeline, TurbomindEngineConfig

# Use 2-way tensor parallelism (requires 2 visible GPUs)
backend_config = TurbomindEngineConfig(tp=2)
pipe = pipeline('internlm/internlm2_5-7b-chat',
               backend_config=backend_config)
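As a rough illustration of why a higher tp value helps a model fit, the sketch below estimates weight memory per GPU. It assumes the weights are sharded evenly and ignores activations, the k/v cache, and any tensors that are replicated rather than split; the helper name weights_per_gpu_gib is hypothetical.

```python
def weights_per_gpu_gib(n_params, bytes_per_param=2, tp=1):
    """Approximate weight memory per GPU, assuming an even split across tp GPUs."""
    if tp < 1:
        raise ValueError("tp must be >= 1")
    return n_params * bytes_per_param / tp / 2**30


# A 7B-parameter model stored in 16-bit precision (2 bytes per parameter).
single = weights_per_gpu_gib(7e9, tp=1)
split = weights_per_gpu_gib(7e9, tp=2)
print(f"{single:.1f} GiB on one GPU, {split:.1f} GiB per GPU with tp=2")
```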

Advanced Features

Tuning Sampling Parameters

GenerationConfig gives fine-grained control over the quality and diversity of the generated text:

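from lmdeploy import pipeline, GenerationConfig

# Load the model; the same pipeline object is reused by later examples
pipe = pipeline('internlm/internlm2_5-7b-chat')

gen_config = GenerationConfig(
    top_p=0.8,            # nucleus sampling: cumulative probability threshold
    top_k=40,             # keep only the 40 most probable tokens
    temperature=0.8,      # values below 1 sharpen the distribution
    max_new_tokens=1024,  # upper bound on the number of generated tokens
)
response = pipe(['Hello'], gen_config=gen_config)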
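To show what temperature, top_k, and top_p do to a next-token distribution, here is a small pure-Python sketch. It is an illustration of the general technique, not LMDeploy's implementation; the function name filter_distribution and the toy logits are made up, and real engines may apply the filters in a different order.

```python
import math


def filter_distribution(logits, temperature=0.8, top_k=40, top_p=0.8):
    """Return {token_id: prob} after temperature, top-k, then top-p filtering."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # top-k: keep the k most probable token ids.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # top-p: keep the shortest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalise the surviving tokens so their probabilities sum to 1.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


dist = filter_distribution([2.0, 1.0, 0.5, -1.0], temperature=0.8, top_k=3, top_p=0.8)
print(dist)
```

In this toy case top_k=3 first drops the least likely token, and top_p=0.8 then keeps only the two most likely ones, so sampling is restricted to token ids 0 and 1.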

OpenAI-Style Prompts

The pipeline also accepts conversations in the message format used by the OpenAI API:

prompts = [[{
    'role': 'user',
    'content': 'Hi, pls intro yourself'
}]]
response = pipe(prompts)
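When the inputs start out as plain strings, a small helper can wrap them into this nested structure, one message list per conversation. The helper name to_openai_prompts and its optional system argument are assumptions made for this sketch.

```python
def to_openai_prompts(texts, system=None):
    """Wrap each plain string as a one-turn conversation in OpenAI message format."""
    batch = []
    for text in texts:
        messages = []
        if system is not None:
            # Optional system message placed before the user turn.
            messages.append({'role': 'system', 'content': system})
        messages.append({'role': 'user', 'content': text})
        batch.append(messages)
    return batch


prompts = to_openai_prompts(['Hi, pls intro yourself', 'Shanghai is'])
print(prompts)
```

The resulting list has the same shape as the prompts variable above, so it can be passed to pipe(prompts) in the same way.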

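Streaming Output

For long generations, streaming the output lets users see text as it is produced instead of waiting for the full response: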

for item in pipe.stream_infer(prompts):
    print(item)  # print each chunk as soon as it is generated
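When several prompts are streamed together, chunks for different prompts can arrive interleaved. The sketch below reassembles them, assuming each streamed item exposes an incremental text field and an index field identifying its prompt; both field names are assumptions here, and FakeChunk is a stand-in so the example runs without a model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FakeChunk:
    """Stand-in for a streamed item; the field names are assumptions."""
    text: str
    index: int = 0


def collect_stream(chunks):
    """Join incremental text chunks per prompt index, preserving arrival order."""
    parts = defaultdict(list)
    for chunk in chunks:
        parts[chunk.index].append(chunk.text)
    return {idx: ''.join(texts) for idx, texts in parts.items()}


fake_stream = [
    FakeChunk('Shang', 1), FakeChunk('Hel', 0),
    FakeChunk('hai', 1), FakeChunk('lo', 0),
]
print(collect_stream(fake_stream))
```

If the real items match these assumptions, collect_stream(pipe.stream_infer(prompts)) would return the full text for each prompt once the stream ends.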

Model Analysis Tools

Getting the Logits of Generated Tokens

gen_config = GenerationConfig(output_logits='generation', max_new_tokens=10)
response = pipe(['Hello'], gen_config=gen_config)
logits = [x.logits for x in response]  # prediction scores for the generated tokens of each prompt

Computing Perplexity

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('internlm/internlm2_5-7b-chat', trust_remote_code=True)
messages = [{"role": "user", "content": "Hello, how are you?"}]
input_ids = tokenizer.apply_chat_template(messages)

# get_ppl returns the cross-entropy loss; the exponential is not applied
ppl = pipe.get_ppl(input_ids)

Note: long inputs can cause an OOM error, so use this with care.
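Because get_ppl returns the cross-entropy loss without the exponential, one extra step is needed to obtain perplexity. The sketch below assumes the returned values are mean per-token losses measured in nats; the helper name to_perplexity and the sample losses are illustrative.

```python
import math


def to_perplexity(losses):
    """Convert mean per-token cross-entropy losses (in nats) to perplexities."""
    return [math.exp(loss) for loss in losses]


# Made-up loss values standing in for the output of pipe.get_ppl(...).
example_losses = [0.0, 1.0, 2.5]
print(to_perplexity(example_losses))
```

A loss of 0 corresponds to a perplexity of 1, meaning the model predicted every token with certainty, and higher losses map to exponentially higher perplexity.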

Choosing an Engine

PyTorch Engine

from lmdeploy import pipeline, PytorchEngineConfig

backend_config = PytorchEngineConfig(session_len=2048)  # maximum session length in tokens
pipe = pipeline('internlm/internlm2_5-7b-chat',
               backend_config=backend_config)

LoRA Adapter Support

backend_config = PytorchEngineConfig(
    session_len=2048,
    adapters={'lora_name': 'path/to/lora'}
)
pipe = pipeline('THUDM/chatglm2-6b', backend_config=backend_config)
response = pipe(prompts, adapter_name='lora_name')

Resource Management

Using the pipeline as a context manager releases its resources automatically when the block exits:

with pipeline('internlm/internlm2_5-7b-chat') as pipe:
    response = pipe(['Hello'])
    print(response)

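Troubleshooting

Errors when starting multiple processes
- Make sure the entry point of the script is guarded by if __name__ == '__main__':.
- This is standard practice for Python multiprocessing: without the guard, spawned worker processes re-import the main module and re-run its top-level code.

Custom chat templates
- See the dedicated chat template configuration documentation.

Using LoRA adapters
- If the LoRA weights come with their own chat template, register the template first and then use the adapter.

With the features covered here, developers can use the LMDeploy pipeline for efficient and stable offline LLM inference.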


Disclosure: parts of this article were generated with AI assistance (AIGC) and are provided for reference only.