LangChain-KR in Practice: Running Large Language Models with a Local Hugging Face Pipeline

Article link: https://blog.youkuaiyun.com/gitblog_00388/article/details/148970634


langchain-kr: a Korean-language tutorial built on the official LangChain documentation, the Cookbook, and other practical examples. It shows how to use LangChain more easily and effectively. Project URL: https://gitcode.com/gh_mirrors/la/langchain-kr

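Introduction

In modern natural language processing (NLP), Hugging Face has become the de facto standard platform for open-source models and tools. This article explains how to use the HuggingFacePipeline class in the LangChain-KR project to run language models locally, covering both ordinary open-source models and Gated models that require authorization.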

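Environment Setup

Before you start, make sure the required Python packages are installed. The examples import langchain_huggingface, so that package is needed alongside transformers and torch:

pip install langchain langchain-huggingface transformers torch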

For more memory-efficient attention, you can optionally install the xformers package:

pip install xformers
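Before loading any model, it can help to confirm that the packages are actually visible to the interpreter you are running. The following is a minimal sketch using only the standard library; the helper name check_packages and the two lists are illustrative assumptions, not part of LangChain or Hugging Face.

```python
from importlib.util import find_spec

# Import names of the packages this article relies on; xformers is optional.
REQUIRED = ["transformers", "torch", "langchain_huggingface"]
OPTIONAL = ["xformers"]


def check_packages(names):
    """Return a {import_name: is_installed} mapping without importing anything."""
    return {name: find_spec(name) is not None for name in names}


for name, ok in check_packages(REQUIRED + OPTIONAL).items():
    print(f"{name:<22} {'installed' if ok else 'missing'}")
```

Note that the import name (langchain_huggingface) uses an underscore while the pip package name (langchain-huggingface) uses a hyphen.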

Basic Model Loading

Loading a Model Directly

LangChain provides the convenience method HuggingFacePipeline.from_model_id, which loads a model straight from the Hugging Face model hub:

from langchain_huggingface import HuggingFacePipeline

hf = HuggingFacePipeline.from_model_id(
    model_id="beomi/llama-2-ko-7b",
    task="text-generation",
    pipeline_kwargs={"max_new_tokens": 512}
)

Using a Transformers Pipeline

Alternatively, create a Transformers pipeline first and then pass it to LangChain:

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=512)

hf = HuggingFacePipeline(pipeline=pipe)

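Working with Gated Models

Some models (such as the Gemma family) can be used only after you accept their license agreement. These are called Gated models. To load one, accept the terms on the model's Hugging Face page and supply a Hugging Face access token:

your_huggingface_token = "hf_xxxxxxxxxxxxxxxxxxxx"  # replace with your real token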

tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it", token=your_huggingface_token)
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-7b-it",
    token=your_huggingface_token
)
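Hard-coding a token in source code makes it easy to leak through version control. One common alternative, sketched below, is to read it from an environment variable. HF_TOKEN is the variable name that the huggingface_hub library itself recognizes; the helper get_hf_token is a hypothetical name introduced here for illustration.

```python
import os


def get_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Read the access token from the environment instead of the source file."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set {env_var} to a Hugging Face token that has access to the gated model."
        )
    return token


# Usage with the loading code above (commented out because it downloads weights):
# your_huggingface_token = get_hf_token()
# tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it", token=your_huggingface_token)
```

Failing early with a clear message is usually easier to debug than the authorization error raised partway through a download.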

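Building a Question-Answering Chain

Once the model is loaded, combine it with a PromptTemplate to build a complete question-answering chain. The template below uses the Phi-3 chat markers, so it matches the microsoft/Phi-3-mini-4k-instruct pipeline created earlier:

from langchain_core.prompts import PromptTemplate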
from langchain_core.output_parsers import StrOutputParser

template = """<|system|>You are a helpful assistant.<|end|>
<|user|>{question}<|end|>
<|assistant|>"""

prompt = PromptTemplate.from_template(template)
chain = prompt | hf | StrOutputParser()

# The question is Korean for "Where is the capital of South Korea?"
response = chain.invoke({"question": "대한민국의 수도는 어디야?"})
print(response)
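To see exactly what text the model receives, the substitution performed by the prompt step can be reproduced without loading any model. This sketch uses plain str.format as a stand-in for PromptTemplate, which is an assumption that holds for a simple single-variable template like this one; the function name render is illustrative.

```python
# Same template as above; str.format stands in for PromptTemplate here.
template = """<|system|>You are a helpful assistant.<|end|>
<|user|>{question}<|end|>
<|assistant|>"""


def render(question: str) -> str:
    """Fill the single {question} slot, as the prompt step of the chain would."""
    return template.format(question=question)


rendered = render("대한민국의 수도는 어디야?")
print(rendered)
```

The rendered string ends with the <|assistant|> marker, which signals the model to continue with its answer.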

GPU Acceleration

Single-GPU Inference

Pass the device parameter to load the model onto a GPU:

gpu_llm = HuggingFacePipeline.from_model_id(
    model_id="beomi/llama-2-ko-7b",
    task="text-generation",
    device=0,  # use the first GPU
    pipeline_kwargs={"max_new_tokens": 64}
)
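Hard-coding device=0 fails on machines without a CUDA GPU. A small helper can choose the value at runtime. This is a sketch under the assumption that -1 selects the CPU, which is the convention Transformers pipelines use for integer device indices; pick_device is a hypothetical name.

```python
def pick_device() -> int:
    """Return 0 (first CUDA GPU) when available, otherwise -1 for CPU."""
    try:
        import torch  # imported lazily so the helper works without torch installed
    except ImportError:
        return -1
    return 0 if torch.cuda.is_available() else -1


device = pick_device()
print("device =", device)
```

The result can then be passed as device=device in the from_model_id call shown above.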

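Batched GPU Inference

When you have several inputs, call the chain's batch method so the GPU processes them in groups of batch_size rather than one at a time: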

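gpu_llm = HuggingFacePipeline.from_model_id(
    model_id="beomi/llama-2-ko-7b",
    task="text-generation",
    device=0,
    batch_size=2,  # adjust to fit your GPU memory
    model_kwargs={"temperature": 0, "max_length": 256}
)

# Build the chain that batch() is called on, reusing prompt and
# StrOutputParser from the question-answering section.
gpu_chain = prompt | gpu_llm | StrOutputParser()

# Each question asks, in Korean, how to write the number i in Hangul.
questions = [{"question": f"숫자 {i} 이 한글로 뭐에요?"} for i in range(4)]
answers = gpu_chain.batch(questions)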
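Conceptually, batch_size controls how many prompts are handed to the model in one forward pass. The sketch below illustrates that grouping in plain Python. It is a simplified model of the idea, not the library's actual implementation, and chunk is a hypothetical helper.

```python
from typing import Iterator, List


def chunk(prompts: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size prompts."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(prompts), batch_size):
        yield prompts[start:start + batch_size]


prompts = [f"숫자 {i} 이 한글로 뭐에요?" for i in range(4)]
batches = list(chunk(prompts, batch_size=2))
print(len(batches), [len(b) for b in batches])  # 2 [2, 2]
```

With four questions and batch_size=2, the model runs twice on two prompts each. A larger batch_size means fewer passes but more GPU memory per pass.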

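Performance Optimization Tips

Quantized loading: for large models, 4-bit quantization reduces memory use. It relies on the bitsandbytes and accelerate packages and a CUDA GPU:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Passing load_in_4bit directly to from_pretrained is deprecated in recent transformers.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True)
)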

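Flash Attention: on GPUs with the Ampere architecture or newer, enable FlashAttention-2 for faster attention. It requires the flash-attn package and a half-precision model:

import torch

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # FlashAttention-2 only supports fp16/bf16
    attn_implementation="flash_attention_2"
)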

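Device mapping: in a multi-GPU environment, pass device_map="auto" to spread the model's layers across the available devices automatically. This requires the accelerate package.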
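A rough estimate of weight memory helps decide whether quantization or multi-GPU device mapping is needed at all. The sketch below multiplies the parameter count by the bytes per parameter at each precision. It covers the weights only and ignores activations, the KV cache, and quantization overhead, so treat the numbers as lower bounds. The 7-billion figure is an assumption taken from the "7b" in the model names used in this article.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "4bit": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate size of the weights alone, in GB (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


for precision in BYTES_PER_PARAM:
    print(f"7B model, {precision}: ~{weight_memory_gb(7e9, precision):.1f} GB")
```

For a 7B model this gives about 14 GB at fp16 and about 3.5 GB at 4-bit, which is why 4-bit loading can bring such a model within reach of a single consumer GPU.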

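Troubleshooting

Out of memory: reduce batch_size or load the model quantized.
Model fails to load: check that the model ID is correct and, for a Gated model, that your account has been granted access.
Poor performance: make sure the CUDA environment is configured correctly, and consider a more efficient attention implementation.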
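The out-of-memory advice can be automated by retrying with a smaller batch. The following is a hedged sketch: run_with_smaller_batches and fake_run are hypothetical names, and the check relies on the assumption that PyTorch reports CUDA out-of-memory as a RuntimeError whose message contains "out of memory". In real use, run would rebuild or reconfigure the pipeline with the given batch size and execute the batch.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_smaller_batches(run: Callable[[int], T], batch_size: int) -> T:
    """Call run(batch_size), halving batch_size after each out-of-memory error."""
    while True:
        try:
            return run(batch_size)
        except RuntimeError as err:
            # Only out-of-memory errors trigger a retry; anything else, or a
            # failure at batch_size 1, is re-raised untouched.
            if "out of memory" not in str(err).lower() or batch_size <= 1:
                raise
            batch_size //= 2


def fake_run(batch_size: int) -> str:
    """Stand-in for a real inference call; pretends only batch_size <= 2 fits."""
    if batch_size > 2:
        raise RuntimeError("CUDA out of memory (simulated)")
    return f"ok with batch_size={batch_size}"


print(run_with_smaller_batches(fake_run, batch_size=8))  # ok with batch_size=2
```

Halving is a simple heuristic. If even a batch size of 1 fails, the model itself does not fit and quantization is the next thing to try.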

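Conclusion

With the HuggingFacePipeline usage shown in the LangChain-KR project, developers can run a wide range of Hugging Face models locally, from simple question answering to more complex text generation applications. The approach described here is not limited to Korean models: it works for models in other languages as well, and only the model ID needs to change.

With these techniques you can build more capable and flexible NLP applications while keeping full control over the model and the data flow.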
