# Running LLMs Locally with ExLlamaV2: A Complete Installation and Usage Guide

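## Technical Background

ExLlamaV2 is a fast inference library for running large language models (LLMs) locally on modern consumer-grade GPUs. It supports inference with GPTQ and EXL2 quantized models, both of which are available on Hugging Face. This article shows how to use ExLlamaV2 from LangChain.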

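## Core Principles

ExLlamaV2's main advantage is its support for quantized models, which makes it possible to run large models on modest hardware. Quantization stores the weights at lower precision, which reduces memory use and lets inference fit on GPUs with less VRAM. ExLlamaV2 is optimized for both inference speed and memory use, which makes running models locally an efficient option.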
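
To make the memory argument concrete, here is a back-of-the-envelope sketch. It only counts the weights (parameter count times bits per weight) and ignores the KV cache, activations, and framework overhead, so treat the numbers as a lower bound. The function name and the 7-billion parameter count are illustrative assumptions, not part of ExLlamaV2.

```python
def estimate_weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    total_bytes = n_params * bits_per_weight / 8
    return total_bytes / 2**30


n_params = 7e9  # assumed parameter count for a "7B" model
for label, bits in [("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
    gib = estimate_weight_memory_gib(n_params, bits)
    print(f"{label:>5}: {gib:.1f} GiB")
```

Under these assumptions, going from 16-bit to 4-bit weights cuts the weight footprint to a quarter, which is the difference between needing a data-center card and fitting on a typical consumer GPU.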

## Code Walkthrough

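First, set up the environment and install the required libraries. The pinned versions below assume Python 3.11 and CUDA 12.1, which is what the prebuilt ExLlamaV2 wheel targets (`cp311`, `cu121`, Linux x86_64):
```bash
# The +cu121 build of torch is served from the PyTorch wheel index rather than PyPI
pip install langchain==0.1.7 huggingface_hub torch==2.1.1+cu121 --extra-index-url https://download.pytorch.org/whl/cu121
pip install https://github.com/turboderp/exllamav2/releases/download/v0.0.12/exllamav2-0.0.12+cu121-cp311-cp311-linux_x86_64.whl
```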
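
The wheel filename above encodes the environment it was built for, so a mismatch can be spotted before installing. The following is a minimal sketch that splits the filename into the standard wheel fields and compares the Python tag with the running interpreter. It assumes a wheel name without an optional build tag, and it reads the `+cu121` suffix as the CUDA build, which is a naming convention of this release rather than something the wheel format defines. The helper names are hypothetical.

```python
import sys

WHEEL = "exllamav2-0.0.12+cu121-cp311-cp311-linux_x86_64.whl"


def parse_wheel_name(filename: str) -> dict:
    """Split a wheel filename (without a build tag) into its dash-separated fields."""
    parts = filename.removesuffix(".whl").split("-")
    if len(parts) != 5:
        raise ValueError(f"unexpected wheel name layout: {filename}")
    name, version, python_tag, abi_tag, platform_tag = parts
    # Anything after "+" is the local version label, used here for the CUDA build
    public_version, _, local = version.partition("+")
    return {
        "name": name,
        "version": public_version,
        "local": local,
        "python_tag": python_tag,
        "abi_tag": abi_tag,
        "platform_tag": platform_tag,
    }


def interpreter_tag() -> str:
    """CPython tag of the running interpreter, e.g. 'cp311' for Python 3.11."""
    return f"cp{sys.version_info.major}{sys.version_info.minor}"


info = parse_wheel_name(WHEEL)
if info["python_tag"] != interpreter_tag():
    print(f"Wheel targets {info['python_tag']}, but this interpreter is {interpreter_tag()}")
else:
    print("Python version matches the wheel")
```

If the tags differ, pick the matching wheel from the ExLlamaV2 releases page instead of forcing the install.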

If you use Conda, also install the following dependencies:

```bash
conda install -c conda-forge ninja ffmpeg gxx=11.4
conda install -c nvidia/label/cuda-12.1.0 cuda
```

Next, download and load a suitable quantized model:

```python
import os

from exllamav2.generator import ExLlamaV2Sampler
from huggingface_hub import snapshot_download
from langchain.chains import LLMChain
from langchain_community.llms.exllamav2 import ExLlamaV2
from langchain_core.callbacks import StreamingStdOutCallbackHandler
from langchain_core.prompts import PromptTemplate


def download_GPTQ_model(model_name: str, models_dir: str = "./models/") -> str:
    """Download a quantized model from Hugging Face unless a local copy already exists."""
    os.makedirs(models_dir, exist_ok=True)

    # Flatten "org/repo" into a single directory name such as "org_repo"
    _model_name = model_name.replace("/", "_")
    model_path = os.path.join(models_dir, _model_name)
    if _model_name not in os.listdir(models_dir):
        snapshot_download(repo_id=model_name, local_dir=model_path, local_dir_use_symlinks=False)
    else:
        print(f"{model_name} already exists in the models directory")

    return model_path


model_path = download_GPTQ_model("TheBloke/Mistral-7B-Instruct-v0.2-GPTQ")

# Stream generated tokens to stdout as they are produced
callbacks = [StreamingStdOutCallbackHandler()]

template = """Question: {question}

Answer: Let's think step by step."""
prompt = PromptTemplate(template=template, input_variables=["question"])

# Sampling parameters passed through to the ExLlamaV2 generator
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.85
settings.top_k = 50
settings.top_p = 0.8
settings.token_repetition_penalty = 1.05

llm = ExLlamaV2(
    model_path=model_path,
    callbacks=callbacks,
    verbose=True,
    settings=settings,
    streaming=True,
    max_new_tokens=150,
)

llm_chain = LLMChain(prompt=prompt, llm=llm)

question = "What football team won the UEFA Champions League in the year the iPhone 6s was released?"
# invoke() returns a dict containing the inputs and the generated text
output = llm_chain.invoke({"question": question})
print(output)
```
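
One caveat with the download helper: it treats any existing directory with the right name as a finished download, so an interrupted download will be reused as-is. The sketch below adds a cheap sanity check. It assumes that a usable model directory contains a `config.json` and at least one `.safetensors` file, which matches the usual layout of GPTQ repositories on Hugging Face but is not something the helper or ExLlamaV2 documents as a guarantee. The function name is hypothetical.

```python
import tempfile
from pathlib import Path


def looks_like_model_dir(model_path: str) -> bool:
    """Return True if the directory has a config.json and at least one .safetensors file."""
    path = Path(model_path)
    if not path.is_dir():
        return False
    has_config = (path / "config.json").is_file()
    has_weights = any(path.glob("*.safetensors"))
    return has_config and has_weights


# Demonstration on a throwaway directory with placeholder files
with tempfile.TemporaryDirectory() as tmp:
    print(looks_like_model_dir(tmp))  # empty directory
    (Path(tmp) / "config.json").write_text("{}")
    (Path(tmp) / "model.safetensors").write_bytes(b"")
    print(looks_like_model_dir(tmp))  # both expected files present
```

Calling this on the returned model path before constructing the LLM gives a clearer error than a failure deep inside model loading. If it returns `False`, deleting the directory and downloading again is the simplest fix.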

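## Application Scenarios

In resource-constrained environments, local inference with quantized models helps reduce cost and improve efficiency. For example, you can run experiments on a personal development machine or deploy a language model on an edge device that has a supported GPU.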

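## Practical Recommendations

- Make sure your CUDA and PyTorch versions are compatible so that GPU acceleration actually works.
- Estimate the model's memory requirements and choose a quantized model that fits your hardware.
- Use callbacks to monitor the inference process; this helps with debugging and tuning.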
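
The first two recommendations can be checked with a few lines of PyTorch. This is a minimal sketch, not a complete diagnostic: it reports the installed PyTorch version, the CUDA version that build was compiled against, and the total memory of each visible GPU. The function name and report layout are illustrative, and the function returns an empty report instead of failing when PyTorch is not installed.

```python
import importlib.util


def gpu_environment_report() -> dict:
    """Collect the PyTorch and CUDA facts that matter for local GPU inference."""
    report = {
        "torch_installed": False,
        "torch_version": None,
        "cuda_build": None,
        "cuda_available": False,
        "gpus": [],
    }
    if importlib.util.find_spec("torch") is None:
        return report

    import torch

    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    report["cuda_build"] = torch.version.cuda  # None for CPU-only builds
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        for index in range(torch.cuda.device_count()):
            props = torch.cuda.get_device_properties(index)
            report["gpus"].append(
                {"name": props.name, "total_memory_gib": round(props.total_memory / 2**30, 1)}
            )
    return report


print(gpu_environment_report())
```

If `cuda_build` does not match the `cu121` wheel, or `cuda_available` is `False`, fix the installation before loading a model. Comparing `total_memory_gib` with the size of the quantized weights gives a first indication of whether the model will fit.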

Closing note: if you run into problems, feel free to discuss them in the comments.

