Build a Local RAG Application
The popularity of projects like PrivateGPT, llama.cpp, GPT4All, and llamafile underscores the importance of running LLMs locally.
LangChain integrates with many open-source LLMs that can be run locally.
For example, here we show how to run GPT4All or LLaMA2 locally (e.g., on your laptop) using local embeddings and a local LLM.
Document Loading
First, install the packages needed for local embeddings and vector storage.
%pip install --upgrade --quiet langchain langchain-community langchainhub gpt4all langchain-chroma
Load and split an example document.
We will use a blog post on agents as an example.
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
data = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
all_splits = text_splitter.split_documents(data)
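As a quick sanity check (this isn't part of the original tutorial), you can inspect how many chunks the splitter produced and preview the first one:
# Optional: inspect the ~500-character chunks produced by the splitter.
print(len(all_splits))
print(all_splits[0].page_content[:200])
print(all_splits[0].metadata)  # source URL and title carried over from the loader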
Next, the steps below will download the GPT4All embeddings locally (if you don't already have them).
from langchain_chroma import Chroma
from langchain_community.embeddings import GPT4AllEmbeddings
vectorstore = Chroma.from_documents(documents=all_splits, embedding=GPT4AllEmbeddings())
Test that similarity search is working with our local embeddings.
question = "What are the approaches to Task Decomposition?"
docs = vectorstore.similarity_search(question)
len(docs)
docs[0]
Document(page_content='Task decomposition can be done (1) by LLM with simple prompting like "Steps for XYZ.\n1.", "What are the subgoals for achieving XYZ?", (2) by using task-specific instructions; e.g. "Write a story outline." for writing a novel, or (3) with human inputs.', metadata={'description': 'Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.
Agent System Overview In a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:', 'language': 'en', 'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/', 'title': "LLM Powered Autonomous Agents | Lil'Log"})
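If you also want to see how close each match is, Chroma exposes a scored variant of the same search (a small sketch; with the default distance metric, lower scores mean closer matches):
# Retrieve documents together with their distance scores.
docs_and_scores = vectorstore.similarity_search_with_score(question, k=4)
for doc, score in docs_and_scores:
    print(f"{score:.3f}  {doc.page_content[:80]}")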
Models
LLaMA2
Note: new versions of llama-cpp-python use GGUF model files.
If you have an existing GGML model, see the instructions for conversion to GGUF.
Alternatively, you can download a model already converted to GGUF.
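For illustration, a conversion could look roughly like the following; the script name and flags vary between llama.cpp versions, so treat this as a hypothetical sketch and check the docs in your llama.cpp checkout:
# Hypothetical invocation of llama.cpp's GGML-to-GGUF conversion script.
! python llama.cpp/convert-llama-ggml-to-gguf.py --input llama-2-13b-chat.ggmlv3.q4_0.bin --output llama-2-13b-chat.Q4_0.gguf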
Finally, install llama-cpp-python as described in detail in the instructions.
%pip install --upgrade --quiet llama-cpp-python
To enable use of the GPU on Apple Silicon, follow the steps here to use the Python binding with Metal support.
In particular, ensure that conda is using the correct virtual environment that you created (miniforge3).
For example, for me:
conda activate /Users/rlm/miniforge3/envs/llama
With that confirmed:
! CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 /Users/rlm/miniforge3/envs/llama/bin/pip install -U llama-cpp-python --no-cache-dir
from langchain_community.llms import LlamaCpp
Set model parameters as noted in the llama.cpp docs.
n_gpu_layers = 1  # Metal set to 1 is enough.
n_batch = 512  # Should be between 1 and n_ctx, consider the amount of RAM of your Apple Silicon chip.
# Make sure the model path is correct for your system!
llm = LlamaCpp(
    model_path="/Users/rlm/Desktop/Code/llama.cpp/models/llama-2-13b-chat.ggufv3.q4_0.bin",
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    n_ctx=2048,
    f16_kv=True,  # MUST set to True, otherwise you will run into problems after a couple of calls
    verbose=True,
)
Note that these lines indicate that Metal was enabled properly:
ggml_metal_init: allocating
ggml_metal_init: using MPS
llm.invoke("Simulate a rap battle between Stephen Colbert and John Oliver")
Llama.generate: prefix-match hit
by jonathan
Here's the hypothetical rap battle:
[Stephen Colbert]: Yo, this is Stephen Colbert, known for my comedy show. I'm here to put some sense in your mind, like an enema do-go. Your opponent? A man of laughter and witty quips, John Oliver! Now let's see who gets the most laughs while taking shots at each other
[John Oliver]: Yo, this is John Oliver, known for my own comedy show. I'm here to take your mind on an adventure through wit and humor. But first, allow me to you to our contestant: Stephen Colbert! His show has been around since the '90s, but it's time to see who can out-rap whom
[Stephen Colbert]: You claim to be a witty man, John Oliver, with your British charm and clever remarks. But my knows that I'm America's funnyman! Who's the one taking you? Nobody!
[John Oliver]: Hey Stephen Colbert, don't get too cocky. You may
llama_print_timings: load time = 4481.74 ms
llama_print_timings: sample time = 183.05 ms / 256 runs ( 0.72 ms per token, 1398.53 tokens per second)
llama_print_timings: prompt eval time = 456.05 ms / 13 tokens ( 35.08 ms per token, 28.51 tokens per second)
llama_print_timings: eval time = 7375.20 ms / 255 runs ( 28.92 ms per token, 34.58 tokens per second)
llama_print_timings: total time = 8388.92 ms
"by jonathan \n\nHere's the hypothetical rap battle:\n\n[Stephen Colbert]: Yo, this is Stephen Colbert, known for my comedy show. I'm here to put some sense in your mind, like an enema do-go. Your opponent? A man of laughter and witty quips, John Oliver! Now let's see who gets the most laughs while taking shots at each other\n\n[John Oliver]: Yo, this is John Oliver, known for my own comedy show. I'm here to take your mind on an adventure through wit and humor. But first, allow me to you to our contestant: Stephen Colbert! His show has been around since the '90s, but it's time to see who can out-rap whom\n\n[Stephen Colbert]: You claim to be a witty man, John Oliver, with your British charm and clever remarks. But my knows that I'm America's funnyman! Who's the one taking you? Nobody!\n\n[John Oliver]: Hey Stephen Colbert, don't get too cocky. You may"
GPT4All
Similarly, we can use GPT4All.
Download the GPT4All model binary.
The model explorer on the GPT4All site is a great way to choose and download a model.
Then, specify the path that you downloaded to. For example, for me, the model lives here:
/Users/rlm/Desktop/Code/gpt4all/models/nous-hermes-13b.ggmlv3.q4_0.bin
from langchain_community.llms import GPT4All
gpt4all = GPT4All(
    model="/Users/rlm/Desktop/Code/gpt4all/models/nous-hermes-13b.ggmlv3.q4_0.bin",
    max_tokens=2048,
)
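As a quick smoke test that the binary loads (this call isn't in the original tutorial; the prompt here is arbitrary):
# Sanity-check the GPT4All wrapper with a direct call.
gpt4all.invoke("Explain task decomposition in one sentence.")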
llamafile
One of the simplest ways to run an LLM locally is using a llamafile. All you need to do is:
- Download a llamafile from HuggingFace
- Make the file executable
- Run the file
llamafiles bundle model weights and a specially compiled version of llama.cpp into a single file that can run on most computers without any additional dependencies. They also come with an embedded inference server that provides an API for interacting with your model.
Here's a simple bash script that shows all 3 setup steps:
# Download a llamafile from HuggingFace
wget https://huggingface.co/jartine/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile
# Make the file executable. On Windows, instead just rename the file to end in ".exe".
chmod +x TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile
# Start the model server. Listens at http://localhost:8080 by default.
./TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile --server --nobrowser
After you run the above setup steps, you can use LangChain to interact with your model:
from langchain_community.llms.llamafile import Llamafile
llamafile = Llamafile()
llamafile.invoke("Here is my grandmother's beloved recipe for spaghetti and meatballs:")
-1 1/2 (8 oz. Pounds) ground beef, browned and cooked until no longer pink
-3 cups whole wheat spaghetti
-4 (10 oz) cans diced tomatoes with garlic and basil
-2 eggs, beaten
-1 cup grated parmesan cheese
-1/2 teaspoon salt
-1/4 teaspoon black pepper
-1 cup breadcrumbs (16 oz)
-2 tablespoons olive oil
Instructions:
1. Cook spaghetti according to package directions. Drain and set aside.
2. In a large skillet, brown ground beef over medium heat until no longer pink. Drain any excess grease.
3. Stir in diced tomatoes with garlic and basil, and season with salt and pepper. Cook for 5 to 7 minutes or until sauce is heated through. Set aside.
4. In a large bowl, beat eggs with a fork or whisk until fluffy. Add cheese, salt, and black pepper. Set aside.
5. In another bowl, combine breadcrumbs and olive oil. Dip each spaghetti into the egg mixture and then coat in the breadcrumb mixture. Place on baking sheet lined with parchment paper to prevent sticking. Repeat until all spaghetti are coated.
6. Heat oven to 375 degrees. Bake for 18 to 20 minutes, or until lightly golden brown.
7. Serve hot with meatballs and sauce on the side. Enjoy!
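Since the llamafile embeds an inference server, you can also query it over plain HTTP without LangChain. This sketch assumes the llama.cpp-style /completion endpoint on the default port; run your llamafile with --help to confirm the exact API:
import requests

# Query the llamafile's built-in inference server directly.
resp = requests.post(
    "http://localhost:8080/completion",
    json={"prompt": "Name three uses of a local LLM:", "n_predict": 64},
)
print(resp.json()["content"])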
Using in a chain
We can create a summarization chain by passing in the retrieved docs and a simple prompt.
It formats the prompt template using the input key values provided and passes the formatted string to GPT4All, LLama-V2, or another specified LLM.
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
# Prompt
prompt = PromptTemplate.from_template(
    "Summarize the main themes in these retrieved docs: {docs}"
)
# Chain
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
chain = {"docs": format_docs} | prompt | llm | StrOutputParser()
# Run
question = "What are the approaches to Task Decomposition?"
docs = vectorstore.similarity_search(question)
chain.invoke(docs)
Llama.generate: prefix-match hit
Based on the retrieved documents, the main themes are:
1. Task decomposition: The ability to break down complex tasks into smaller subtasks, which can be handled by an LLM or other components of the agent system.
2. LLM as the core controller: The use of a large language model (LLM) as the primary controller of an autonomous agent system, complemented by other key components such as a knowledge graph and a planner.
3. Potentiality of LLM: The idea that LLMs have the potential to be used as powerful general problem solvers, not just for generating well-written copies but also for solving complex tasks and achieving human-like intelligence.
4. Challenges in long-term planning: The challenges in planning over a lengthy history and effectively exploring the solution space, which are important limitations of current LLM-based autonomous agent systems.
llama_print_timings: load time = 1191.88 ms
llama_print_timings: sample time = 134.47 ms / 193 runs ( 0.70 ms per token, 1435.25 tokens per second)
llama_print_timings: prompt eval time = 39470.18 ms / 1055 tokens ( 37.41 ms per token, 26.73 tokens per second)
llama_print_timings: eval time = 8090.85 ms / 192 runs ( 42.14 ms per token, 23.73 tokens per second)
llama_print_timings: total time = 47943.12 ms
'\nBased on the retrieved documents, the main themes are:\n1. Task decomposition: The ability to break down complex tasks into smaller subtasks, which can be handled by an LLM or other components of the agent system.\n2. LLM as the core controller: The use of a large language model (LLM) as the primary controller of an autonomous agent system, complemented by other key components such as a knowledge graph and a planner.\n3. Potentiality of LLM: The idea that LLMs have the potential to be used as powerful general problem solvers, not just for generating well-written copies but also for solving complex tasks and achieving human-like intelligence.\n4. Challenges in long-term planning: The challenges in planning over a lengthy history and effectively exploring the solution space, which are important limitations of current LLM-based autonomous agent systems.'
Q&A
We can also use the LangChain Prompt Hub to store and fetch model-specific prompts.
Let's try with a default RAG prompt, here.
from langchain import hub
rag_prompt = hub.pull("rlm/rag-prompt")
rag_prompt.messages
[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], template="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: {question} \nContext: {context} \nAnswer:"))]
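To see exactly what string the model will receive, you can render the pulled prompt with placeholder values for its two input variables (an illustrative check, not part of the original tutorial):
# Fill the hub prompt with toy values to inspect the final prompt text.
print(rag_prompt.format(context="(retrieved docs here)", question="(user question here)"))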
from langchain_core.runnables import RunnablePassthrough, RunnablePick
# Chain
chain = (
    RunnablePassthrough.assign(context=RunnablePick("context") | format_docs)
    | rag_prompt
    | llm
    | StrOutputParser()
)
# Run
chain.invoke({"context": docs, "question": question})
Llama.generate: prefix-match hit
Task can be done by down a task into smaller subtasks, using simple prompting like "Steps for XYZ." or task-specific like "Write a story outline" for writing a novel.
llama_print_timings: load time = 11326.20 ms
llama_print_timings: sample time = 33.03 ms / 47 runs ( 0.70 ms per token, 1422.86 tokens per second)
llama_print_timings: prompt eval time = 1387.31 ms / 242 tokens ( 5.73 ms per token, 174.44 tokens per second)
llama_print_timings: eval time = 1321.62 ms / 46 runs ( 28.73 ms per token, 34.81 tokens per second)
llama_print_timings: total time = 2801.08 ms
{'output_text': '\nTask can be done by down a task into smaller subtasks, using simple prompting like "Steps for XYZ." or task-specific like "Write a story outline" for writing a novel.'}
Now, let's try with a prompt specifically designed for LLaMA, which includes special tokens.
# Prompt
rag_prompt_llama = hub.pull("rlm/rag-prompt-llama")
rag_prompt_llama.messages
ChatPromptTemplate(input_variables=['question', 'context'], output_parser=None, partial_variables={}, messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['question', 'context'], output_parser=None, partial_variables={}, template="[INST]<<SYS>> You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.<</SYS>> \nQuestion: {question} \nContext: {context} \nAnswer: [/INST]", template_format='f-string', validate_template=True), additional_kwargs={})])
# Chain
chain = (
    RunnablePassthrough.assign(context=RunnablePick("context") | format_docs)
    | rag_prompt_llama
    | llm
    | StrOutputParser()
)
# Run
chain.invoke({"context": docs, "question": question})
Llama.generate: prefix-match hit
Sure, I'd be happy to help! Based on the context, here are some to task:
1. LLM with simple prompting: This using a large model (LLM) with simple prompts like "Steps for XYZ" or "What are the subgoals for achieving XYZ?" to decompose tasks into smaller steps.
2. Task-specific: Another is to use task-specific, such as "Write a story outline" for writing a novel, to guide the of tasks.
3. Human inputs:, human inputs can be used to supplement the process, in cases where the task a high degree of creativity or expertise.
As fores in long-term and task, one major is that LLMs to adjust plans when faced with errors, making them less robust to humans who learn from trial and error.
llama_print_timings: load time = 11326.20 ms
llama_print_timings: sample time = 144.81 ms / 207 runs ( 0.70 ms per token, 1429.47 tokens per second)
llama_print_timings: prompt eval time = 1506.13 ms / 258 tokens ( 5.84 ms per token, 171.30 tokens per second)
llama_print_timings: eval time = 6231.92 ms / 206 runs ( 30.25 ms per token, 33.06 tokens per second)
llama_print_timings: total time = 8158.41 ms
{'output_text': ' Sure, I\'d be happy to help! Based on the context, here are some to task:\n\n1. LLM with simple prompting: This using a large model (LLM) with simple prompts like "Steps for XYZ" or "What are the subgoals for achieving XYZ?" to decompose tasks into smaller steps.\n2. Task-specific: Another is to use task-specific, such as "Write a story outline" for writing a novel, to guide the of tasks.\n3. Human inputs:, human inputs can be used to supplement the process, in cases where the task a high degree of creativity or expertise.\n\nAs fores in long-term and task, one major is that LLMs to adjust plans when faced with errors, making them less robust to humans who learn from trial and error.'}
Q&A with retrieval
Instead of passing in docs manually, we can automatically retrieve them from our vector store based on the user question.
This will use the default QA prompt (shown here) and will retrieve from the vectorDB.
retriever = vectorstore.as_retriever()
qa_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | rag_prompt
    | llm
    | StrOutputParser()
)
qa_chain.invoke(question)
Llama.generate: prefix-match hit
Sure! Based on the context, here's my answer to your:
There are several to task,:
1. LLM-based with simple prompting, such as "Steps for XYZ" or "What are the subgoals for achieving XYZ?"
2. Task-specific, like "Write a story outline" for writing a novel.
3. Human inputs to guide the process.
These can be used to decompose complex tasks into smaller, more manageable subtasks, which can help improve the and effectiveness of task. However, long-term and task can being due to the need to plan over a lengthy history and explore the space., LLMs may to adjust plans when faced with errors, making them less robust to human learners who can learn from trial and error.
llama_print_timings: load time = 11326.20 ms
llama_print_timings: sample time = 139.20 ms / 200 runs ( 0.70 ms per token, 1436.76 tokens per second)
llama_print_timings: prompt eval time = 1532.26 ms / 258 tokens ( 5.94 ms per token, 168.38 tokens per second)
llama_print_timings: eval time = 5977.62 ms / 199 runs ( 30.04 ms per token, 33.29 tokens per second)
llama_print_timings: total time = 7916.21 ms
{'query': 'What are the approaches to Task Decomposition?',
'result': ' Sure! Based on the context, here\'s my answer to your:\n\nThere are several to task,:\n\n1. LLM-based with simple prompting, such as "Steps for XYZ" or "What are the subgoals for achieving XYZ?"\n2. Task-specific, like "Write a story outline" for writing a novel.\n3. Human inputs to guide the process.\n\nThese can be used to decompose complex tasks into smaller, more manageable subtasks, which can help improve the and effectiveness of task. However, long-term and task can being due to the need to plan over a lengthy history and explore the space., LLMs may to adjust plans when faced with errors, making them less robust to human learners who can learn from trial and error.'}
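Because the chain is composed of LCEL runnables, you can also stream the answer token by token instead of waiting for the full string (a sketch; streaming granularity depends on the underlying LLM's support):
# Stream the answer as it is generated.
for chunk in qa_chain.stream(question):
    print(chunk, end="", flush=True)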