A Guide to Improving Information Retrieval with MultiQueryRetriever

Latest recommended article published on 2025-01-20 22:11:48

License: CC 4.0 BY-SA

Introduction

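Distance-based vector database retrieval embeds queries and documents in a high-dimensional space and uses a distance metric to find the documents whose embeddings are closest to the query. The results can drift, however, when the query wording changes slightly or when the embeddings fail to capture the semantics of the data. Prompt engineering can address these problems by hand, but doing so is tedious. MultiQueryRetriever uses a large language model (LLM) to generate several versions of the query from different perspectives automatically, runs each of them, and returns the combined set of unique documents, which gives the user a richer result set.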

This article shows how to use MultiQueryRetriever for information retrieval and walks through code examples of its use.
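To make the wording sensitivity described above concrete, here is a minimal sketch that uses a toy bag-of-words vector in place of a real embedding model. The helpers `embed`, `cosine`, and `top_doc` and the two sample documents are assumptions made for illustration only; real embeddings are dense vectors and behave far less crudely, but the same effect can appear in milder form.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': word counts (an illustrative stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Two hypothetical documents that cover closely related ideas.
docs = [
    "task decomposition splits a large task into smaller steps",
    "planning breaks big goals into subgoals",
]


def top_doc(query: str) -> str:
    """Return the document most similar to the query."""
    q = embed(query)
    return max(docs, key=lambda d: cosine(q, embed(d)))


# Two phrasings of roughly the same information need.
first = top_doc("how does task decomposition work")
second = top_doc("how are big goals broken into subgoals")
print(first == second)
```

Each phrasing pulls a different document to the top, which is why issuing several rephrasings and merging the results can widen coverage.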

Main Content

Building the Vector Database

We will build a vector database from Lilian Weng's blog post "LLM Powered Autonomous Agents" as the example.

from langchain_chroma import Chroma
from langchain_community.document_loaders import WebBaseLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load the blog post
loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
data = loader.load()

# Split the text into chunks of at most 500 characters, with no overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
splits = text_splitter.split_documents(data)

# Embed the chunks and store them in a Chroma vector database
embedding = OpenAIEmbeddings()
vectordb = Chroma.from_documents(documents=splits, embedding=embedding)
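To illustrate what `chunk_size` and `chunk_overlap` control in the listing above, here is a simplified sketch of fixed-size character chunking. The function `split_text` is hypothetical and is not how `RecursiveCharacterTextSplitter` is implemented; the real splitter prefers to break at natural separators such as paragraphs and sentences, so its chunk boundaries will differ.

```python
from typing import List


def split_text(text: str, chunk_size: int, chunk_overlap: int = 0) -> List[str]:
    """Cut text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "x" * 1200  # placeholder text, 1200 characters long
chunks = split_text(sample, chunk_size=500, chunk_overlap=0)
lengths = [len(chunk) for chunk in chunks]
print(lengths)
```

With an overlap of zero, the chunks partition the text exactly. A non-zero overlap repeats some characters across neighbouring chunks, which can help when a relevant sentence straddles a boundary.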

Querying with MultiQueryRetriever

Once you specify the LLM to use for query generation, MultiQueryRetriever generates the varied queries automatically.

import logging

from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI

question = "What are the approaches to Task Decomposition?"
llm = ChatOpenAI(temperature=0)
retriever_from_llm = MultiQueryRetriever.from_llm(
    retriever=vectordb.as_retriever(), llm=llm
)

# Enable INFO logging so the generated queries are printed
logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

unique_docs = retriever_from_llm.invoke(question)
print(len(unique_docs))  # number of unique documents retrieved
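The idea behind the retriever above can be sketched without any external service: generate several queries, run each one, and keep the union of the results without duplicates. Everything in this sketch is a stand-in chosen for illustration, not LangChain's implementation: `multi_query_retrieve`, `fake_generate_queries`, `fake_retrieve`, and the `FAKE_INDEX` contents are all hypothetical.

```python
from typing import Callable, Dict, List


def multi_query_retrieve(
    question: str,
    generate_queries: Callable[[str], List[str]],
    retrieve: Callable[[str], List[str]],
) -> List[str]:
    """Run every generated query and return the unique results in first-seen order."""
    seen = set()
    unique: List[str] = []
    for query in generate_queries(question):
        for doc in retrieve(query):
            if doc not in seen:
                seen.add(doc)
                unique.append(doc)
    return unique


def fake_generate_queries(question: str) -> List[str]:
    """Stand-in for the LLM: returns fixed rephrasings instead of calling a model."""
    return [
        "How can a task be split into subtasks?",
        "Which methods break down complex tasks?",
        "What techniques exist for task decomposition?",
    ]


# Stand-in for the vector store: maps each query to made-up chunk identifiers.
FAKE_INDEX: Dict[str, List[str]] = {
    "How can a task be split into subtasks?": ["chunk-1", "chunk-2"],
    "Which methods break down complex tasks?": ["chunk-2", "chunk-3"],
    "What techniques exist for task decomposition?": ["chunk-1", "chunk-4"],
}


def fake_retrieve(query: str) -> List[str]:
    return FAKE_INDEX.get(query, [])


result = multi_query_retrieve(
    "What are the approaches to Task Decomposition?",
    fake_generate_queries,
    fake_retrieve,
)
print(result)
```

Because the overlapping hits are merged, the three queries return four distinct chunks between them, more than any single query finds on its own.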

Customizing the Prompt Template

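MultiQueryRetriever also lets you supply your own prompt template, together with an output parser that turns the LLM's response into a list of queries.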

from typing import List
from langchain_core.output_parsers import BaseOutputParser
from langchain_core.prompts import PromptTemplate

# Output parser that splits the LLM response into a list of queries
class LineListOutputParser(BaseOutputParser[List[str]]):
    def parse(self, text: str) -> List[str]:
        lines = text.strip().split("\n")
        return list(filter(None, lines))  # drop empty lines

output_parser = LineListOutputParser()

QUERY_PROMPT = PromptTemplate(
    input_variables=["question"],
    template="""You are an AI language model assistant. Your task is to generate five 
    different versions of the given user question to retrieve relevant documents from a vector 
    database. By generating multiple perspectives on the user question, your goal is to help
    the user overcome some of the limitations of the distance-based similarity search. 
    Provide these alternative questions separated by newlines.
    Original question: {question}""",
)

llm = ChatOpenAI(temperature=0)
llm_chain = QUERY_PROMPT | llm | output_parser

retriever = MultiQueryRetriever(
    retriever=vectordb.as_retriever(), llm_chain=llm_chain, parser_key="lines"
)

# Run the retriever
unique_docs = retriever.invoke("What does the course say about regression?")
print(len(unique_docs))  # number of unique documents retrieved
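To see what the line-splitting parser above returns, the sketch below applies the same logic to a made-up LLM response. `parse_lines` mirrors the `parse` method shown earlier. `parse_lines_strict` is a hypothetical variant, not part of LangChain, that also trims each line and drops lines containing only whitespace. The sample response text is invented for illustration.

```python
from typing import List


def parse_lines(text: str) -> List[str]:
    """Same logic as LineListOutputParser.parse: split on newlines, drop empty strings."""
    return list(filter(None, text.strip().split("\n")))


def parse_lines_strict(text: str) -> List[str]:
    """Hypothetical variant: trim every line and drop whitespace-only lines too."""
    return [line.strip() for line in text.splitlines() if line.strip()]


# Invented response containing an empty line and a whitespace-only line.
sample_output = "What is task decomposition?\n\n  \nHow do agents plan?  \n"

basic = parse_lines(sample_output)
strict = parse_lines_strict(sample_output)
print(basic)
print(strict)
```

The basic version keeps the whitespace-only line as a query, which would then be sent to the vector store. If the model's output is not perfectly clean, a stricter parser along these lines may be worth considering.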

Common Issues and Solutions

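Network access issues: in some regions, network restrictions can interfere with access to the OpenAI API. Developers can consider using an API proxy service for more stable access, setting the API endpoint to {AI_URL}.
Inaccurate results: if the retrieved documents are not relevant enough, try adjusting the LLM's temperature parameter or customizing the query-generation prompt.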
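One possible way to wire in a proxy endpoint is sketched below. `ChatOpenAI` accepts a `base_url` argument; the environment variable name `LLM_PROXY_BASE_URL`, the helper `llm_kwargs`, and the example URL are assumptions made for this sketch, and the actual endpoint depends on the proxy service you use.

```python
import os
from typing import Dict, Mapping, Optional


def llm_kwargs(env: Optional[Mapping[str, str]] = None) -> Dict[str, object]:
    """Build keyword arguments for ChatOpenAI.

    base_url is added only when a proxy endpoint is configured.
    """
    env = os.environ if env is None else env
    kwargs: Dict[str, object] = {"temperature": 0}
    proxy_url = env.get("LLM_PROXY_BASE_URL")  # hypothetical variable name
    if proxy_url:
        kwargs["base_url"] = proxy_url
    return kwargs


# Example with an explicit mapping instead of the real environment.
without_proxy = llm_kwargs({})
with_proxy = llm_kwargs({"LLM_PROXY_BASE_URL": "https://proxy.example/v1"})
print(without_proxy, with_proxy)

# Usage (requires langchain_openai and an API key):
# from langchain_openai import ChatOpenAI
# llm = ChatOpenAI(**llm_kwargs())
```

Keeping the endpoint in configuration rather than in code makes it easier to switch between direct and proxied access without touching the retriever setup.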

Summary and Further Learning Resources

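MultiQueryRetriever enriches retrieval results by generating varied versions of the query. It removes much of the manual tuning otherwise needed to craft queries, so developers can focus on higher-level tasks.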

To learn more, developers can consult the following resources:

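LangChain documentation and API reference
OpenAI's guide to embedding models
Prompt engineering techniques for large language models
Online articles about accessing APIs through proxy services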

References

Lilian Weng, "LLM Powered Autonomous Agents", https://lilianweng.github.io/posts/2023-06-23-agent/
