Overcoming the "Lost in the Middle" Effect: Optimizing Document Order to Improve Retrieval Quality

Article link: https://blog.youkuaiyun.com/tt_jishu/article/details/144069123

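When building retrieval-augmented generation (RAG) applications, developers often find that model performance degrades noticeably as the number of retrieved documents grows, and the problem becomes especially visible once more than about 10 documents are retrieved. The main cause is that, in a long context, the model tends to overlook relevant information located in the middle. This article discusses how to mitigate this "lost in the middle" effect by reordering documents so that the most relevant ones appear at the beginning and end of the context, where a large language model (LLM) is more likely to use them.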

Why Reordering Documents Matters

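A query against a vector store typically returns documents in descending order of relevance (for example, as measured by cosine similarity between embeddings). In a long context, however, this ordering can cause the model to overlook key information that lands in the middle. Reordering the documents so that the most relevant ones sit at the two ends of the context and the least relevant ones sit in the middle can help the model attend to the important information.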
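
The idea can be sketched in plain Python. The helper below, `reorder_for_long_context`, is a hypothetical name introduced only for illustration and is not LangChain's implementation. It assumes the input list is already sorted from most to least relevant, and it alternates items between the front and the back so that the least relevant ones end up in the middle.

```python
from typing import List, TypeVar

T = TypeVar("T")


def reorder_for_long_context(ranked: List[T]) -> List[T]:
    """Place the most relevant items at both ends of the list.

    `ranked` is assumed to be sorted from most to least relevant.
    Items at even positions fill the front half in order; items at odd
    positions fill the back half in reverse, so relevance falls off
    toward the middle from both sides.
    """
    front: List[T] = []
    back: List[T] = []
    for position, item in enumerate(ranked):
        if position % 2 == 0:
            front.append(item)
        else:
            back.append(item)
    return front + back[::-1]


# Ranks 1 (most relevant) to 6 (least relevant) stand in for documents.
print(reorder_for_long_context([1, 2, 3, 4, 5, 6]))  # [1, 3, 5, 6, 4, 2]
```

A library transformer may place individual documents at different positions (for example, putting the top-ranked document last rather than first), but the shape of the result is the same: relevance is highest at the edges and lowest in the middle.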

Example: Reordering Documents for Long Contexts

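In this section, we show how to reorder documents with the LongContextReorder document transformer from the LangChain library. We first embed a few artificial documents and index them in a Chroma vector store.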

%pip install --upgrade --quiet sentence-transformers langchain-chroma langchain langchain-community langchain-openai langchain-huggingface > /dev/null

from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings

# Load the embedding model
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

texts = [
    "Basketball is a great sport.",
    "Fly me to the moon is one of my favourite songs.",
    "The Celtics are my favourite team.",
    "This is a document about the Boston Celtics",
    "I simply love going to the movies",
    "The Boston Celtics won the game by 20 points",
    "This is just a random text.",
    "Elden Ring is one of the best games in the last 15 years.",
    "L. Kornet is one of the best Celtics players.",
    "Larry Bird was an iconic NBA player.",
]

# Create a retriever that returns all 10 documents
retriever = Chroma.from_texts(texts, embedding=embeddings).as_retriever(
    search_kwargs={"k": 10}
)
query = "What can you tell me about the Celtics?"

# Retrieve the documents, ordered by relevance score
docs = retriever.invoke(query)
docs

The documents are returned in descending order of relevance to the query. Next, we use the LongContextReorder transformer to reorder them:

from langchain_community.document_transformers import LongContextReorder

# Reorder the documents:
reordering = LongContextReorder()
reordered_docs = reordering.transform_documents(docs)

# Confirm that the most relevant documents are at the beginning and end
reordered_docs
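
Reading the reordered list by eye gets tedious as the number of documents grows. One way to make the check explicit is to map each reordered document back to its original relevance rank. The helper below, `original_ranks`, is a hypothetical name introduced for illustration, not part of LangChain. It assumes each document can be identified by a unique key, such as its text.

```python
from typing import Callable, Hashable, List, Sequence, TypeVar

T = TypeVar("T")


def original_ranks(
    ranked: Sequence[T],
    reordered: Sequence[T],
    key: Callable[[T], Hashable] = lambda item: item,
) -> List[int]:
    """Return the 1-based relevance rank of each item in `reordered`.

    `ranked` is the relevance-ordered list (most relevant first).
    `key` extracts a unique, hashable identifier from each item.
    """
    rank_of = {key(item): rank for rank, item in enumerate(ranked, start=1)}
    return [rank_of[key(item)] for item in reordered]


# Stand-in data: plain strings instead of retrieved documents.
ranked = ["a", "b", "c", "d"]
reordered = ["a", "c", "d", "b"]
print(original_ranks(ranked, reordered))  # [1, 3, 4, 2]

# With the retrieved documents above, and assuming their texts are unique:
# original_ranks(docs, reordered_docs, key=lambda d: d.page_content)
```

If the reordering worked as intended, the smallest rank numbers appear at the two ends of the returned list and the largest ones appear in the middle.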