Cohere reranker

This notebook shows how to use Cohere's rerank endpoint in a retriever. It builds on the idea behind the ContextualCompressionRetriever.

%pip install --upgrade --quiet  cohere
%pip install --upgrade --quiet  langchain-cohere
%pip install --upgrade --quiet  faiss

# OR  (depending on Python version)

%pip install --upgrade --quiet  faiss-cpu
# get a new token: https://dashboard.cohere.ai/

import getpass
import os

os.environ["COHERE_API_KEY"] = getpass.getpass("Cohere API Key:")
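
The calls to pretty_print_docs further down assume a small helper for printing retrieved documents; a minimal version (the same shape as the helper used in the LangChain docs) looks like this:

# Helper function for printing docs


def pretty_print_docs(docs):
    print(
        f"\n{'-' * 100}\n".join(
            [f"Document {i + 1}:\n\n" + d.page_content for i, d in enumerate(docs)]
        )
    )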

Set up the base vector store retriever

Let's start by initializing a simple vector store retriever and storing the 2023 State of the Union speech (in chunks). We can set up the retriever to retrieve a high number (20) of documents.

from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings import CohereEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter

documents = TextLoader("../../how_to/state_of_the_union.txt").load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
texts = text_splitter.split_documents(documents)
retriever = FAISS.from_documents(texts, CohereEmbeddings()).as_retriever(
    search_kwargs={"k": 20}
)

query = "What did the president say about Ketanji Brown Jackson"
docs = retriever.invoke(query)
pretty_print_docs(docs)
API Reference: TextLoader | CohereEmbeddings | FAISS | RecursiveCharacterTextSplitter
Document 1:

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
----------------------------------------------------------------------------------------------------
Document 2:

We cannot let this happen. 

Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service.
----------------------------------------------------------------------------------------------------
Document 3:

As I said last year, especially to our younger transgender Americans, I will always have your back as your President, so you can be yourself and reach your God-given potential. 

While it often appears that we never agree, that isn’t true. I signed 80 bipartisan bills into law last year. From preventing government shutdowns to protecting Asian-Americans from still-too-common hate crimes to reforming military justice.
----------------------------------------------------------------------------------------------------

Reranking with CohereRerank

Now let's wrap our base retriever with a ContextualCompressionRetriever. We'll add a CohereRerank compressor, which uses the Cohere rerank endpoint to rerank the returned results.

from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank
from langchain_community.llms import Cohere

llm = Cohere(temperature=0)
compressor = CohereRerank()
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)

compressed_docs = compression_retriever.invoke(
    "What did the president say about Ketanji Jackson Brown"
)
pretty_print_docs(compressed_docs)
API Reference: ContextualCompressionRetriever | CohereRerank | Cohere
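
By default the compressor returns only a handful of reranked documents. If you need to control how many documents come back, or pin a specific rerank model, CohereRerank accepts top_n and model parameters; the values below are illustrative, so check them against your installed langchain_cohere version:

compressor = CohereRerank(model="rerank-english-v3.0", top_n=5)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)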

You can of course use this retriever within a QA pipeline.

from langchain.chains import RetrievalQA
API Reference: RetrievalQA
chain = RetrievalQA.from_chain_type(
    llm=Cohere(temperature=0), retriever=compression_retriever
)

chain.invoke({"query": query})
{'query': 'What did the president say about Ketanji Brown Jackson',
 'result': " The president speaks highly of Ketanji Brown Jackson, stating that she is one of the nation's top legal minds, and will continue the legacy of excellence of Justice Breyer. The president also mentions that he worked with her family and that she comes from a family of public school educators and police officers. Since her nomination, she has received support from various groups, including the Fraternal Order of Police and judges from both major political parties. \n\nWould you like me to extract another sentence from the provided text? "}
### Reranker: implementation and applications

#### What is a reranker

A reranker is a technique for improving retrieval quality: it takes the initial retrieval results and re-ranks them, further improving the relevance and accuracy of what is returned.

#### How a reranker works

A reranker takes the result list from first-stage retrieval as input and evaluates and reorders it more carefully. This step usually relies on more expensive machine-learning models, such as BERT or other Transformer-based models, which capture deeper semantic information and adjust the ranking accordingly.

#### Use cases and technology choices

- **Cohere AI and Jina Reranker**: these services offer a simple way to add advanced natural-language-processing capability to an existing search engine. For example, Jina Reranker v2 provides a strong, multi-stage-trained model suited to precise reranking in a wide range of real-world scenarios.
- **BGE-Reranker-large**: large pre-trained models of this kind are popular for their ease of use. They can be deployed and called through a standard API, and the official tutorials help developers pick up the key techniques quickly.

#### Performance considerations

Adding a reranker can noticeably improve the quality of the final output, but it also adds latency and cost. Before adopting one, weigh:

- **accuracy vs. latency**
- **cost-effectiveness**

When user experience is the priority and resources allow, an architecture with a rerank stage is a sensible choice. In latency-sensitive, high-traffic settings, you may need a lighter-weight alternative, or enable the reranker only for a few core features, to balance these trade-offs.

#### Example code snippet

The snippet below shows how to load and use a reranking model from the Hugging Face Hub (the model name is kept from the original text; the scoring assumes a two-logit classification head):

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "Alibaba-NLP/new-impl"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

def rerank(query, documents):
    # Score every (query, document) pair with the sequence-classification head
    inputs = tokenizer(
        [query] * len(documents), documents,
        return_tensors="pt", padding=True, truncation=True,
    )
    outputs = model(**inputs).logits.detach().numpy()
    # Column 1 is treated as the relevance score (assumes a two-logit head)
    scores = list(zip(outputs[:, 1], documents))
    sorted_scores = sorted(scores, key=lambda x: x[0], reverse=True)
    return [doc for score, doc in sorted_scores]

example_query = "What is the capital of France?"
candidate_documents = [
    "Paris is known as the city of lights.",
    "Berlin has many historical landmarks.",
    # ... more candidate documents
]
reranked_results = rerank(example_query, candidate_documents)
print(reranked_results[:5])  # Top five most relevant results.
```

The script above performs basic re-ranking with a model published by Alibaba.

#### Optimizing the vector index itself

Beyond relying on high-level NLP tools, you can also improve overall system efficiency at the infrastructure level: customize the embedding model so it better fits the structure of your domain-specific data, or adopt a hybrid query strategy that combines traditional keyword matching with modern vector similarity search (a minimal sketch of such a setup follows below). Both approaches help achieve a better end-to-end result.
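
A minimal sketch of such a hybrid setup, reusing the texts, retriever, and pretty_print_docs defined earlier (BM25Retriever additionally needs the rank_bm25 package; the weights are illustrative, not tuned):

```python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# Keyword-style retriever over the same chunks that back the FAISS index
bm25_retriever = BM25Retriever.from_documents(texts)
bm25_retriever.k = 10

# Fuse keyword matching with vector similarity search
# (EnsembleRetriever combines the ranked lists with Reciprocal Rank Fusion)
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, retriever], weights=[0.5, 0.5]
)

hybrid_docs = hybrid_retriever.invoke(
    "What did the president say about Ketanji Brown Jackson"
)
pretty_print_docs(hybrid_docs)
```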