Building an Intelligent Retrieval System with Pinecone and SelfQueryRetriever

Article link: https://blog.youkuaiyun.com/srudfktuffk/article/details/145150977

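In this article, we look at how to combine the Pinecone vector database with SelfQueryRetriever to build an intelligent document retrieval system. Through a hands-on demo, we show how to store movie data, create a queryable vector index, and answer complex retrieval requests expressed in natural language. The approach works well both for recommendation systems and for semantic search.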

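1. Technical Background

Pinecone is a vector database for storing and querying embedding vectors efficiently. Combined with an embedding service (such as the OpenAI embeddings API), it lets developers build context-retrieval applications of the kind used alongside GPT-style models.

SelfQueryRetriever is a tool that parses a query and generates structured filter conditions automatically. It lets users interact with the database in natural language, which greatly simplifies retrieval.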
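
To make "structured filter conditions" concrete, here is a minimal sketch of what a parsed query might look like. `Comparison`, `StructuredQuery`, and `COMPARATORS` are hypothetical stand-ins written for this illustration; they are not LangChain's own classes, and the real retriever relies on an LLM to produce the structure.

```python
import operator
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# Hypothetical stand-ins for illustration only; not LangChain classes.
COMPARATORS: Dict[str, Callable[[Any, Any], bool]] = {
    "eq": operator.eq,
    "gt": operator.gt,
    "lt": operator.lt,
}


@dataclass
class Comparison:
    attribute: str
    comparator: str
    value: Any

    def matches(self, metadata: Dict[str, Any]) -> bool:
        # A document that lacks the attribute never matches.
        if self.attribute not in metadata:
            return False
        return COMPARATORS[self.comparator](metadata[self.attribute], self.value)


@dataclass
class StructuredQuery:
    query: str                 # text left over for semantic search
    filters: List[Comparison]  # every comparison must hold


# "I want to watch a movie rated higher than 8.5" might decompose into:
structured = StructuredQuery(query="", filters=[Comparison("rating", "gt", 8.5)])

movies = [
    {"year": 1993, "rating": 7.7},
    {"year": 2010, "rating": 8.2},
    {"year": 1979, "rating": 9.9},
]
kept = [m for m in movies if all(f.matches(m) for f in structured.filters)]
print(kept)
```

Only the 1979 entry survives the filter, which is the behavior you would hope for from a "rated higher than 8.5" request.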

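2. Core Principles

The core flow of this example is as follows:

Data preparation: store the movie text and its metadata (year, rating, director, and so on) as Document objects.
Embedding generation: use the OpenAI embeddings API to turn each movie summary into a high-dimensional vector.
Vector index creation: store the embeddings in a Pinecone vector index.
Natural-language retrieval:
- SelfQueryRetriever turns the user's natural-language query into a semantic query (embedded for vector matching) plus metadata filter conditions.
- The matching documents are retrieved and returned.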
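
The flow above can be sketched without any external service. This is a toy model under stated assumptions: `toy_embed` is a bag-of-words counter standing in for a real embedding model, the `min_year` filter stands in for the metadata conditions, and all function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def toy_embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would call an embedding model.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: str, min_year: int,
           corpus: List[Tuple[str, Dict[str, int]]], k: int = 2) -> List[str]:
    # Step 1: apply the metadata filter.
    candidates = [(text, meta) for text, meta in corpus if meta["year"] >= min_year]
    # Step 2: rank the survivors by similarity to the query vector.
    q = toy_embed(query)
    ranked = sorted(candidates, key=lambda item: cosine(q, toy_embed(item[0])),
                    reverse=True)
    return [text for text, _ in ranked[:k]]


corpus = [
    ("scientists bring back dinosaurs", {"year": 1993}),
    ("a dream within a dream", {"year": 2010}),
    ("three men walk into the zone", {"year": 1979}),
]
top = search("movies about dinosaurs", min_year=1990, corpus=corpus, k=1)
print(top)
```

In the real pipeline, Pinecone performs both the filtering and the nearest-neighbor ranking on the server side; the sketch only shows how the two parts of a query cooperate.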

3. Code Walkthrough

Install the required dependencies

The following notebook commands set up the development environment.

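# Install the Pinecone client and the Lark parser
%pip install --upgrade --quiet lark
%pip install --upgrade --quiet pinecone-client==3.2.2
# The snippets below also import langchain, langchain-openai, and langchain-pinecone.
# Versions are not pinned in the original; adjust them if pip reports a conflict.
%pip install --upgrade --quiet langchain langchain-openai langchain-pinecone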

Configure Pinecone and the OpenAI API

Use the following code to connect to the Pinecone service and the OpenAI embedding service:

from pinecone import Pinecone, ServerlessSpec
from langchain_openai import OpenAIEmbeddings
import os

# Connect to the Pinecone service
api_key = "your-pinecone-api-key"  # replace with your Pinecone API key
index_name = "langchain-self-retriever-demo"

pc = Pinecone(api_key=api_key)
# PineconeVectorStore (used later) reads the key from this environment variable
os.environ["PINECONE_API_KEY"] = api_key

# Generate embeddings with the OpenAI Embeddings service
openai_api_key = "your-openai-api-key"  # replace with your OpenAI API key
os.environ["OPENAI_API_KEY"] = openai_api_key
embeddings = OpenAIEmbeddings()
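
Hardcoding keys is fine for a quick demo, but it is easy to leak them through version control. A minimal alternative, assuming the keys are already exported in your shell, is to read them from the environment and fail early when one is missing. `require_env` is a hypothetical helper name, not part of either SDK.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise if it is unset/blank."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Usage sketch (left commented out so the snippet runs without real credentials):
# pc = Pinecone(api_key=require_env("PINECONE_API_KEY"))
# os.environ["OPENAI_API_KEY"] = require_env("OPENAI_API_KEY")
```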

Create the vector index and store the documents

Here we prepare a small movie dataset:

from langchain_core.documents import Document
from langchain_pinecone import PineconeVectorStore

# Prepare the documents: page_content is the summary, metadata holds the filterable fields
docs = [
    Document(page_content="A bunch of scientists bring back dinosaurs and mayhem breaks loose",
             metadata={"year": 1993, "rating": 7.7, "genre": ["action", "science fiction"]}),
    Document(page_content="Leo DiCaprio gets lost in a dream within a dream within a dream within a ...",
             metadata={"year": 2010, "director": "Christopher Nolan", "rating": 8.2}),
    Document(page_content="Three men walk into the Zone, three men walk out of the Zone",
             metadata={"year": 1979, "director": "Andrei Tarkovsky", "genre": ["science fiction", "thriller"], "rating": 9.9}),
]

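# Create the Pinecone index if it does not exist yet
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=1536,  # must equal the length of the vectors produced by `embeddings`
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )

# Embed the documents and store them in the index
vectorstore = PineconeVectorStore.from_documents(docs, embeddings, index_name=index_name)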
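
Before uploading a larger dataset, it can help to check the metadata locally. The sketch below assumes, to the best of my understanding of Pinecone's documented limits, that metadata values must be strings, numbers, booleans, or lists of strings; `invalid_metadata_fields` is a hypothetical helper, not a Pinecone or LangChain function.

```python
from typing import Any, Dict, List


def invalid_metadata_fields(metadata: Dict[str, Any]) -> List[str]:
    """Return the keys whose values fall outside the assumed supported types."""
    bad = []
    for key, value in metadata.items():
        # Scalars: string, boolean, integer, float
        if isinstance(value, (str, bool, int, float)):
            continue
        # Lists are accepted only when every element is a string
        if isinstance(value, list) and all(isinstance(v, str) for v in value):
            continue
        bad.append(key)
    return bad


ok = {"year": 1993, "rating": 7.7, "genre": ["action", "science fiction"]}
broken = {"year": None, "cast": [{"name": "x"}]}

print(invalid_metadata_fields(ok))
print(invalid_metadata_fields(broken))
```

Running the check over `doc.metadata` for every document before calling `from_documents` gives a clearer error than a failed upload would.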

Create the SelfQueryRetriever

Next, we initialize the SelfQueryRetriever:

from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain_openai import OpenAI

# Describe the metadata fields the LLM is allowed to filter on
metadata_field_info = [
    AttributeInfo(name="genre", description="The genre of the movie", type="string or list[string]"),
    AttributeInfo(name="year", description="The year the movie was released", type="integer"),
    AttributeInfo(name="director", description="The name of the movie director", type="string"),
    AttributeInfo(name="rating", description="A 1-10 rating for the movie", type="float"),
]

document_content_description = "Brief summary of a movie"
llm = OpenAI(temperature=0)

# Initialize the SelfQueryRetriever
retriever = SelfQueryRetriever.from_llm(
    llm, vectorstore, document_content_description, metadata_field_info, verbose=True
)

Test the retriever

The following examples show how to query the movie data in natural language:

Query by keyword:

retriever.invoke("What are some movies about dinosaurs")

Filter on metadata:

retriever.invoke("I want to watch a movie rated higher than 8.5")

Combine a semantic query with a filter:

retriever.invoke("What's a highly rated (above 8.5) science fiction film?")

Query with a composite filter:

retriever.invoke(
    "What's a movie after 1990 but before 2005 that's all about toys, and preferably is animated"
)
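
Each call above returns a list of documents. A small helper for printing them readably might look like the sketch below. `format_results` is a hypothetical name, and the sample uses `SimpleNamespace` as a stand-in for a returned document so that the snippet runs without any API keys; it only assumes the objects expose `page_content` and `metadata`, as the `Document` objects created earlier do.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List


def format_results(docs: Iterable[Any]) -> List[str]:
    """Render one line per document: rank, year, rating, then the summary."""
    lines = []
    for rank, doc in enumerate(docs, start=1):
        year = doc.metadata.get("year", "n/a")
        rating = doc.metadata.get("rating", "n/a")
        lines.append(f"{rank}. [{year}, rating {rating}] {doc.page_content}")
    return lines


# Stand-in for what a retriever call might return
sample = [
    SimpleNamespace(
        page_content="Three men walk into the Zone, three men walk out of the Zone",
        metadata={"year": 1979, "rating": 9.9},
    )
]
for line in format_results(sample):
    print(line)
```

With the real retriever, the call would be `format_results(retriever.invoke("..."))`.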

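4. Use Cases

The approach is not limited to movie recommendation. It also applies to the following scenarios:

Personalized content recommendation: quickly match content to a user's interests.
Enterprise document search: find the content relevant to a user's query in a large document collection.
Knowledge-base question answering: combine retrieval with a language model to provide ChatGPT-style answers.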

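5. Practical Advice

Embedding quality: choose a suitable embedding model (such as OpenAI Embeddings) to improve retrieval precision.
Index tuning: set the index dimension to match the embedding model's output (1536 in this example), and size Pinecone capacity settings, such as the number of shards, to your data volume.
Query design: well-chosen metadata fields and clear field descriptions help SelfQueryRetriever produce better filters.
Performance monitoring: in production, monitor query latency and accuracy, and add caching to improve performance.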
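
The monitoring and caching advice can be prototyped with the standard library alone. In this sketch, `make_cached_timed` is a hypothetical wrapper and `fake_search` stands in for a call such as `retriever.invoke`; the cache matches on the exact query string, which is an assumption that suits repeated identical queries only.

```python
import time
from functools import lru_cache
from typing import Callable, Dict, List, Tuple


def make_cached_timed(search_fn: Callable[[str], List[str]], maxsize: int = 128):
    """Wrap a search function with an LRU cache and simple latency counters."""
    stats: Dict[str, float] = {"calls": 0, "backend_calls": 0, "total_seconds": 0.0}

    @lru_cache(maxsize=maxsize)
    def cached(query: str) -> Tuple[str, ...]:
        # Runs only on a cache miss, so this counts real backend calls.
        stats["backend_calls"] += 1
        return tuple(search_fn(query))

    def run(query: str) -> Tuple[str, ...]:
        start = time.perf_counter()
        result = cached(query)
        stats["total_seconds"] += time.perf_counter() - start
        stats["calls"] += 1
        return result

    return run, stats


def fake_search(query: str) -> List[str]:
    # Stand-in for an expensive retriever call
    return [query.upper()]


run, stats = make_cached_timed(fake_search)
run("dinosaurs")
run("dinosaurs")  # served from the cache
print(stats["calls"], stats["backend_calls"])
```

Comparing `calls` with `backend_calls` gives the cache hit rate, and `total_seconds / calls` gives the average latency seen by callers.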

If you run into problems, feel free to discuss them in the comments.

—END—