Data Connectors

Original article, last modified 2024-07-31 16:16:32

License: CC 4.0 BY-SA

Tags: #RAG #llmaindex

First published 2024-07-31 16:16:21

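Concept

A data connector (also called a Reader) is a tool that extracts data from different data sources and data formats and converts it into a simple document representation: text plus simple metadata.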
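To make the "text plus simple metadata" idea concrete, here is a minimal standard-library sketch. It does not use LlamaIndex: `SimpleDocument` and `read_json_lines` are hypothetical names invented for illustration, and the `"body"` field is an assumed input format.

```python
import json
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a document: text plus simple metadata."""
    text: str
    metadata: dict = field(default_factory=dict)


def read_json_lines(raw: str, source: str) -> list[SimpleDocument]:
    """Turn each JSON line that has a "body" field into one SimpleDocument."""
    documents = []
    for line_number, line in enumerate(raw.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        documents.append(
            SimpleDocument(
                text=record["body"],
                metadata={"source": source, "line": line_number},
            )
        )
    return documents


# Assumed sample input: two records separated by a blank line.
raw_export = '{"body": "First note"}\n\n{"body": "Second note"}\n'
docs = read_json_lines(raw_export, source="notes.jsonl")
for doc in docs:
    print(doc.metadata["line"], doc.text)
```

Whatever the source looks like, the reader's job ends once everything is in this one uniform shape; indexing and querying only ever see documents.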

Tip:

Once you have extracted the data, you can build an Index on top of it, ask questions with a Query Engine, and hold a conversation with a Chat Engine.

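Usage Pattern

Getting Started

Each data loader includes a "Usage" section that shows how to use it. The documented core of each loader is the download_loader function, which downloads the loader file into a module that you can use in your application. Note that the example below imports the reader class directly from its package (llama_index.readers.google) and never calls download_loader.

Example usage: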

from llama_index.core import VectorStoreIndex
from llama_index.readers.google import GoogleDocsReader

# IDs of the Google Docs to load
gdoc_ids = ["1wf-y2pd9C878Oh-FmLH7Q_BQkljdm6TQal-c1pUfrec"]
loader = GoogleDocsReader()
documents = loader.load_data(document_ids=gdoc_ids)

# Build an index over the documents and query it
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("Where did the author go to school?")
print(response)

LlamaHub

Our data connectors are offered through LlamaHub 🦙. LlamaHub is an open-source repository of data loaders that you can easily plug into any LlamaIndex application.

Usage pattern:

from llama_index.readers.google import GoogleDocsReader

# Instantiate the reader and load the documents by ID
loader = GoogleDocsReader()
documents = loader.load_data(document_ids=[...])

See the full usage pattern guide for more details.

Modules

Some example data connectors:

Local file directory (SimpleDirectoryReader). Supports parsing many file types: .pdf, .jpg, .png, .docx, and more.
Notion (NotionPageReader)
Google Docs (GoogleDocsReader)
Slack (SlackReader)
Discord (DiscordReader)
Apify Actors (ApifyActor). Can crawl websites, scrape web pages, extract text content, and download files including .pdf, .jpg, .png, .docx, and more.
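One way a directory reader can support several file types is to pick a parser by file extension. The sketch below illustrates that idea with the standard library only. It is not how SimpleDirectoryReader is implemented: `PARSERS`, `parse_plain_text`, `parse_csv`, and `load_directory` are hypothetical names, and the supported extensions are an assumption chosen to keep the example dependency-free.

```python
import csv
import tempfile
from pathlib import Path


def parse_plain_text(path: Path) -> str:
    return path.read_text(encoding="utf-8")


def parse_csv(path: Path) -> str:
    # Flatten each row into one comma-separated line of text.
    with path.open(newline="", encoding="utf-8") as handle:
        return "\n".join(", ".join(row) for row in csv.reader(handle))


# Hypothetical registry: file extension -> parser function.
PARSERS = {".txt": parse_plain_text, ".md": parse_plain_text, ".csv": parse_csv}


def load_directory(input_dir: str) -> list[dict]:
    """Return one {"text", "metadata"} dict per supported file, sorted by name."""
    documents = []
    for path in sorted(Path(input_dir).iterdir()):
        parser = PARSERS.get(path.suffix.lower())
        if parser is None or not path.is_file():
            continue  # unsupported type or subdirectory: skip it
        documents.append(
            {"text": parser(path), "metadata": {"file_name": path.name}}
        )
    return documents


# Demo on a throwaway directory with two supported files and one unsupported.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("hello", encoding="utf-8")
    Path(tmp, "b.csv").write_text("x,y\n1,2\n", encoding="utf-8")
    Path(tmp, "c.bin").write_bytes(b"\x00\x01")
    loaded = load_directory(tmp)

print([d["metadata"]["file_name"] for d in loaded])
```

Adding a new file type then only means registering one more parser, which is roughly why a single reader can cover many formats.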

Programming Examples

The following concrete examples show how to use different data connectors:

1. Local file directory (SimpleDirectoryReader)

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load the documents in a local directory
reader = SimpleDirectoryReader(input_dir="path/to/directory")
documents = reader.load_data()

# Build the index
index = VectorStoreIndex.from_documents(documents)

# Create the query engine
query_engine = index.as_query_engine()

# Query
response = query_engine.query("Your question here")
print(response)

2. Google Docs (GoogleDocsReader)

from llama_index.core import VectorStoreIndex
from llama_index.readers.google import GoogleDocsReader

# Load documents with the Google Docs reader
gdoc_ids = ["your-google-docs-document-id"]
loader = GoogleDocsReader()
documents = loader.load_data(document_ids=gdoc_ids)

# Build the index
index = VectorStoreIndex.from_documents(documents)

# Create the query engine
query_engine = index.as_query_engine()

# Query
response = query_engine.query("Your question here")
print(response)

3. Notion (NotionPageReader)

from llama_index.core import VectorStoreIndex
from llama_index.readers.notion import NotionPageReader

# Configure the Notion API token and page IDs
notion_token = "your-notion-api-token"
page_ids = ["your-notion-page-id"]

# Load documents with the Notion reader
loader = NotionPageReader(integration_token=notion_token)
documents = loader.load_data(page_ids=page_ids)

# Build the index
index = VectorStoreIndex.from_documents(documents)

# Create the query engine
query_engine = index.as_query_engine()

# Query
response = query_engine.query("Your question here")
print(response)
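The Notion example keeps the API token as a string literal in the source. A common alternative is to read it from an environment variable and fail early when it is missing. The sketch below shows one way to do that; `get_required_env` is a hypothetical helper, and the variable name `NOTION_INTEGRATION_TOKEN` is an assumption used for illustration.

```python
import os


def get_required_env(name: str) -> str:
    """Return a non-empty environment variable or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Demo only: set a fake value so the sketch runs end to end.
# In real use the variable would be set outside the program.
os.environ["NOTION_INTEGRATION_TOKEN"] = "secret-placeholder"
notion_token = get_required_env("NOTION_INTEGRATION_TOKEN")
```

The resulting `notion_token` can then be passed to the reader in place of the hard-coded string.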

Extensions

Custom Data Connectors

You can create a custom data connector to meet specific needs. Here is a simple example showing how to create one:

import requests
from llama_index.core import Document, VectorStoreIndex
from llama_index.core.readers.base import BaseReader


class CustomReader(BaseReader):
    def load_data(self, url):
        # Fetch the URL and wrap the response body in a single Document
        response = requests.get(url, timeout=30)
        response.raise_for_status()  # fail loudly on HTTP errors
        return [Document(text=response.text, metadata={"url": url})]


# Use the custom reader
reader = CustomReader()
documents = reader.load_data("https://example.com")

# Build the index
index = VectorStoreIndex.from_documents(documents)

# Create the query engine
query_engine = index.as_query_engine()

# Query
response = query_engine.query("Your question here")
print(response)
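The custom reader above stores the raw response body, which for a web page means HTML markup ends up in the document text. A possible refinement is to strip the tags before building the document. The sketch below does this with the standard library's `html.parser`; `html_to_text` and `_TextExtractor` are hypothetical names, and the whitespace handling is a simplifying assumption.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, ignoring the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the chunks and collapse every run of whitespace to one space.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html: str) -> str:
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><p>Hello <b>world</b></p><script>var x = 1;</script></body></html>"
)
print(html_to_text(sample))
```

Inside `load_data`, the text passed to the document could then be `html_to_text(response.text)` instead of the raw body. One known limitation of this sketch: because chunks are joined with spaces, a word split across inline tags (for example `wor<b>ld</b>`) comes out with a space in the middle.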