Data Loaders in LlamaIndex: How to Handle Data in Different Formats

Originally published 2025-12-01 18:59:44

License: CC 4.0 BY-SA

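Abstract

When building an LLM application, acquiring and processing data is the essential first step. LlamaIndex provides a rich set of data loaders (Readers) that load many kinds of data from many sources, including PDF documents, Word files, web pages, and database records. This article walks through the data loaders in LlamaIndex, uses code examples to show how to handle different data formats, and shares some practical best practices.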

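Main Text

1. Introduction

In the previous posts we covered LlamaIndex's core concepts and basic usage. Real applications have to handle data in many formats, coming from file systems, databases, web APIs, and other sources. LlamaIndex's data loader ecosystem gives developers the tools to handle these scenarios.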

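2. Data Loader Basics

LlamaIndex data loaders fall into two main groups:

Core loaders: shipped in the llama-index-core package; they provide basic data loading
Integration loaders: shipped as separate integration packages; they support more specialized data sources

All data loaders follow the same interface: a load_data() method that returns a list of Document objects. This makes it easy to switch between data sources.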
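
To make the idea of a shared interface concrete, here is a minimal sketch that uses stand-in classes rather than the real LlamaIndex ones. `Doc`, `ListReader`, `LinesReader`, and `count_characters` are hypothetical names invented for this illustration; the only point is that downstream code can depend on `load_data()` alone and stay unaware of the concrete reader.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Protocol


@dataclass
class Doc:
    """Stand-in for a document: text plus metadata (not the LlamaIndex class)."""
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


class Reader(Protocol):
    """Anything with load_data() returning a list of Doc counts as a reader."""
    def load_data(self) -> List[Doc]: ...


class ListReader:
    """Hypothetical reader that wraps in-memory strings."""
    def __init__(self, texts: List[str]) -> None:
        self.texts = texts

    def load_data(self) -> List[Doc]:
        return [Doc(text=t, metadata={"source": "memory"}) for t in self.texts]


class LinesReader:
    """Hypothetical reader that turns each non-empty line of a string into a Doc."""
    def __init__(self, blob: str) -> None:
        self.blob = blob

    def load_data(self) -> List[Doc]:
        return [Doc(text=line, metadata={"source": "lines"})
                for line in self.blob.splitlines() if line.strip()]


def count_characters(reader: Reader) -> int:
    """Downstream code depends only on load_data(), not on the concrete reader."""
    return sum(len(doc.text) for doc in reader.load_data())


total_a = count_characters(ListReader(["abc", "de"]))
total_b = count_characters(LinesReader("abc\n\nde\n"))
```

Both readers can be passed to the same function without any change to the calling code, which is the property the real loaders share.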

3. Core Data Loaders

3.1 SimpleDirectoryReader

SimpleDirectoryReader is the most commonly used data loader. It loads files in many formats from a given directory:

from llama_index.core import SimpleDirectoryReader

# Load every file in a supported format from the given directory
reader = SimpleDirectoryReader("./data")
documents = reader.load_data()

# Load specific files
reader = SimpleDirectoryReader(
    input_files=["./data/file1.txt", "./data/file2.pdf"]
)
documents = reader.load_data()

# Load subdirectories recursively
reader = SimpleDirectoryReader("./data", recursive=True)
documents = reader.load_data()

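The file formats SimpleDirectoryReader supports include:

Text files (.txt)
Markdown files (.md)
PDF files (.pdf)
Word documents (.docx)
CSV files (.csv)
HTML files (.html)

Formats other than plain text may need extra packages installed, for example pypdf for PDF and docx2txt for Word.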

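3.2 StringIterableReader

Creates Document objects from a list of strings:

from llama_index.core.readers import StringIterableReader

# Create Documents from a list of strings
texts = [
    "This is the first piece of text",
    "This is the second piece of text",
    "This is the third piece of text"
]

reader = StringIterableReader()
documents = reader.load_data(texts)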

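4. Integration Data Loaders

LlamaIndex offers more than 150 integration data loaders that load data from a wide range of sources. Some commonly used ones are shown below:

4.1 Database Loader

# Requires: pip install llama-index-readers-database
# A PostgreSQL driver such as psycopg2 must also be installed
from llama_index.readers.database import DatabaseReader

# Connect to the database (the keyword for the account name is user)
reader = DatabaseReader(
    scheme="postgresql",
    host="localhost",
    port=5432,
    user="user",
    password="password",
    dbname="mydb",
)

# Run a SQL query
documents = reader.load_data(query="SELECT * FROM articles")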

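4.2 Web Page Loader

# Requires: pip install llama-index-readers-web
from llama_index.readers.web import SimpleWebPageReader

# Load web page content. By default the raw HTML is returned;
# pass html_to_text=True (requires the html2text package) for plain text
reader = SimpleWebPageReader()
documents = reader.load_data(urls=["https://example.com"])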

4.3 YouTube Loader

# Requires: pip install llama-index-readers-youtube-transcript
from llama_index.readers.youtube_transcript import YoutubeTranscriptReader

# Load YouTube video transcripts
reader = YoutubeTranscriptReader()
documents = reader.load_data(ytlinks=["https://youtube.com/watch?v=..."])

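4.4 Slack Loader

# Requires: pip install llama-index-readers-slack
from llama_index.readers.slack import SlackReader

# Load Slack messages. The token is read from the SLACK_BOT_TOKEN
# environment variable unless slack_token is passed explicitly
reader = SlackReader()
documents = reader.load_data(channel_ids=["C012AB3CD"])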

5. Custom Data Loaders

When the existing loaders do not meet your needs, you can write a custom data loader:

from llama_index.core.readers.base import BaseReader
from llama_index.core.schema import Document
from typing import List

class CustomDataReader(BaseReader):
    """Example custom data loader"""

    def load_data(self, file_path: str) -> List[Document]:
        """Load data from a file in a custom format"""
        documents = []

        # Read the file content
        with open(file_path, 'r', encoding='utf-8') as f:
            content = f.read()

        # Parse the custom format
        # Assumed file format: TITLE:title text\nCONTENT:content\n---
        sections = content.split('---')

        for section in sections:
            if section.strip():
                lines = section.strip().split('\n')
                title = ""
                text = ""

                for line in lines:
                    if line.startswith('TITLE:'):
                        title = line[6:].strip()
                    elif line.startswith('CONTENT:'):
                        # Only the single line starting with CONTENT: is kept
                        text = line[8:].strip()

                # Create the Document object
                doc = Document(
                    text=text,
                    metadata={
                        "title": title,
                        "source": file_path,
                        "format": "custom"
                    }
                )
                documents.append(doc)

        return documents

# Use the custom loader
custom_reader = CustomDataReader()
documents = custom_reader.load_data("./data/custom_format.txt")
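
The reader above keeps only the one line that starts with CONTENT:, so content spread over several lines would be cut short. The sketch below shows one possible variant of the parsing step that collects continuation lines as well. It uses only the standard library; `SAMPLE` and `parse_sections` are hypothetical names, and the sample file content is made up to match the format assumed in the comment above.

```python
from typing import Dict, List

SAMPLE = """TITLE: First note
CONTENT: Line one of the first note.
Second line of the same note.
---
TITLE: Second note
CONTENT: Only one line here.
---
"""


def parse_sections(raw: str) -> List[Dict[str, str]]:
    """Parse TITLE:/CONTENT: sections separated by '---'."""
    records = []
    for section in raw.split("---"):
        title = ""
        content_lines: List[str] = []
        in_content = False
        for line in section.strip().splitlines():
            if line.startswith("TITLE:"):
                title = line[len("TITLE:"):].strip()
                in_content = False
            elif line.startswith("CONTENT:"):
                content_lines.append(line[len("CONTENT:"):].strip())
                in_content = True
            elif in_content:
                # Continuation line: belongs to the current CONTENT block
                content_lines.append(line.strip())
        # Skip empty trailing sections such as the one after the last '---'
        if title or content_lines:
            records.append({"title": title, "text": "\n".join(content_lines)})
    return records


records = parse_sections(SAMPLE)
```

Each record could then be wrapped as `Document(text=record["text"], metadata={"title": record["title"]})` inside a reader's `load_data()`.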

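6. Advanced Usage

6.1 Extracting File Metadata

You can attach custom metadata to loaded files. The function passed as file_metadata receives the file path and returns a dict:

from llama_index.core import SimpleDirectoryReader
import os

def filename_metadata(filename):
    """Return the file path and file size as metadata"""
    return {"filename": filename, "file_size": os.path.getsize(filename)}

# Use the custom metadata function
reader = SimpleDirectoryReader(
    "./data",
    file_metadata=filename_metadata
)
documents = reader.load_data()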
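
A metadata function can carry more than the name and size. The following is a sketch of a slightly richer version built on the standard library only; the function name `file_metadata` and the chosen keys are illustrative assumptions, not a schema that LlamaIndex requires.

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict, Union


def file_metadata(file_path: str) -> Dict[str, Union[str, int]]:
    """Build a metadata dict from a file path using only the file system."""
    path = Path(file_path)
    stat = path.stat()
    modified = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
    return {
        "file_name": path.name,
        "extension": path.suffix.lower(),  # normalised so ".PDF" and ".pdf" match
        "file_size": stat.st_size,
        "modified_date": modified.strftime("%Y-%m-%d"),
    }


# Exercise the function on a throwaway file
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "Report.PDF"
    sample.write_bytes(b"12345")
    meta = file_metadata(str(sample))
```

Because it takes a path and returns a dict, a function of this shape could be passed as the `file_metadata` argument shown above.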

6.2 Filtering by File Type

You can restrict loading to specific file types:

from llama_index.core import SimpleDirectoryReader

# Load only PDF and TXT files
reader = SimpleDirectoryReader(
    "./data",
    required_exts=[".pdf", ".txt"]
)
documents = reader.load_data()

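6.3 Processing Large Datasets in Batches

For large datasets, iterate over the files as they load instead of loading everything at once:

from llama_index.core import SimpleDirectoryReader

def process_documents(documents):
    """Placeholder: replace with your own processing logic"""
    print(f"Processing {len(documents)} documents")

# iter_data() yields the list of Documents for one file at a time
reader = SimpleDirectoryReader("./large_dataset")
for documents in reader.iter_data():
    process_documents(documents)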

7. Handling Specific Formats in Practice

7.1 PDF Documents

from llama_index.core import SimpleDirectoryReader

# Load PDF documents; the default PDF reader returns one Document per page
reader = SimpleDirectoryReader("./pdfs", required_exts=[".pdf"])
documents = reader.load_data()

# Inspect the metadata attached to each page
for doc in documents:
    print(f"File name: {doc.metadata.get('file_name', 'N/A')}")
    print(f"Page label: {doc.metadata.get('page_label', 'N/A')}")
    print(f"Content preview: {doc.text[:100]}...")

7.2 Word Documents

from llama_index.core import SimpleDirectoryReader

# Load Word documents
reader = SimpleDirectoryReader("./docs", required_exts=[".docx"])
documents = reader.load_data()

# Print simple text statistics (the extracted text is plain; style information is not preserved)
for doc in documents:
    line_count = len(doc.text.split("\n"))
    print(f"Line count: {line_count}")
    print(f"Character count: {len(doc.text)}")

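7.3 CSV Data

import csv
from llama_index.core import SimpleDirectoryReader

# Load CSV files
reader = SimpleDirectoryReader("./data", required_exts=[".csv"])
documents = reader.load_data()

# Process tabular data. doc.text is flattened text: splitting it on commas
# breaks on quoted fields, and the header row may not be preserved, so
# re-read the source file with the csv module when column names matter.
for doc in documents:
    with open(doc.metadata["file_path"], newline="", encoding="utf-8") as f:
        rows = csv.DictReader(f)
        print(f"Headers: {rows.fieldnames}")

        # Process the data rows
        for row in rows:
            print(f"Data row: {dict(row)}")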
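
For retrieval it is often handy to have one text record per CSV row, with the column names kept next to the values. The sketch below shows one way to do that with the standard csv module, which also handles quoted fields that contain commas. `CSV_TEXT` and `rows_to_records` are hypothetical names and the sample data is made up.

```python
import csv
import io
from typing import Dict, List

CSV_TEXT = 'id,title,comment\n1,Intro,"Short, but useful"\n2,Setup,Needs review\n'


def rows_to_records(csv_text: str) -> List[Dict[str, object]]:
    """Turn each CSV row into a text string plus a metadata dict."""
    records = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        # "column: value" pairs keep the column names next to the values
        text = "; ".join(f"{column}: {value}" for column, value in row.items())
        records.append({"text": text, "metadata": {"row": row_number}})
    return records


records = rows_to_records(CSV_TEXT)
```

Each record maps naturally onto `Document(text=..., metadata=...)` if one Document per row is wanted.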

8. Error Handling and Best Practices

8.1 Exception Handling

from llama_index.core import SimpleDirectoryReader
import logging

# Configure logging
logging.basicConfig(level=logging.INFO)

try:
    reader = SimpleDirectoryReader("./data")
    documents = reader.load_data()
    print(f"Loaded {len(documents)} documents")
except Exception as e:
    logging.error(f"Error while loading documents: {e}")
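
A single try/except around the whole load means one bad file aborts everything. The sketch below shows one way to isolate failures per file and keep going. `load_each` and `fake_loader` are hypothetical helpers written for this example; with the real library, `load_one` could be a small function that calls `SimpleDirectoryReader(input_files=[path]).load_data()`.

```python
import logging
from typing import Callable, Dict, List, Tuple

logger = logging.getLogger(__name__)


def load_each(
    paths: List[str],
    load_one: Callable[[str], List[str]],
) -> Tuple[List[str], Dict[str, str]]:
    """Load files one by one; collect failures instead of aborting the run."""
    loaded: List[str] = []
    failed: Dict[str, str] = {}
    for path in paths:
        try:
            loaded.extend(load_one(path))
        except (OSError, ValueError) as exc:
            # Record the failure and continue with the remaining files
            logger.warning("Skipping %s: %s", path, exc)
            failed[path] = str(exc)
    return loaded, failed


def fake_loader(path: str) -> List[str]:
    """Hypothetical loader used only to exercise load_each."""
    if path.endswith(".bad"):
        raise ValueError("unsupported format")
    return [f"contents of {path}"]


docs, errors = load_each(["a.txt", "b.bad", "c.txt"], fake_loader)
```

The returned `errors` dict gives a per-file report that can be logged or retried later.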

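8.2 Memory Optimization

import gc
from llama_index.core import SimpleDirectoryReader

def process_batch(documents):
    """Placeholder: replace with your own processing logic"""
    print(f"Processing {len(documents)} documents")

# For large datasets, iterate file by file instead of loading everything at once
reader = SimpleDirectoryReader("./large_data")

for i, documents in enumerate(reader.iter_data()):
    # Process the Documents from the current file
    process_batch(documents)

    # Trigger garbage collection every 100 files
    if i % 100 == 0:
        gc.collect()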
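
Iterating file by file yields groups of uneven size, since one file may produce one document and another may produce hundreds. If a fixed batch size is wanted, the groups have to be flattened and regrouped. Below is a minimal sketch of such a helper; `in_batches` is a hypothetical name, and it works on any iterable that yields lists.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def in_batches(groups: Iterable[List[T]], batch_size: int) -> Iterator[List[T]]:
    """Flatten per-file lists and regroup the items into fixed-size batches."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # Lazy generator: items are pulled from the source only as batches need them
    flat = (item for group in groups for item in group)
    while True:
        batch = list(islice(flat, batch_size))
        if not batch:
            return
        yield batch


per_file = [["a1", "a2", "a3"], ["b1"], [], ["c1", "c2"]]
batches = list(in_batches(per_file, batch_size=4))
```

Only one batch is held in memory at a time, so the helper can sit between a per-file iterator and the processing function.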

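8.3 Data Cleaning

from llama_index.core import SimpleDirectoryReader
from llama_index.core.schema import Document
import re

def clean_document(doc: Document) -> Document:
    """Clean the document content"""
    # Collapse runs of whitespace into a single space
    cleaned_text = re.sub(r'\s+', ' ', doc.text)

    # Remove special characters (this also strips full-width punctuation such as ， and 。)
    cleaned_text = re.sub(r'[^\w\s\u4e00-\u9fff.,!?;:]', '', cleaned_text)

    # Update the document content
    doc.text = cleaned_text.strip()

    return doc

# Clean the loaded documents
reader = SimpleDirectoryReader("./data")
documents = reader.load_data()

cleaned_documents = [clean_document(doc) for doc in documents]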
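
It helps to check what the two regular expressions actually do before running them over a whole corpus. The sketch below applies the same patterns to plain strings; `clean_text` is a hypothetical helper that mirrors the cleaning logic above without needing a Document object.

```python
import re


def clean_text(text: str) -> str:
    """Collapse whitespace, then drop characters outside the allowed set."""
    collapsed = re.sub(r"\s+", " ", text)
    filtered = re.sub(r"[^\w\s\u4e00-\u9fff.,!?;:]", "", collapsed)
    return filtered.strip()


english = clean_text("  Hello,\n\n  world!  (v2.0) #tag ")
chinese = clean_text("你好，世界！")

print(english)  # Hello, world! v2.0 tag
print(chinese)  # 你好世界
```

Note that the full-width comma and exclamation mark in the Chinese sample are removed, because the allowed set lists only ASCII punctuation. If that punctuation matters, it needs to be added to the character class.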

9. Performance Tips

9.1 Parallel Loading

import asyncio
from llama_index.core import SimpleDirectoryReader

async def load_documents_async():
    reader = SimpleDirectoryReader("./data")
    documents = await reader.aload_data()
    return documents

# Load documents asynchronously
# (for multi-process loading, reader.load_data(num_workers=4) is another option)
documents = asyncio.run(load_documents_async())
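
When a loader only offers a blocking call, concurrency can still be added from the outside. The sketch below runs blocking loads in worker threads with a cap on how many run at once. It needs Python 3.9 or later for `asyncio.to_thread`; `load_one` is a hypothetical stand-in for a real blocking loader, and `load_all` and `max_concurrency` are names chosen for this example.

```python
import asyncio
from typing import List


def load_one(path: str) -> List[str]:
    """Hypothetical blocking loader standing in for a real file reader."""
    return [f"contents of {path}"]


async def load_all(paths: List[str], max_concurrency: int = 4) -> List[str]:
    """Run blocking loads in worker threads, at most max_concurrency at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(path: str) -> List[str]:
        async with semaphore:
            return await asyncio.to_thread(load_one, path)

    results = await asyncio.gather(*(guarded(p) for p in paths))
    # gather preserves input order, so the output order matches paths
    return [doc for docs in results for doc in docs]


docs = asyncio.run(load_all(["a.txt", "b.txt", "c.txt"]))
```

Threads help most when loading is I/O bound; for CPU-heavy parsing, separate processes are usually the better fit.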

9.2 Caching

import hashlib
import pickle
import os
from llama_index.core import SimpleDirectoryReader

def load_documents_with_cache(reader, cache_file="documents_cache.pkl"):
    """Load documents with a simple on-disk cache"""
    # Build the cache key from the list of input files (file contents are not tracked)
    cache_key = hashlib.md5(str(reader.input_files).encode()).hexdigest()

    # Check whether the cache exists (only unpickle cache files you created yourself)
    if os.path.exists(cache_file):
        with open(cache_file, 'rb') as f:
            cache = pickle.load(f)
            if cache_key in cache:
                print("Loading documents from cache")
                return cache[cache_key]

    # Load the documents
    documents = reader.load_data()

    # Save to the cache
    cache = {}
    if os.path.exists(cache_file):
        with open(cache_file, 'rb') as f:
            cache = pickle.load(f)

    cache[cache_key] = documents
    with open(cache_file, 'wb') as f:
        pickle.dump(cache, f)

    return documents

# Use the cached loader
reader = SimpleDirectoryReader("./data")
documents = load_documents_with_cache(reader)
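
The cache above is keyed on the list of input files only, so editing a file without renaming it would still return the stale cached result. One possible fix is to fold each file's size and modification time into the key. The sketch below shows the idea with the standard library; `cache_key` is a hypothetical helper, and it could be given `reader.input_files` as its argument.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Iterable


def cache_key(paths: Iterable[Path]) -> str:
    """Hash each file's path, size, and modification time into one key."""
    digest = hashlib.sha256()
    # Sort so the key does not depend on the order the paths were listed in
    for path in sorted(Path(p) for p in paths):
        stat = path.stat()
        digest.update(f"{path}|{stat.st_size}|{stat.st_mtime_ns}\n".encode("utf-8"))
    return digest.hexdigest()


# Exercise the function: the key is stable until the file changes
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "note.txt"
    sample.write_text("first version", encoding="utf-8")
    key_before = cache_key([sample])
    key_same = cache_key([sample])

    sample.write_text("second version, longer", encoding="utf-8")
    key_after = cache_key([sample])
```

Hashing file contents would be stricter still, at the cost of reading every file on each run.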