PyPDFLoader

PyPDFLoader is a document loader in the LangChain library. It extracts the text of a PDF file and converts it into Document objects, the format LangChain uses for downstream steps such as embedding, vector storage, and RAG retrieval.


✅ What it does

In plain terms:
📄 It acts as a "PDF text extractor": it reads the content of a PDF file and turns it into structured document chunks that downstream LLM steps can work with.


🧱 Usage example

A typical usage looks like this:

from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("example.pdf")
pages = loader.load()

print(pages[0].page_content)

✅ Line-by-line explanation

  • from langchain_community.document_loaders import PyPDFLoader: imports the PyPDFLoader class from LangChain's document_loaders module
  • loader = PyPDFLoader("example.pdf"): creates a loader instance for the file example.pdf
  • pages = loader.load(): loads the PDF and returns a list of Document objects, one per page
  • print(pages[0].page_content): prints the plain text extracted from the first page

🔍 What does it return?

Each page becomes a Document object with two key fields:

Document(
    page_content="The text extracted from this page...",
    metadata={"source": "example.pdf", "page": 0}
)
  • page_content: the plain text extracted from the page
  • metadata: metadata such as the source file name and page number, useful for tracing where content came from (see the short loop below)
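
For example, a short loop like this (reusing the pages list from the load() call above) shows how the metadata lets you trace each chunk back to its file and page:

for doc in pages:
    # each Document carries its source file and page index in its metadata
    print(doc.metadata["source"], doc.metadata["page"], len(doc.page_content))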

✨ Where PyPDFLoader fits

It is typically used in a pipeline like this:

PDF → PyPDFLoader → list of Documents → TextSplitter → embedding → retrieval → LLM Q&A

A concrete example:

Suppose you have a medical-guideline PDF and want to build a question-answering system (a minimal code sketch follows the list):

  1. PyPDFLoader loads the PDF into structured text
  2. TextSplitter splits it into appropriately sized chunks
  3. The chunks are embedded and stored in a vector database (e.g. FAISS/Milvus)
  4. When a user asks a question, the relevant chunks are retrieved
  5. The retrieved context is passed to the LLM, which generates the answer
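
A minimal sketch of those steps, assuming an OpenAI embedding model, the FAISS vector store from langchain_community, and a hypothetical medical_guide.pdf (package paths and the retriever API vary slightly across LangChain versions):

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings  # assumption: swap in whichever embedding model you use

# 1. Load the PDF into per-page Documents
pages = PyPDFLoader("medical_guide.pdf").load()

# 2. Split the pages into retrieval-sized chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(pages)

# 3. Embed the chunks and store them in a FAISS index
vector_store = FAISS.from_documents(chunks, OpenAIEmbeddings())

# 4. Retrieve the chunks most relevant to a question
retriever = vector_store.as_retriever(search_kwargs={"k": 4})
relevant_docs = retriever.invoke("What are the post-operative care precautions?")

# 5. Pass relevant_docs as context to your LLM of choice to generate the answer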

⚠️ Caveats

  • It relies on an underlying PDF parser (pypdf, the successor to PyPDF2), so extraction quality on PDFs with complex layouts is not guaranteed
  • It only extracts plain text and cannot recover images or table structure (though it can be combined with OCR for richer extraction, as sketched below)
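
For image-heavy PDFs, recent langchain_community releases let PyPDFLoader run OCR on embedded images via an extract_images flag; treat this as an assumption to verify against your installed version, and note that it requires the optional rapidocr-onnxruntime package:

# assumption: extract_images is supported by your langchain_community version
# and rapidocr-onnxruntime is installed for the OCR step
loader = PyPDFLoader("example.pdf", extract_images=True)
pages = loader.load()  # text recognized in images is included in page_content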

✅ Summary

  • 📄 Function: loads a PDF into structured Document objects
  • 🧠 Use cases: RAG Q&A, summarization, search, and similar tasks in LangChain
  • 🧱 Output: one Document per page, with page-number metadata attached
  • 🚧 Limitations: cannot extract images or complex tables; text quality depends on the underlying parser

Combining PyPDFLoader with PyPDF2 lets you extract the text and the PDF's "real" metadata at the same time:

  • PyPDFLoader: extracts the text of each page
  • PyPDF2: reads the PDF's document-level metadata (title, author, creation time, etc.)
  • ✅ Both are wrapped in a custom CustomPDFLoader class that returns the text together with the enriched metadata

✅ Custom loader: CustomPDFLoader

from langchain_core.documents import Document
from langchain_community.document_loaders import PyPDFLoader
from PyPDF2 import PdfReader
from pathlib import Path

class CustomPDFLoader:
    def __init__(self, file_path: str):
        self.file_path = file_path

    def load(self):
        # Read the PDF's document-level metadata
        reader = PdfReader(self.file_path)
        pdf_metadata = reader.metadata or {}
        metadata_common = {
            "title": pdf_metadata.get("/Title", ""),
            "author": pdf_metadata.get("/Author", ""),
            "creator": pdf_metadata.get("/Creator", ""),
            "producer": pdf_metadata.get("/Producer", ""),
            "creation_date": pdf_metadata.get("/CreationDate", ""),
            "mod_date": pdf_metadata.get("/ModDate", ""),
            "source": str(Path(self.file_path).resolve())
        }

        # Load the per-page text with PyPDFLoader
        text_loader = PyPDFLoader(self.file_path)
        pages = text_loader.load()

        # Enrich each page with the document-level metadata
        documents = []
        for i, page in enumerate(pages):
            enriched_metadata = dict(metadata_common)  # copy the shared metadata
            enriched_metadata["page"] = page.metadata.get("page", i)

            documents.append(Document(
                page_content=page.page_content,
                metadata=enriched_metadata
            ))

        return documents

✅ Usage example

loader = CustomPDFLoader("example.pdf")
docs = loader.load()

# Print the first page's content and metadata
print("Content:\n", docs[0].page_content[:300])  # only the first 300 characters
print("\nMetadata:\n", docs[0].metadata)

✅ Example output (illustrative)

Content:
 Chapter 1: Medical Safety Guidelines...
 ...

Metadata:
{
  'title': 'Medical Safety Guide',
  'author': 'Dr. Zhang Wei',
  'creator': 'Microsoft Word',
  'producer': 'Adobe PDF Library',
  'creation_date': 'D:20230710120000',
  'mod_date': 'D:20230711154500',
  'source': '/Users/you/project/example.pdf',
  'page': 0
}

✅ Optional enhancement: prepend the page number to the text

If you want each page's text to begin with a marker such as "[Page X]", adjust the line that builds the page content:

page_content = f"[Page {i+1}]\n" + page.page_content

and pass that page_content into the Document instead of page.page_content.
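
Inside CustomPDFLoader.load(), the enrichment loop would then look roughly like this (a sketch under the same assumptions as the loader above):

for i, page in enumerate(pages):
    enriched_metadata = dict(metadata_common)
    enriched_metadata["page"] = page.metadata.get("page", i)

    # prepend a human-readable page marker to the extracted text
    page_content = f"[Page {i+1}]\n" + page.page_content

    documents.append(Document(
        page_content=page_content,
        metadata=enriched_metadata
    ))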


🧠 Recap

  • 🧾 Real metadata: reads the PDF's own author, title, creation time, and other document properties
  • 📄 Text loading: handled by LangChain's PyPDFLoader
  • 📌 Page numbers: taken from PyPDFLoader, one Document per page
  • 📁 File path: written into the source field
  • ✅ Output: ready for embedding, RAG retrieval, Q&A systems, and similar downstream use

Full code:

from langchain_core.documents import Document
from langchain_community.document_loaders import PyPDFLoader
from PyPDF2 import PdfReader
from pathlib import Path

class CustomPDFLoader:
    def __init__(self, file_path: str):
        self.file_path = file_path

    def load(self):
        # Read the PDF's document-level metadata
        reader = PdfReader(self.file_path)
        pdf_metadata = reader.metadata or {}
        metadata_common = {
            "title": pdf_metadata.get("/Title", ""),
            "author": pdf_metadata.get("/Author", ""),
            "creator": pdf_metadata.get("/Creator", ""),
            "producer": pdf_metadata.get("/Producer", ""),
            "creation_date": pdf_metadata.get("/CreationDate", ""),
            "mod_date": pdf_metadata.get("/ModDate", ""),
            "source": str(Path(self.file_path).resolve())
        }

        # Load the per-page text with PyPDFLoader
        text_loader = PyPDFLoader(self.file_path)
        pages = text_loader.load()

        # Enrich each page with the document-level metadata
        documents = []
        for i, page in enumerate(pages):
            enriched_metadata = dict(metadata_common)  # copy the shared metadata
            enriched_metadata["page"] = page.metadata.get("page", i)
            enriched_metadata["keyword"] = '这里假装通过LLM提取或者NER提取关键字'
            documents.append(Document(
                page_content=page.page_content,
                metadata=enriched_metadata
            ))

        return documents


path = "../data/Understanding_Climate_Change.pdf"
loader = CustomPDFLoader(path)
docs = loader.load()

# Print the first page's content and metadata
print("Content:\n", docs[0].page_content[:300])  # only the first 300 characters
print("\nMetadata:\n", docs[0].metadata)
