5. Data Interaction Basics: The Complete Flow from Text Loading to Vector Storage

Latest recommended article published on 2025-04-30 09:53:35

Part of the column "LangChain in Practice: Building Intelligent Applications from Beginner to Expert"

Introduction: Data Is the "Fuel" of AI, but How Do You Refine It?

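In 2025, a legal-tech company whose contract processing had been slow cut the time per contract from 3 hours to 10 minutes. The key was an automated data pipeline. This article walks you step by step through the full flow from raw text to vector storage using LangChain + Deepseek-R1, and tackles several data-processing problems that come up in real industry settings.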

I. The Four Steps of Data Interaction: From Chaos to Structure

1.1 Overview of the Core Flow

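1.2 Toolchain Selection Guide (2025 Edition)

Stage	Recommended tool	Use case
Loading	`TextLoader`/`UnstructuredLoader`	Reading files in multiple formats
Splitting	`RecursiveCharacterTextSplitter`	General-purpose text splitting
Embedding	`OllamaEmbeddings`	Lightweight deployment of local models
Storage	`FAISS`	Fast local retrieval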
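
To make the four stages in the table concrete, here is a minimal, dependency-free sketch of the same load, split, embed, store sequence. Everything in it is an illustrative assumption: `Chunk`, `ToyStore`, and the hashed character-count `embed` function are stand-ins invented for this example, not LangChain classes and not a real embedding model.

```python
from dataclasses import dataclass, field
import math


@dataclass
class Chunk:
    text: str
    source: str


def load(raw_files: dict[str, str]) -> list[Chunk]:
    """Stage 1 (load): wrap each raw text in a record that remembers its source."""
    return [Chunk(text=text, source=name) for name, text in raw_files.items()]


def split(docs: list[Chunk], size: int = 20) -> list[Chunk]:
    """Stage 2 (split): cut every document into fixed-size pieces."""
    return [
        Chunk(text=doc.text[i:i + size], source=doc.source)
        for doc in docs
        for i in range(0, len(doc.text), size)
    ]


def embed(text: str, dim: int = 8) -> list[float]:
    """Stage 3 (embed): toy hashed bag-of-characters vector, NOT a real model."""
    vec = [0.0] * dim
    for ch in text:
        vec[ord(ch) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


@dataclass
class ToyStore:
    """Stage 4 (store): keep each vector next to the chunk it came from."""
    rows: list[tuple[list[float], Chunk]] = field(default_factory=list)

    def add(self, chunks: list[Chunk]) -> None:
        self.rows.extend((embed(c.text), c) for c in chunks)


store = ToyStore()
store.add(split(load({"demo.txt": "A loan contract is an agreement to lend money."})))
print(len(store.rows), "chunks stored")
```

In the real pipeline each stage is replaced by the corresponding tool from the table; the point of the sketch is only that every stage consumes the previous stage's output and that the source metadata travels with each chunk.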

II. Hands-On: Building a Legal Contract Analysis Pipeline

2.1 Text Loading and Cleaning

from langchain_unstructured import UnstructuredLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load the PDF contract
loader = UnstructuredLoader("中华人民共和国合同法.pdf", mode="elements")
documents = loader.load()

# Split the text into chunks (parameters tuned for legal clauses)
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n第", "条", "\n"]
)
chunks = splitter.split_documents(documents)
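
To see what `chunk_size` and `chunk_overlap` mean, the sketch below uses a plain sliding window over characters. This is a simplification under stated assumptions: `sliding_chunks` is a hypothetical helper, and it is not how `RecursiveCharacterTextSplitter` works internally (that class first splits on the separators and then merges pieces up to the size limit). It only illustrates how consecutive chunks share an overlapping tail.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size that overlap by chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for piece in sliding_chunks(sample, chunk_size=10, chunk_overlap=4):
    print(piece)
```

With a size of 10 and an overlap of 4, each chunk starts with the last four characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.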

2.2 Vectorization and Local Storage (FAISS Version)

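from langchain_ollama import OllamaEmbeddings
from langchain_community.vectorstores import FAISS

# Initialize the local embedding model
# (deepseek-r1 is primarily a reasoning model; a dedicated embedding model may give better retrieval quality)
embeddings = OllamaEmbeddings(model="deepseek-r1")

# Build the local FAISS index
vector_db = FAISS.from_documents(
    chunks,
    embeddings
)

# Save the index to disk (no database service required)
vector_db.save_local("./faiss_legal_index")

# Retrieval example; the query means "What is a loan contract?"
query = "什么是借款合同？"
results = vector_db.similarity_search(query, k=3)
for result in results:
    print(result.page_content)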

The output is:

WARNING: CropBox missing from /Page, defaulting to MediaBox
INFO: HTTP Request: POST http://0.0.0.0:8434/api/embed "HTTP/1.1 200 OK"
INFO: Loading faiss with AVX512 support.
INFO: Successfully loaded faiss with AVX512 support.
INFO: Failed to load GPU Faiss: name 'GpuIndexIVFFlat' is not defined. Will not load constructor refs for GPU indexes.
INFO: HTTP Request: POST http://0.0.0.0:8434/api/embed "HTTP/1.1 200 OK"
法律、行政法规规定的权利和义务订立合同。
府指导价的，按照规定履行。
条款。
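
Conceptually, `similarity_search(query, k=3)` embeds the query and returns the stored chunks whose vectors are closest to it. The sketch below shows that idea as a brute-force scan, assuming Euclidean (L2) distance, which to my knowledge is the default for LangChain's FAISS wrapper. The two-dimensional vectors and texts are made up for illustration, and `top_k` is a hypothetical helper, not a FAISS API.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, indexed, k=3):
    """Return the k (distance, text) pairs closest to query_vec."""
    scored = sorted((l2_distance(query_vec, vec), text) for vec, text in indexed)
    return scored[:k]


# Made-up vectors standing in for embedded chunks
indexed = [
    ([1.0, 0.0], "loan contract clause"),
    ([0.9, 0.1], "lending terms"),
    ([0.0, 1.0], "lease clause"),
    ([-1.0, 0.0], "unrelated text"),
]

for dist, text in top_k([1.0, 0.0], indexed, k=2):
    print(f"{dist:.3f}  {text}")
```

Retrieval quality therefore depends entirely on how well the embedding places related texts near each other. If the returned fragments look unrelated to the query, as in the output above, the embedding model and the chunking are the first things to inspect.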

2.3 Automated Update Strategy (FAISS Version)

class AutoUpdateFAISS:
    """Incrementally maintained FAISS index (reuses `splitter` from section 2.1)."""

    def __init__(self):
        self.embeddings = OllamaEmbeddings(model="deepseek-r1")
        self.db = None

    def load_index(self, path: str):
        self.db = FAISS.load_local(
            folder_path=path,
            embeddings=self.embeddings,
            allow_dangerous_deserialization=True  # explicitly allow loading the local pickle; use only with trusted files
        )

    def add_file(self, file_path: str):
        loader = UnstructuredLoader(file_path)
        chunks = splitter.split_documents(loader.load())
        if self.db is None:
            self.db = FAISS.from_documents(chunks, self.embeddings)
        else:
            self.db.add_documents(chunks)

    def delete_by_source(self, source: str):
        # FAISS cannot delete by metadata filter, so collect the matching docstore IDs manually
        if self.db is None:
            return
        ids = [doc_id for doc_id, doc in self.db.docstore._dict.items()
               if doc.metadata.get("source") == source]
        if ids:
            # delete() removes the vectors and keeps the docstore and ID mapping in sync
            self.db.delete(ids)
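
Deleting by source only works if the vectors and the document records are removed together; removing one without the other leaves the index pointing at missing documents. The following sketch shows that bookkeeping with a plain in-memory structure. `ToyIndex` and the file names are hypothetical and exist only for this illustration; they are not part of FAISS or LangChain.

```python
class ToyIndex:
    """Tiny in-memory index that keeps vectors and documents under the same ID."""

    def __init__(self):
        self.vectors = {}   # doc_id -> vector
        self.docs = {}      # doc_id -> {"text": ..., "source": ...}
        self._next_id = 0

    def add(self, vector, text, source):
        doc_id = f"doc-{self._next_id}"
        self._next_id += 1
        self.vectors[doc_id] = vector
        self.docs[doc_id] = {"text": text, "source": source}
        return doc_id

    def delete_by_source(self, source):
        """Remove every entry from the given source; return how many were removed."""
        doomed = [i for i, d in self.docs.items() if d.get("source") == source]
        for doc_id in doomed:
            # Delete from both maps so they never drift apart
            del self.vectors[doc_id]
            del self.docs[doc_id]
        return len(doomed)


index = ToyIndex()
index.add([0.1, 0.2], "clause A", "old_contract.pdf")
index.add([0.3, 0.4], "clause B", "old_contract.pdf")
index.add([0.5, 0.6], "clause C", "new_contract.pdf")
removed = index.delete_by_source("old_contract.pdf")
print(removed, sorted(index.docs))
```

Using `.get("source")` instead of indexing directly means chunks without a `source` field are skipped rather than raising a `KeyError`.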

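III. Solutions to Industry Pain Points

3.1 Multi-Format File Compatibility

Problem: text extracted from scanned PDFs comes out garbled
Solution: combine OCR with layout-analysis algorithms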

# Process scanned documents with OCR
from langchain_community.document_loaders import UnstructuredPDFLoader

loader = UnstructuredPDFLoader(
    "scanned_contract.pdf",
    strategy="ocr_only",  # force OCR
    infer_table_structure=True
)

3.2 Semantic Coherence in Long Texts

Problem: contract clauses are split in the wrong places
Solution: custom splitting logic

class LegalTextSplitter(RecursiveCharacterTextSplitter):
    def __init__(self):
        super().__init__(
            separators=["\n\n第", "条\n", "。\n"],
            keep_separator=True  # keep the separators to preserve context
        )
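
Why keeping the separator matters can be shown with a small regular-expression sketch that cuts the text just before each article heading of the form "第…条", so that every chunk keeps its own article number. This is an alternative illustration using only the standard library, not what `RecursiveCharacterTextSplitter` does internally; `split_by_article` and the sample contract text are hypothetical.

```python
import re

# Zero-width lookahead: split *before* each "第…条" heading so the heading stays with its text
ARTICLE_START = re.compile(r"(?=第[一二三四五六七八九十百千零\d]+条)")


def split_by_article(text: str) -> list[str]:
    """Return one stripped chunk per article, each starting with its heading."""
    parts = ARTICLE_START.split(text)
    return [p.strip() for p in parts if p.strip()]


# Made-up contract text for illustration
sample = "第一条 甲方向乙方出借款项。\n第二条 乙方应按期还款。\n"
for article in split_by_article(sample):
    print(article)
```

One caveat of this simple pattern: a cross-reference inside a sentence, such as a clause that mentions "第三条", would also trigger a split, so a production rule would likely need to anchor the heading to the start of a line.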

IV. Performance Optimization: Local Deployment Tips

4.1 Multi-Process Acceleration

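from multiprocessing import Pool

def process_file(file_path):
    # Runs in a worker process: load and split only, then return the chunks
    loader = UnstructuredLoader(file_path)
    return splitter.split_documents(loader.load())

if __name__ == "__main__":
    # file_list: paths of the files to ingest (assumed to be defined elsewhere)
    with Pool(8) as p:
        chunk_lists = p.map(process_file, file_list)
    # Update the index in the parent process; changes made inside workers would be lost
    for chunks in chunk_lists:
        vector_db.add_documents(chunks)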
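
The same fan-out and fan-in pattern can also be sketched with threads from the standard library, which suits workloads dominated by waiting on I/O. In this sketch the workers only return their results and a single writer merges them afterwards, so no shared index is mutated concurrently. `process_file` here is a hypothetical stand-in that fabricates chunk names instead of loading real files, and the paths are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor


def process_file(file_path: str) -> list[str]:
    """Hypothetical stand-in for load + split: returns fake chunks for one file."""
    return [f"{file_path}#chunk{i}" for i in range(2)]


file_list = ["a.pdf", "b.pdf", "c.pdf"]  # placeholder paths

with ThreadPoolExecutor(max_workers=4) as pool:
    # map() preserves input order, so results line up with file_list
    chunk_lists = list(pool.map(process_file, file_list))

# Single writer: merge in the main thread instead of mutating shared state in workers
all_chunks = [chunk for chunks in chunk_lists for chunk in chunks]
print(len(all_chunks))
```

For CPU-heavy parsing, a process pool is usually the better fit because of Python's global interpreter lock; the merge step stays the same either way.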