A Deep Dive into Batching and Parallelization for LangChain Document Processing
1. LangChain Document Processing Infrastructure
1.1 Core Components
The LangChain document-processing system is built around a few key abstractions:
- The Document class: the basic data structure representing a document

class Document(BaseModel):
    """Basic data structure representing a document."""
    page_content: str       # the document text
    metadata: dict          # metadata such as source, page number, etc.
    lookup_str: str = ""    # string used for lookup
    lookup_index: int = 0   # lookup index
- The DocumentLoader interface: loads documents from different sources

class BaseLoader(ABC):
    """Base interface for document loaders."""

    @abstractmethod
    def load(self) -> List[Document]:
        """Load documents and return a list of Document objects."""
        pass

    async def aload(self) -> List[Document]:
        """Load documents asynchronously."""
        raise NotImplementedError("Async loading is not implemented")
- The TextSplitter interface: splits documents into smaller chunks

class TextSplitter(ABC):
    """Base interface for text splitters."""

    @abstractmethod
    def split_text(self, text: str) -> List[str]:
        """Split a text into multiple segments."""
        pass

    def split_documents(self, documents: List[Document]) -> List[Document]:
        """Split a list of documents into smaller documents."""
        pass
- The VectorStore interface: vectorized storage and retrieval of documents

class VectorStore(ABC):
    """Base interface for vector stores."""

    @abstractmethod
    def add_texts(
        self,
        texts: Iterable[str],
        metadatas: Optional[List[dict]] = None,
        **kwargs: Any,
    ) -> List[str]:
        """Add texts to the vector store."""
        pass

    @abstractmethod
    def similarity_search(
        self, query: str, k: int = 4, **kwargs: Any
    ) -> List[Document]:
        """Retrieve documents by similarity."""
        pass
1.2 The Document Processing Pipeline
A typical document-processing pipeline consists of the following steps (a minimal end-to-end sketch follows this list):
- Loading: a source-specific loader fetches raw documents from the file system, the network, or another data source
- Splitting: long documents are split into appropriately sized text chunks
- Embedding: an embedding model converts each chunk into a vector representation
- Storage: vectors and metadata are written to a vector database
- Retrieval: relevant chunks are retrieved with a query vector
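The sketch below strings these five steps together using the classic LangChain APIs (TextLoader, RecursiveCharacterTextSplitter, OpenAIEmbeddings, FAISS). The file path and query are placeholders, and the import paths may differ between LangChain versions (newer releases move loaders and vector stores into langchain_community).

from langchain.document_loaders import TextLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS

# 1. Load (placeholder path)
documents = TextLoader("example.txt").load()
# 2. Split
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(documents)
# 3 & 4. Embed and store (FAISS.from_documents embeds internally)
vectorstore = FAISS.from_documents(chunks, OpenAIEmbeddings())
# 5. Retrieve
results = vectorstore.similarity_search("example query", k=4)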
1.3 Why Batch and Parallelize
When processing documents at scale, sequential processing runs into the following problems:
- Performance bottleneck: single-threaded processing of a large corpus takes far too long
- Low resource utilization: modern multi-core CPUs and GPUs sit mostly idle
- Poor scalability: it is hard to keep up with ever-growing processing demands
Batching and parallelization address these problems in the following ways (a small timing sketch follows this list):
- Batching: group multiple documents or operations into one batch, amortizing per-call and context-switch overhead
- Parallelization: use multiple CPU cores and GPUs to process several tasks at once, increasing throughput
- Asynchronous processing: non-blocking I/O lets the program do useful work while waiting
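The batching claim is easy to demonstrate with a purely illustrative mock: a fake embedding call with a fixed per-request overhead is invoked once per document versus once per batch of 50. The overhead figures are made up; only the relative difference matters.

import time

PER_CALL_OVERHEAD = 0.01  # pretend each API round-trip costs 10 ms

def mock_embed(batch):
    """Mock embedding call: fixed per-call overhead plus per-item work."""
    time.sleep(PER_CALL_OVERHEAD + 0.001 * len(batch))
    return [[0.0] * 3 for _ in batch]

texts = [f"doc {i}" for i in range(200)]

start = time.time()
for t in texts:                        # one call per document
    mock_embed([t])
print("one-by-one:", round(time.time() - start, 2), "s")

start = time.time()
for i in range(0, len(texts), 50):     # one call per batch of 50
    mock_embed(texts[i:i + 50])
print("batched:   ", round(time.time() - start, 2), "s")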
2. Batching and Parallelizing Document Loading
2.1 Limits of Sequential Loading
The traditional way to load documents is to process files one at a time:

def load_documents_sequential(file_paths: List[str]) -> List[Document]:
    documents = []
    for file_path in file_paths:
        # Create the appropriate loader for the file type
        if file_path.endswith('.pdf'):
            loader = PyPDFLoader(file_path)
        elif file_path.endswith('.txt'):
            loader = TextLoader(file_path)
        else:
            # Loaders for other file types...
            continue
        # Load a single document
        loaded_docs = loader.load()
        documents.extend(loaded_docs)
    return documents

This approach has obvious drawbacks:
- I/O-bound work leaves the CPU mostly idle
- A single failed load can break the whole run
- Processing a large number of files takes too long
2.2 Parallel Loading
Parallel document loading can be implemented in several ways:
- A multi-threaded loader

import logging
from concurrent.futures import ThreadPoolExecutor, as_completed

logger = logging.getLogger(__name__)

class ThreadedLoader(BaseLoader):
    """Loader that loads documents in parallel with multiple threads."""

    def __init__(self, loaders: List[BaseLoader], num_workers: int = 4):
        self.loaders = loaders
        self.num_workers = num_workers

    def load(self) -> List[Document]:
        """Load documents in parallel with a thread pool."""
        with ThreadPoolExecutor(max_workers=self.num_workers) as executor:
            # Submit every load task to the pool
            future_to_loader = {
                executor.submit(loader.load): loader for loader in self.loaders
            }
            documents = []
            # Collect results as they complete
            for future in as_completed(future_to_loader):
                try:
                    loaded_docs = future.result()
                    documents.extend(loaded_docs)
                except Exception as e:
                    logger.error(f"Failed to load documents: {e}")
        return documents
- An asynchronous loader

import asyncio

class AsyncLoader(BaseLoader):
    """Loader that loads documents with asynchronous I/O."""

    def __init__(self, loaders: List[BaseLoader]):
        self.loaders = loaders

    async def aload(self) -> List[Document]:
        """Load documents asynchronously."""
        tasks = []
        for loader in self.loaders:
            if hasattr(loader, 'aload'):
                # Use the loader's native async method
                tasks.append(loader.aload())
            else:
                # Wrap the synchronous method so it runs in a worker thread
                tasks.append(asyncio.to_thread(loader.load))
        # Run all tasks concurrently
        results = await asyncio.gather(*tasks)
        # Merge the results
        documents = []
        for result in results:
            documents.extend(result)
        return documents
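A short usage sketch for the two loaders above. The two text files are hypothetical placeholders; asyncio.run drives the asynchronous path.

docs_sync = ThreadedLoader(
    loaders=[TextLoader("a.txt"), TextLoader("b.txt")],  # hypothetical files
    num_workers=4,
).load()

docs_async = asyncio.run(
    AsyncLoader(loaders=[TextLoader("a.txt"), TextLoader("b.txt")]).aload()
)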
2.3 Batched Loading
Besides parallelism, the loading stage can also be optimized through batching:
- Batched file scanning

import os
from typing import Iterator

def scan_files_batch(directory: str, batch_size: int = 100) -> Iterator[List[str]]:
    """Scan a directory and yield file paths in batches."""
    batch = []
    for root, _, files in os.walk(directory):
        for file in files:
            file_path = os.path.join(root, file)
            batch.append(file_path)
            if len(batch) >= batch_size:
                yield batch
                batch = []
    # Yield the remaining files
    if batch:
        yield batch

- Batched load-and-process

def process_documents_in_batches(directory: str, batch_size: int = 50):
    """Load and process documents batch by batch."""
    for batch in scan_files_batch(directory, batch_size):
        # Build one loader per file (create_loader is a helper sketched below)
        loaders = [create_loader(file_path) for file_path in batch]
        batch_loader = ThreadedLoader(loaders, num_workers=8)
        # Load the current batch of documents
        documents = batch_loader.load()
        # Hand the loaded documents to downstream processing
        process_documents(documents)
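Neither create_loader nor process_documents is defined above. A minimal create_loader could map file extensions to loaders as sketched here; the extension-to-loader table is an assumption and should be extended for the formats you actually use.

from langchain.document_loaders import CSVLoader, PyPDFLoader, TextLoader

def create_loader(file_path: str) -> BaseLoader:
    """Pick a loader based on the file extension (illustrative mapping only)."""
    if file_path.endswith(".pdf"):
        return PyPDFLoader(file_path)
    if file_path.endswith(".csv"):
        return CSVLoader(file_path)
    # Fall back to plain text for everything else
    return TextLoader(file_path, encoding="utf-8")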
3. Batching and Parallelizing Text Splitting
3.1 Text Splitting Basics
Text splitting breaks long documents into smaller segments. The core interface looks like this:

import copy
from typing import Callable

class TextSplitter(ABC):
    """Base class for text splitters."""

    def __init__(
        self,
        chunk_size: int = 4000,
        chunk_overlap: int = 200,
        length_function: Callable[[str], int] = len,
    ):
        """Initialize the text splitter."""
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.length_function = length_function

    @abstractmethod
    def split_text(self, text: str) -> List[str]:
        """Split a text into multiple segments."""
        pass

    def split_documents(self, documents: List[Document]) -> List[Document]:
        """Split a list of documents, copying each document's metadata onto its chunks."""
        return [
            Document(
                page_content=chunk,
                metadata=copy.deepcopy(doc.metadata),
            )
            for doc in documents
            for chunk in self.split_text(doc.page_content)
        ]
3.2 Batched Splitting
Several batched splitting strategies are available:
- Thread-pool based parallel splitting

def split_documents_parallel(
    splitter: TextSplitter,
    documents: List[Document],
    num_workers: int = 4
) -> List[Document]:
    """Split documents in parallel with a thread pool."""
    with ThreadPoolExecutor(max_workers=num_workers) as executor:
        # One splitting task per document
        tasks = [
            executor.submit(splitter.split_documents, [doc])
            for doc in documents
        ]
        # Collect results (note: chunk order follows completion order, not input order)
        all_chunks = []
        for future in as_completed(tasks):
            try:
                chunks = future.result()
                all_chunks.extend(chunks)
            except Exception as e:
                logger.error(f"Failed to split document: {e}")
    return all_chunks
- Batch-based splitting

def split_documents_in_batches(
    splitter: TextSplitter,
    documents: List[Document],
    batch_size: int = 50
) -> List[Document]:
    """Split documents batch by batch."""
    all_chunks = []
    for i in range(0, len(documents), batch_size):
        batch = documents[i:i+batch_size]
        # Split the whole batch in one call
        chunks = splitter.split_documents(batch)
        all_chunks.extend(chunks)
    return all_chunks
3.3 Advanced Splitting Strategies
For special document types, LangChain provides dedicated splitters:
- The recursive character splitter

class RecursiveCharacterTextSplitter(TextSplitter):
    """Split text recursively by a list of separators (simplified; the real
    implementation also merges small pieces and applies chunk_overlap)."""

    def __init__(
        self,
        separators: Optional[List[str]] = None,
        **kwargs: Any,
    ):
        """Initialize the recursive splitter."""
        super().__init__(**kwargs)
        self.separators = separators or ["\n\n", "\n", " ", ""]

    def split_text(self, text: str) -> List[str]:
        """Recursively split the text."""
        final_chunks = []
        # If the text is already short enough, return it directly
        if self.length_function(text) <= self.chunk_size:
            return [text]
        # Try each separator in turn
        for separator in self.separators:
            if separator == "":
                separator = None
            # Attempt the split
            if separator is None:
                splits = [text]
            else:
                splits = text.split(separator)
            # If the split produced more than one piece
            if len(splits) > 1:
                # Keep short pieces, recurse into the ones that are still too long
                for s in splits:
                    if self.length_function(s) <= self.chunk_size:
                        final_chunks.append(s)
                    else:
                        # Recursive split
                        sub_splits = self.split_text(s)
                        final_chunks.extend(sub_splits)
                return final_chunks
        # Nothing could be split; return the original text (even though it exceeds chunk_size)
        return [text]
- The token-level splitter

class TokenTextSplitter(TextSplitter):
    """Token-based text splitter."""

    def __init__(
        self,
        encoding_name: str = "gpt2",
        **kwargs: Any,
    ):
        """Initialize the token splitter."""
        super().__init__(**kwargs)
        try:
            import tiktoken
        except ImportError:
            raise ValueError(
                "TokenTextSplitter requires the tiktoken library: "
                "pip install tiktoken"
            )
        self.encoding = tiktoken.get_encoding(encoding_name)
        self.encoding_name = encoding_name

    def split_text(self, text: str) -> List[str]:
        """Split the text by token count (chunk_overlap is ignored in this
        simplified version)."""
        tokens = self.encoding.encode(text)
        # Slice the token sequence into chunks of at most chunk_size tokens
        chunks = [
            tokens[i:i + self.chunk_size]
            for i in range(0, len(tokens), self.chunk_size)
        ]
        # Decode each token chunk back into text
        return [self.encoding.decode(chunk) for chunk in chunks]
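A quick usage sketch for the two splitters above. The sample text and chunk sizes are arbitrary; with the real LangChain classes of the same names the constructor arguments shown here work the same way.

splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=20)
chunks = splitter.split_text("First paragraph...\n\nSecond paragraph...\n\nThird paragraph...")

token_splitter = TokenTextSplitter(encoding_name="gpt2", chunk_size=256)
token_chunks = token_splitter.split_text("A long document body goes here ...")

print(len(chunks), len(token_chunks))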
4. Batching and Parallelizing Embedding
4.1 The Basic Embedding Flow
Embedding converts text into numeric vectors:

class Embeddings(ABC):
    """Base interface for embedding models."""

    @abstractmethod
    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        """Embed a list of documents."""
        pass

    @abstractmethod
    def embed_query(self, text: str) -> List[float]:
        """Embed a query string."""
        pass
4.2 Batched Embedding
Most embedding models support batch requests, which improves efficiency:

import openai

class OpenAIEmbeddings(Embeddings):
    """OpenAI embedding implementation (legacy openai<1.0 SDK style)."""

    def __init__(self, model_name: str = "text-embedding-ada-002", chunk_size: int = 1000):
        self.model_name = model_name
        self.chunk_size = chunk_size  # max number of texts per API request

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        """Embed documents in batches."""
        # Split into batches so each request stays within API limits
        batches = [
            texts[i:i+self.chunk_size]
            for i in range(0, len(texts), self.chunk_size)
        ]
        embeddings = []
        for batch in batches:
            # Call the OpenAI embeddings API for the whole batch
            response = openai.Embedding.create(
                input=batch,
                model=self.model_name,
            )
            # Extract the vectors
            batch_embeddings = [
                record["embedding"] for record in response["data"]
            ]
            embeddings.extend(batch_embeddings)
        return embeddings

    def embed_query(self, text: str) -> List[float]:
        """Embed a single query string."""
        return self.embed_documents([text])[0]
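Remote embedding APIs are usually rate-limited, so when several batch requests are issued at once it helps to cap concurrency. The sketch below uses asyncio with a semaphore; it assumes an `embeddings` object exposing the embed_documents method above, and the batch size and concurrency limit are arbitrary.

import asyncio

async def embed_batches_limited(embeddings, texts, batch_size=256, max_concurrency=4):
    """Embed batches concurrently, but never more than max_concurrency at a time."""
    sem = asyncio.Semaphore(max_concurrency)
    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]

    async def embed_one(batch):
        async with sem:
            # Run the blocking API call in a worker thread
            return await asyncio.to_thread(embeddings.embed_documents, batch)

    # gather preserves batch order, so the vectors stay aligned with the texts
    results = await asyncio.gather(*(embed_one(b) for b in batches))
    return [vec for batch_vecs in results for vec in batch_vecs]

# vectors = asyncio.run(embed_batches_limited(OpenAIEmbeddings(), ["a", "b", "c"]))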
4.3 Parallel Embedding
For embedding backends that can run in parallel, the batches themselves can be processed concurrently:

from concurrent.futures import ProcessPoolExecutor

class ParallelEmbeddings(Embeddings):
    """Parallel embedding wrapper."""

    def __init__(
        self,
        embeddings: Embeddings,
        workers: int = 4,
    ):
        """Initialize the parallel embedding wrapper."""
        self.embeddings = embeddings
        self.workers = workers

    def embed_query(self, text: str) -> List[float]:
        return self.embeddings.embed_query(text)

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        """Embed documents in parallel."""
        # Split the texts into one batch per worker
        batch_size = max(1, (len(texts) + self.workers - 1) // self.workers)
        batches = [
            texts[i:i + batch_size]
            for i in range(0, len(texts), batch_size)
        ]
        all_embeddings = []
        # Use a process pool to embed the batches in parallel
        # (the wrapped embeddings object must be picklable; for API-bound
        # backends a ThreadPoolExecutor is usually sufficient)
        with ProcessPoolExecutor(max_workers=self.workers) as executor:
            # Submit one embedding task per batch
            futures = [
                executor.submit(self.embeddings.embed_documents, batch)
                for batch in batches
            ]
            # Collect results in submission order so vectors stay aligned with texts
            for future in futures:
                all_embeddings.extend(future.result())
        return all_embeddings
5. Batching and Parallelizing Vector Storage
5.1 The Vector Store Interface
The vector store is responsible for efficiently storing and retrieving vector data:

class VectorStore(ABC):
    """Base interface for vector stores."""

    @abstractmethod
    def add_texts(
        self,
        texts: Iterable[str],
        metadatas: Optional[List[dict]] = None,
        embedding: Optional[Embeddings] = None,
        **kwargs: Any,
    ) -> List[str]:
        """Add texts and their vector representations to the store."""
        pass

    @abstractmethod
    def similarity_search(
        self, query: str, k: int = 4, **kwargs: Any
    ) -> List[Document]:
        """Retrieve documents by similarity."""
        pass
5.2 Batched Vector Insertion
Most vector databases support bulk inserts:

import uuid
import numpy as np

class FAISS(VectorStore):
    """FAISS vector store implementation (simplified)."""

    def add_embeddings(
        self,
        texts: List[str],
        embeddings: List[List[float]],
        metadatas: Optional[List[dict]] = None,
        **kwargs: Any,
    ) -> List[str]:
        """Add vectors and their metadata in bulk."""
        # Generate unique IDs
        ids = [str(uuid.uuid4()) for _ in range(len(texts))]
        # Default to empty metadata when none is supplied
        metadatas = metadatas or [{} for _ in texts]
        # Build documents and vectors
        documents = []
        vectors = []
        for text, embedding, metadata in zip(texts, embeddings, metadatas):
            documents.append(Document(page_content=text, metadata=metadata))
            vectors.append(np.array(embedding, dtype=np.float32))
        # Stack into a single numpy array
        vector_array = np.vstack(vectors)
        # Add all vectors to the FAISS index in one call
        self.index.add(vector_array)
        # Save the documents keyed by ID
        self.docstore.add({id_: doc for id_, doc in zip(ids, documents)})
        return ids
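For reference, here is the same batched insert against the faiss library directly, without the LangChain wrapper. It assumes faiss-cpu (or faiss-gpu) is installed and uses a flat L2 index with random vectors as stand-in data.

import faiss
import numpy as np

dim = 1536                               # e.g. the OpenAI ada-002 embedding size
index = faiss.IndexFlatL2(dim)           # exact L2 index

vectors = np.random.rand(10_000, dim).astype(np.float32)
for i in range(0, len(vectors), 1_000):  # insert in batches of 1000
    index.add(vectors[i:i + 1_000])

query = np.random.rand(1, dim).astype(np.float32)
distances, indices = index.search(query, k=4)   # batched search works the same way
print(index.ntotal, indices[0])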
5.3 Parallel Retrieval
At query time, parallelism speeds up retrieving results for many queries at once:

class ParallelVectorStore:
    """Wrapper that runs similarity searches for multiple queries in parallel.
    (A plain wrapper rather than a VectorStore subclass, so the abstract
    methods of the base class do not have to be re-implemented.)"""

    def __init__(
        self,
        vectorstore: VectorStore,
        workers: int = 4,
    ):
        """Initialize the parallel wrapper."""
        self.vectorstore = vectorstore
        # Kept for API symmetry; asyncio.to_thread uses the default thread pool
        self.workers = workers

    async def asimilarity_search(
        self, queries: List[str], k: int = 4, **kwargs: Any
    ) -> List[List[Document]]:
        """Run similarity searches for several queries concurrently."""
        # One task per query; each blocking search runs in a worker thread
        tasks = [
            asyncio.to_thread(
                self.vectorstore.similarity_search,
                query,
                k=k,
                **kwargs
            )
            for query in queries
        ]
        # Execute all tasks concurrently; results come back in query order
        return await asyncio.gather(*tasks)
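Usage sketch, assuming `vectorstore` is any object with a similarity_search method (for example the FAISS store built earlier); the query strings are placeholders.

parallel_store = ParallelVectorStore(vectorstore, workers=4)
results_per_query = asyncio.run(
    parallel_store.asimilarity_search(
        ["contract termination clauses", "data retention policy"], k=4
    )
)
for query_results in results_per_query:
    print(len(query_results), "hits")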
6. Batching and Parallelizing the Processing Pipeline
6.1 The Basic Pipeline
The individual processing components can be composed in a pipeline pattern:

class DocumentProcessorPipeline:
    """Document processing pipeline."""

    def __init__(
        self,
        loader: BaseLoader,
        splitter: TextSplitter,
        embeddings: Embeddings,
        vectorstore: VectorStore,
    ):
        """Initialize the pipeline."""
        self.loader = loader
        self.splitter = splitter
        self.embeddings = embeddings
        self.vectorstore = vectorstore

    def process(self) -> VectorStore:
        """Run the full document-processing flow."""
        # 1. Load documents
        documents = self.loader.load()
        # 2. Split documents
        split_docs = self.splitter.split_documents(documents)
        # 3. Embed
        texts = [doc.page_content for doc in split_docs]
        embeddings = self.embeddings.embed_documents(texts)
        # 4. Store vectors
        self.vectorstore.add_embeddings(
            texts=texts,
            embeddings=embeddings,
            metadatas=[doc.metadata for doc in split_docs],
        )
        return self.vectorstore
6.2 A Batched Pipeline
Batching optimizes the entire flow:

class BatchDocumentProcessorPipeline:
    """Batched document-processing pipeline."""

    def __init__(
        self,
        loader: BaseLoader,
        splitter: TextSplitter,
        embeddings: Embeddings,
        vectorstore: VectorStore,
        batch_size: int = 50,
        workers: int = 4,
    ):
        """Initialize the batched pipeline."""
        self.loader = loader
        self.splitter = splitter
        self.embeddings = embeddings
        self.vectorstore = vectorstore
        self.batch_size = batch_size
        self.workers = workers

    def process(self) -> VectorStore:
        """Process documents batch by batch."""
        # 1. Load documents
        documents = self.loader.load()
        # 2. Process in batches
        for i in range(0, len(documents), self.batch_size):
            batch_docs = documents[i:i+self.batch_size]
            # 2.1 Split the batch (in parallel)
            split_docs = split_documents_parallel(
                self.splitter, batch_docs, num_workers=self.workers
            )
            # 2.2 Embed (batched)
            texts = [doc.page_content for doc in split_docs]
            batch_embeddings = self.embeddings.embed_documents(texts)
            # 2.3 Store vectors
            self.vectorstore.add_embeddings(
                texts=texts,
                embeddings=batch_embeddings,
                metadatas=[doc.metadata for doc in split_docs],
            )
        return self.vectorstore
6.3 Streaming
For very large document collections, streaming is more efficient:

class StreamDocumentProcessor:
    """Streaming document processor."""

    def __init__(
        self,
        loader: BaseLoader,
        splitter: TextSplitter,
        embeddings: Embeddings,
        vectorstore: VectorStore,
        buffer_size: int = 100,
    ):
        """Initialize the streaming processor."""
        self.loader = loader
        self.splitter = splitter
        self.embeddings = embeddings
        self.vectorstore = vectorstore
        self.buffer_size = buffer_size

    def process(self) -> VectorStore:
        """Process documents as a stream."""
        # Iterate over documents lazily (the loader must expose lazy_load)
        doc_iterator = self.loader.lazy_load()
        # Initialize the buffer
        buffer = []
        for doc in doc_iterator:
            # Add the document to the buffer
            buffer.append(doc)
            # Flush the buffer whenever it is full
            if len(buffer) >= self.buffer_size:
                self._process_batch(buffer)
                buffer = []
        # Process whatever is left in the buffer
        if buffer:
            self._process_batch(buffer)
        return self.vectorstore

    def _process_batch(self, documents: List[Document]) -> None:
        """Process one batch of documents."""
        # Split
        split_docs = self.splitter.split_documents(documents)
        # Embed
        texts = [doc.page_content for doc in split_docs]
        embeddings = self.embeddings.embed_documents(texts)
        # Store vectors
        self.vectorstore.add_embeddings(
            texts=texts,
            embeddings=embeddings,
            metadatas=[doc.metadata for doc in split_docs],
        )
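The streaming processor relies on the loader exposing a lazy_load generator, which the BaseLoader sketched in section 1.1 does not define. A minimal directory-walking lazy loader could look like this; it is an illustrative sketch, not part of the pipeline above.

import os
from typing import Iterator

class LazyDirectoryLoader(BaseLoader):
    """Yield one Document per text file instead of materializing the whole list."""

    def __init__(self, directory: str, encoding: str = "utf-8"):
        self.directory = directory
        self.encoding = encoding

    def lazy_load(self) -> Iterator[Document]:
        for root, _, files in os.walk(self.directory):
            for name in files:
                path = os.path.join(root, name)
                with open(path, "r", encoding=self.encoding, errors="ignore") as f:
                    yield Document(page_content=f.read(), metadata={"source": path})

    def load(self) -> List[Document]:
        # Eager variant, for callers that still expect the full list
        return list(self.lazy_load())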
7. Performance Monitoring and Tuning
7.1 Performance Metrics for Batching and Parallelism
The key metrics are (a small measurement helper follows this list):
- Throughput: number of documents processed per unit of time
- Latency: time from the start to the end of processing a single document
- Resource utilization: how efficiently CPU, memory, GPU, and other resources are used
- Speedup: how many times faster parallel processing is compared to sequential processing
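A small helper for measuring throughput and speedup. `sequential_fn` and `parallel_fn` stand for any two implementations of the same work (for example load_documents_sequential versus a ThreadedLoader run) and are placeholders.

import time

def measure(fn, n_items):
    """Return (duration_seconds, throughput_items_per_second) for one run of fn."""
    start = time.perf_counter()
    fn()
    duration = time.perf_counter() - start
    return duration, n_items / duration

def report_speedup(sequential_fn, parallel_fn, n_items):
    seq_time, seq_tp = measure(sequential_fn, n_items)
    par_time, par_tp = measure(parallel_fn, n_items)
    print(f"sequential: {seq_time:.2f}s ({seq_tp:.1f} docs/s)")
    print(f"parallel:   {par_time:.2f}s ({par_tp:.1f} docs/s)")
    print(f"speedup:    {seq_time / par_time:.2f}x")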
7.2 Monitoring Tools
LangChain provides several hooks for performance monitoring:
- The callback system

class BaseCallbackHandler(ABC):
    """Base class for callback handlers."""

    def on_llm_start(
        self, serialized: Dict[str, Any], prompts: List[str], **kwargs: Any
    ) -> Any:
        """Called when an LLM starts running."""
        pass

    def on_llm_end(self, response: LLMResult, **kwargs: Any) -> Any:
        """Called when an LLM finishes running."""
        pass

    def on_chain_start(
        self, serialized: Dict[str, Any], inputs: Dict[str, Any], **kwargs: Any
    ) -> Any:
        """Called when a chain starts running."""
        pass

    def on_chain_end(self, outputs: Dict[str, Any], **kwargs: Any) -> Any:
        """Called when a chain finishes running."""
        pass

    def on_tool_start(
        self, serialized: Dict[str, Any], input_str: str, **kwargs: Any
    ) -> Any:
        """Called when a tool starts running."""
        pass

    def on_tool_end(self, output: str, **kwargs: Any) -> Any:
        """Called when a tool finishes running."""
        pass
- A performance recorder

import time

class PerformanceCallbackHandler(BaseCallbackHandler):
    """Callback handler that records timing data."""

    def __init__(self):
        """Initialize the performance callback handler."""
        self.llm_start_times = {}
        self.chain_start_times = {}
        self.tool_start_times = {}
        self.performance_data = {
            "llm": [],
            "chain": [],
            "tool": []
        }

    def on_llm_start(self, serialized: Dict[str, Any], prompts: List[str], **kwargs: Any) -> None:
        """Record the LLM start time."""
        run_id = kwargs.get("run_id")
        if run_id:
            self.llm_start_times[run_id] = time.time()

    def on_llm_end(self, response: LLMResult, **kwargs: Any) -> None:
        """Record the LLM end time and compute the duration."""
        run_id = kwargs.get("run_id")
        if run_id and run_id in self.llm_start_times:
            duration = time.time() - self.llm_start_times[run_id]
            self.performance_data["llm"].append({
                "run_id": str(run_id),
                "duration": duration,
                "tokens": (response.llm_output or {}).get("token_usage", {})
            })
            del self.llm_start_times[run_id]

    # The chain and tool callbacks are implemented the same way...
7.3 Tuning Strategies
- Batch-size tuning

def optimize_batch_size(
    processor: BatchDocumentProcessorPipeline,
    min_size: int = 10,
    max_size: int = 1000,
    step: int = 10,
    samples: int = 5
) -> int:
    """Search for the batch size with the best throughput (each trial re-runs
    the full pipeline, so measure against a throwaway vector store)."""
    best_size = min_size
    best_throughput = 0
    num_docs = len(processor.loader.load())  # document count, measured once
    for size in range(min_size, max_size + 1, step):
        # Temporarily set the batch size
        processor.batch_size = size
        # Run several samples
        throughputs = []
        for _ in range(samples):
            start_time = time.time()
            processor.process()
            duration = time.time() - start_time
            throughputs.append(num_docs / duration)
        # Average throughput for this batch size
        avg_throughput = sum(throughputs) / samples
        # Track the best batch size
        if avg_throughput > best_throughput:
            best_throughput = avg_throughput
            best_size = size
    return best_size
- Parallelism tuning

def optimize_parallel_workers(
    processor: BatchDocumentProcessorPipeline,
    min_workers: int = 1,
    max_workers: int = 16,
    samples: int = 5
) -> int:
    """Search for the worker count with the best throughput."""
    best_workers = min_workers
    best_throughput = 0
    num_docs = len(processor.loader.load())  # document count, measured once
    for workers in range(min_workers, max_workers + 1):
        # Temporarily set the number of workers
        processor.workers = workers
        # Run several samples
        throughputs = []
        for _ in range(samples):
            start_time = time.time()
            processor.process()
            duration = time.time() - start_time
            throughputs.append(num_docs / duration)
        # Average throughput for this worker count
        avg_throughput = sum(throughputs) / samples
        # Track the best worker count
        if avg_throughput > best_throughput:
            best_throughput = avg_throughput
            best_workers = workers
    return best_workers
8. Error Handling and Recovery
8.1 Error Handling in Batch Processing
During batch processing, a single failed document should not take down the whole batch:

class FaultTolerantBatchProcessor:
    """Fault-tolerant batch processor."""

    def __init__(
        self,
        processor: BatchDocumentProcessorPipeline,
        retry_attempts: int = 3,
        report_errors: bool = True,
    ):
        """Initialize the fault-tolerant processor."""
        self.processor = processor
        self.retry_attempts = retry_attempts
        self.report_errors = report_errors
        self.failed_docs = []

    def process(self) -> VectorStore:
        """Process batches with per-document fault tolerance."""
        # Load documents
        documents = self.processor.loader.load()
        # Process in batches
        for i in range(0, len(documents), self.processor.batch_size):
            batch = documents[i:i+self.processor.batch_size]
            processed_docs = []
            # Process each document, recording failures
            for doc in batch:
                success = False
                attempts = 0
                while not success and attempts < self.retry_attempts:
                    try:
                        # Process a single document
                        processed = self._process_single_doc(doc)
                        processed_docs.append(processed)
                        success = True
                    except Exception as e:
                        attempts += 1
                        if attempts >= self.retry_attempts:
                            if self.report_errors:
                                logger.error(f"Failed to process document (attempt {attempts}): {e}")
                            self.failed_docs.append({
                                "document": doc,
                                "error": str(e)
                            })
            # Continue with the documents that succeeded
            if processed_docs:
                self._process_successful_batch(processed_docs)
        return self.processor.vectorstore

    def _process_single_doc(self, doc: Document) -> Document:
        """Process a single document."""
        # Split the document
        split_docs = self.processor.splitter.split_documents([doc])
        # For simplicity, assume a single chunk per document
        return split_docs[0] if split_docs else doc

    def _process_successful_batch(self, docs: List[Document]) -> None:
        """Embed and store a batch of successfully processed documents."""
        # Embed
        texts = [doc.page_content for doc in docs]
        embeddings = self.processor.embeddings.embed_documents(texts)
        # Store vectors
        self.processor.vectorstore.add_embeddings(
            texts=texts,
            embeddings=embeddings,
            metadatas=[doc.metadata for doc in docs],
        )
8.2 Error Isolation in Parallel Processing
In parallel processing, errors must be isolated so they cannot take down the whole system:

from typing import Tuple

def process_with_error_isolation(
    func: Callable,
    items: List[Any],
    workers: int = 4
) -> Tuple[List[Any], List[Tuple[Any, Exception]]]:
    """Parallel map with error isolation: returns (results, errors)."""
    results = []
    errors = []
    with ThreadPoolExecutor(max_workers=workers) as executor:
        # Submit all tasks
        future_to_item = {
            executor.submit(func, item): item for item in items
        }
        # Collect results
        for future in as_completed(future_to_item):
            item = future_to_item[future]
            try:
                result = future.result()
                results.append(result)
            except Exception as e:
                errors.append((item, e))
    return results, errors
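Usage sketch: splitting one document per task with the splitter from section 3, so that any document that raises is reported instead of aborting the run. `splitter` and `documents` stand for whatever splitter and loaded documents you built earlier.

chunk_lists, failures = process_with_error_isolation(
    func=lambda doc: splitter.split_documents([doc]),
    items=documents,
    workers=8,
)
all_chunks = [chunk for chunks in chunk_lists for chunk in chunks]
for doc, exc in failures:
    logger.warning(f"Skipped {doc.metadata.get('source', '<unknown>')}: {exc}")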
8.3 Checkpointing and Resumption
To let large processing jobs resume after an interruption:

import hashlib
import json

class ResumableDocumentProcessor:
    """Document processor that can resume from a checkpoint."""

    def __init__(
        self,
        processor: BatchDocumentProcessorPipeline,
        checkpoint_dir: str = ".langchain_checkpoints",
    ):
        """Initialize the resumable processor."""
        self.processor = processor
        self.checkpoint_dir = checkpoint_dir
        self.checkpoint_file = os.path.join(checkpoint_dir, "checkpoint.json")
        self.completed_docs = set()
        # Load any existing checkpoint
        self._load_checkpoint()

    def process(self) -> VectorStore:
        """Process documents with checkpointing."""
        # Make sure the checkpoint directory exists
        os.makedirs(self.checkpoint_dir, exist_ok=True)
        # Load documents
        documents = self.processor.loader.load()
        # Skip documents that were already completed
        remaining_docs = [
            doc for doc in documents
            if self._get_doc_id(doc) not in self.completed_docs
        ]
        # Process the remaining documents in batches
        for i in range(0, len(remaining_docs), self.processor.batch_size):
            batch = remaining_docs[i:i+self.processor.batch_size]
            try:
                # Process the batch
                self._process_batch(batch)
                # Mark the batch as completed
                for doc in batch:
                    self.completed_docs.add(self._get_doc_id(doc))
                # Persist the checkpoint
                self._save_checkpoint()
            except Exception as e:
                logger.error(f"Batch failed: {e}")
                # Either retry here or abort and resume later
        return self.processor.vectorstore

    def _process_batch(self, batch: List[Document]) -> None:
        """Process one batch of documents."""
        # Split
        split_docs = self.processor.splitter.split_documents(batch)
        # Embed
        texts = [doc.page_content for doc in split_docs]
        embeddings = self.processor.embeddings.embed_documents(texts)
        # Store vectors
        self.processor.vectorstore.add_embeddings(
            texts=texts,
            embeddings=embeddings,
            metadatas=[doc.metadata for doc in split_docs],
        )

    def _get_doc_id(self, doc: Document) -> str:
        """Derive a stable, unique ID for a document."""
        # Hash the content and metadata
        content_hash = hashlib.sha256(doc.page_content.encode()).hexdigest()
        metadata_hash = hashlib.sha256(str(doc.metadata).encode()).hexdigest()
        return f"{content_hash}_{metadata_hash}"

    def _load_checkpoint(self) -> None:
        """Load the checkpoint file if it exists."""
        if os.path.exists(self.checkpoint_file):
            try:
                with open(self.checkpoint_file, 'r') as f:
                    checkpoint_data = json.load(f)
                self.completed_docs = set(checkpoint_data.get('completed_docs', []))
            except Exception as e:
                logger.warning(f"Failed to load checkpoint: {e}")

    def _save_checkpoint(self) -> None:
        """Persist the checkpoint."""
        checkpoint_data = {
            'completed_docs': list(self.completed_docs),
            'timestamp': time.time()
        }
        with open(self.checkpoint_file, 'w') as f:
            json.dump(checkpoint_data, f)
9. Integrating with Cloud Services
9.1 A Distributed Processing Architecture
LangChain can be combined with cloud services for distributed processing:

class DistributedDocumentProcessor:
    """Distributed document processor."""

    def __init__(
        self,
        processor: DocumentProcessorPipeline,
        cloud_provider: str = "aws",
        num_workers: int = 10,
    ):
        """Initialize the distributed processor."""
        self.processor = processor
        self.cloud_provider = cloud_provider
        self.num_workers = num_workers
        # Initialize the distributed environment for the chosen provider
        if cloud_provider == "aws":
            self._init_aws_environment()
        elif cloud_provider == "gcp":
            self._init_gcp_environment()
        elif cloud_provider == "azure":
            self._init_azure_environment()
        else:
            raise ValueError(f"Unsupported cloud provider: {cloud_provider}")

    def _init_aws_environment(self) -> None:
        """Initialize the AWS environment."""
        import boto3
        # S3 client for storing intermediate results
        self.s3_client = boto3.client('s3')
        # AWS Batch client for distributed jobs
        self.batch_client = boto3.client('batch')
        # Further initialization...

    def process(self) -> VectorStore:
        """Process documents in a distributed fashion (the helper methods
        referenced below are provider-specific and omitted here)."""
        # 1. Prepare input data and upload it to cloud storage
        input_bucket, input_prefix = self._prepare_input_data()
        # 2. Create the distributed job definition and queue
        job_definition = self._create_job_definition()
        job_queue = self._get_job_queue()
        # 3. Split the documents and submit one job per batch
        documents = self.processor.loader.load()
        batch_size = max(1, len(documents) // self.num_workers)
        for i in range(0, len(documents), batch_size):
            batch = documents[i:i+batch_size]
            batch_id = f"batch_{i//batch_size}"
            # Upload the batch to cloud storage
            batch_s3_path = self._upload_batch_to_s3(batch, input_bucket, input_prefix, batch_id)
            # Submit the processing job
            self._submit_batch_job(
                job_definition=job_definition,
                job_queue=job_queue,
                batch_id=batch_id,
                input_path=batch_s3_path
            )
        # 4. Wait for all jobs to finish
        self._wait_for_jobs_completion()
        # 5. Merge the results
        output_bucket, output_prefix = self._get_output_location()
        self._merge_results(output_bucket, output_prefix)
        # 6. Load the results into the vector store
        self._load_results_to_vectorstore(output_bucket, output_prefix)
        return self.processor.vectorstore
9.2 Provider-Specific Optimizations
Optimization strategies for specific cloud services:
- Bulk loading from AWS S3

import tempfile
import boto3

class S3BatchLoader(BaseLoader):
    """Loader that bulk-loads documents from S3."""

    def __init__(
        self,
        bucket: str,
        prefix: str = "",
        s3_client: Optional[Any] = None,
    ):
        """Initialize the S3 batch loader."""
        self.bucket = bucket
        self.prefix = prefix
        self.s3_client = s3_client or boto3.client('s3')

    def load(self) -> List[Document]:
        """Load documents from S3 in bulk."""
        documents = []
        # List the S3 objects page by page
        paginator = self.s3_client.get_paginator('list_objects_v2')
        pages = paginator.paginate(Bucket=self.bucket, Prefix=self.prefix)
        for page in pages:
            for obj in page.get('Contents', []):
                key = obj['Key']
                # Skip "directory" placeholders
                if key.endswith('/'):
                    continue
                suffix = os.path.splitext(key)[1].lower()
                if suffix not in ('.pdf', '.txt'):
                    logger.warning(f"Unsupported file type: {key}")
                    continue
                # Download the object to a temporary file, since the file
                # loaders expect a local path rather than an s3:// URI
                try:
                    with tempfile.NamedTemporaryFile(suffix=suffix) as tmp:
                        self.s3_client.download_file(self.bucket, key, tmp.name)
                        loader = PyPDFLoader(tmp.name) if suffix == '.pdf' else TextLoader(tmp.name)
                        docs = loader.load()
                    # Keep the original S3 location as the source
                    for doc in docs:
                        doc.metadata["source"] = f"s3://{self.bucket}/{key}"
                    documents.extend(docs)
                except Exception as e:
                    logger.error(f"Failed to load S3 object ({key}): {e}")
        return documents
- GCP Dataflow integration

class DataflowDocumentProcessor:
    """Distributed document processing with GCP Dataflow."""

    def __init__(
        self,
        processor: DocumentProcessorPipeline,
        project_id: str,
        region: str = "us-central1",
    ):
        """Initialize the Dataflow processor."""
        self.processor = processor
        self.project_id = project_id
        self.region = region

    def process(self) -> VectorStore:
        """Process documents with a Dataflow pipeline (SplitDocumentsDoFn and
        VectorizeDocumentsDoFn are custom DoFns, not shown here)."""
        import apache_beam as beam
        from apache_beam.options.pipeline_options import PipelineOptions
        # Build the Dataflow pipeline options
        options = PipelineOptions(
            runner='DataflowRunner',
            project=self.project_id,
            region=self.region,
            temp_location=f'gs://{self.project_id}/temp',
            staging_location=f'gs://{self.project_id}/staging',
        )
        with beam.Pipeline(options=options) as p:
            # 1. Read documents from the source
            documents = p | 'ReadDocuments' >> self._read_documents()
            # 2. Split documents
            split_docs = documents | 'SplitDocuments' >> beam.ParDo(
                SplitDocumentsDoFn(self.processor.splitter)
            )
            # 3. Embed
            vectorized_docs = split_docs | 'VectorizeDocuments' >> beam.ParDo(
                VectorizeDocumentsDoFn(self.processor.embeddings)
            )
            # 4. Write the results to GCS
            vectorized_docs | 'WriteToGCS' >> beam.io.WriteToText(
                f'gs://{self.project_id}/output/documents',
                num_shards=10
            )
        # 5. Load the results from GCS into the vector store
        self._load_results_from_gcs(f'gs://{self.project_id}/output')
        return self.processor.vectorstore

    def _read_documents(self):
        """Build the PTransform that reads documents from the configured source."""
        import apache_beam as beam
        # Return a different PTransform depending on the loader type
        if isinstance(self.processor.loader, DirectoryLoader):
            return beam.io.ReadFromText(self.processor.loader.path)
        elif isinstance(self.processor.loader, S3BatchLoader):
            return beam.io.ReadFromText(f"s3://{self.processor.loader.bucket}/{self.processor.loader.prefix}")
        else:
            raise ValueError(f"Unsupported loader type: {type(self.processor.loader)}")
10. Application Case Studies
10.1 Building an Enterprise Knowledge Base
When building an enterprise knowledge base, batching and parallelism are essential:

def build_enterprise_knowledge_base(
    document_dirs: List[str],
    vectorstore: VectorStore,
    workers: int = 8,
    batch_size: int = 100
) -> VectorStore:
    """Build an enterprise knowledge base."""
    # Create one loader per document directory
    loaders = []
    for dir_path in document_dirs:
        loaders.append(DirectoryLoader(dir_path))
    # Combine the loaders (CombineLoader is assumed to concatenate their output)
    combined_loader = CombineLoader(loaders)
    # Create the text splitter
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=4000,
        chunk_overlap=200
    )
    # Create the embedding model
    embeddings = OpenAIEmbeddings()
    # Create the batched pipeline
    pipeline = BatchDocumentProcessorPipeline(
        loader=combined_loader,
        splitter=splitter,
        embeddings=embeddings,
        vectorstore=vectorstore,
        batch_size=batch_size,
        workers=workers
    )
    # Run the batched pipeline
    return pipeline.process()
10.2 A Legal Document Analysis System
Efficiently processing a large corpus is central to a legal document analysis system:

class LegalDocumentAnalyzer:
    """Legal document analysis system."""

    def __init__(
        self,
        document_dir: str,
        workers: int = 16,
        batch_size: int = 50
    ):
        """Initialize the legal document analysis system."""
        # Create the vector store
        self.vectorstore = Chroma(embedding_function=OpenAIEmbeddings())
        self._processed = False
        # Create the batched pipeline
        self.processor = BatchDocumentProcessorPipeline(
            loader=DirectoryLoader(document_dir),
            splitter=RecursiveCharacterTextSplitter(
                chunk_size=2000,
                chunk_overlap=200
            ),
            embeddings=OpenAIEmbeddings(),
            vectorstore=self.vectorstore,
            batch_size=batch_size,
            workers=workers
        )

    def analyze(self, query: str, k: int = 10) -> List[Document]:
        """Analyze the legal corpus and return relevant, annotated passages."""
        # Process the documents first if that has not happened yet
        if not self._processed:
            self.processor.process()
            self._processed = True
        # Similarity search
        results = self.vectorstore.similarity_search(query, k=k)
        # Analyze the retrieved passages further
        analyzed_results = self._analyze_results(results, query)
        return analyzed_results

    def _analyze_results(self, results: List[Document], query: str) -> List[Document]:
        """Run an LLM analysis over the retrieved passages."""
        # LLM used for the analysis
        llm = OpenAI(temperature=0)
        # Analyze each result
        analysis_results = []
        for doc in results:
            # Build the prompt
            prompt = f"""
            You are a professional legal analyst. Analyze the following excerpt
            from a legal document and answer the user's question: {query}
            Excerpt:
            {doc.page_content}
            Provide your analysis:
            """
            # Get the LLM's analysis
            analysis = llm(prompt)
            # Attach it to the document
            doc.metadata["analysis"] = analysis
            analysis_results.append(doc)
        return analysis_results
10.3 An Academic Research Assistant
For an academic research assistant, quickly processing large numbers of papers is the core requirement:

class AcademicResearchAssistant:
    """Academic research assistant."""

    def __init__(
        self,
        papers_dir: str,
        workers: int = 8,
        batch_size: int = 20
    ):
        """Initialize the academic research assistant."""
        # Create the vector store (constructing FAISS directly like this is a
        # simplification; in practice it is usually built via FAISS.from_documents)
        self.vectorstore = FAISS(embedding_function=HuggingFaceEmbeddings())
        self._processed = False
        # Create the batched pipeline
        self.processor = BatchDocumentProcessorPipeline(
            loader=DirectoryLoader(papers_dir),
            splitter=RecursiveCharacterTextSplitter(
                chunk_size=3000,
                chunk_overlap=300
            ),
            embeddings=HuggingFaceEmbeddings(),
            vectorstore=self.vectorstore,
            batch_size=batch_size,
            workers=workers
        )

    async def summarize_research(self, topic: str, limit: int = 5) -> str:
        """Summarize the research on a given topic."""
        # Process the documents first if that has not happened yet
        if not self._processed:
            self.processor.process()
            self._processed = True
        # Retrieve the related papers
        related_docs = self.vectorstore.similarity_search(topic, k=limit)
        llm = OpenAI(temperature=0)

        # Summarize one paper asynchronously
        async def summarize_doc(doc: Document) -> str:
            prompt = f"""
            Summarize the following excerpt from an academic paper, focusing on
            findings related to "{topic}":
            {doc.page_content}
            Concisely summarize the core ideas, methods, and conclusions:
            """
            # Run the blocking LLM call in a worker thread so summaries run in parallel
            return await asyncio.to_thread(llm, prompt)

        # Summarize all papers concurrently
        tasks = [summarize_doc(doc) for doc in related_docs]
        summaries = await asyncio.gather(*tasks)
        # Combine the individual summaries
        combined_summary = "\n\n".join([
            f"Paper summary {i+1}:\n{summary}"
            for i, summary in enumerate(summaries)
        ])
        # Produce an overall overview
        overview_prompt = f"""
        Based on the following summaries of academic papers about "{topic}",
        write an overall overview:
        {combined_summary}
        Cover the main research directions, key findings, and future directions:
        """
        return await asyncio.to_thread(llm, overview_prompt)
11. Challenges and Future Directions
11.1 Current Challenges
Although batching and parallelization bring substantial performance gains, several challenges remain:
- Memory management: when processing documents at scale, memory can become the bottleneck; processing several large documents in parallel can lead to out-of-memory failures.
- Scheduling complexity: as parallelism increases, task scheduling gets harder; allocating resources and balancing load become the key problems.
- Error propagation: in a highly parallel system, a failure in one component can spread quickly and affect the whole processing flow.
- Result consistency: batching and parallelism can reorder results, so extra mechanisms are needed to keep the output logically consistent.
11.2 Technology Trends
Batching and parallelization of document processing are likely to evolve in the following directions:
- AI-optimized execution plans: use AI to analyze document characteristics and processing requirements and automatically generate optimal batching and parallelization plans.
- Heterogeneous computing: combine the strengths of CPUs, GPUs, TPUs, and other accelerators for more efficient mixed parallel computation.
- Streaming and real-time analysis: integrate stream-processing techniques to process and analyze documents in real time for latency-sensitive applications.
- Federated learning and privacy: enable cross-organization, cross-platform collaborative document processing and analysis while preserving data privacy.
11.3 Best-Practice Recommendations
Based on the preceding analysis, the following best practices are recommended:
- Choose the parallel strategy by workload type: prefer multithreading for I/O-bound tasks; consider multiprocessing or GPU acceleration for CPU-bound tasks.
- Tune the batch size: determine the best batch size through performance testing, balancing throughput against memory usage.
- Build in fault tolerance: add error capture, retries, and checkpoint/resume to batch and parallel processing to keep it reliable.
- Monitor and tune continuously: set up performance monitoring, watch the processing in real time, and adjust parallelism and batch size based on what you observe.
- Scale out with cloud services: for very large workloads, use the elastic compute of cloud services for distributed processing.
Following these practices lets you take full advantage of batching and parallelization in LangChain document processing, handle large document collections efficiently, and provide a solid foundation for a wide range of applications.