Understanding Onyx Document Processing: Text Extraction with the Unstructured Library and Smart Chunking in Practice

danswer: Ask Questions in natural language and get Answers backed by private sources. Connects to tools like Slack, GitHub, Confluence, etc. Project URL: https://gitcode.com/GitHub_Trending/da/danswer

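Are you still struggling with documents in many different formats? Given PDF, DOCX, PPTX, and other file types, how do you extract text efficiently and split it into sensible chunks? This article walks through the core techniques behind Onyx document processing: text extraction through the Unstructured library, combined with a chunking strategy that makes downstream processing more effective. After reading it, you will understand:

The basic Onyx document processing flow
How text is extracted from multiple formats with the Unstructured library
How smart chunking is implemented and tuned
Worked cases and application scenarios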

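Onyx Document Processing Overview

Onyx (the project previously published as danswer) provides document processing for text pulled from many sources, such as Slack, GitHub, and Confluence. Its core functions are file type detection, text extraction, and content chunking, which together supply clean input for the downstream natural language processing and question answering system.

Document processing in Onyx relies on two key modules:

Text extraction: built on the Unstructured library, with support for many file formats
Smart chunking: uses SentenceChunker to split text along sentence boundaries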

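Text Extraction with the Unstructured Library

The Unstructured library offers a single interface for handling many kinds of unstructured data. Onyx wraps it for multi-format text extraction: when an Unstructured API key is configured, extraction is attempted through Unstructured first, and Onyx's own per-format extractors serve as the fallback.

Supported File Formats

Onyx can extract text from the following formats:

Text files: .txt, .md, .csv, .json, and others
Office documents: .docx, .pptx, .xlsx
PDF files: including decryption of password-protected PDFs before extraction
Email: .eml
Web content: .html
E-books: .epub

The exact list of supported file extensions is defined in extract_file_text.py: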

ACCEPTED_PLAIN_TEXT_FILE_EXTENSIONS = [
    ".txt", ".md", ".mdx", ".conf", ".log", ".json", 
    ".csv", ".tsv", ".xml", ".yml", ".yaml", ".sql"
]

ACCEPTED_DOCUMENT_FILE_EXTENSIONS = [
    ".pdf", ".docx", ".pptx", ".xlsx", ".eml", ".epub", ".html"
]
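
To make the role of these two lists concrete, here is a minimal sketch of how a file name could be classified against them. The helper `classify_extension` is hypothetical and written only for this illustration; it is not a function from the Onyx code base, which uses its own helpers such as `get_file_ext` and `is_accepted_file_ext`.

```python
from pathlib import Path

# Copied from the lists shown above.
ACCEPTED_PLAIN_TEXT_FILE_EXTENSIONS = [
    ".txt", ".md", ".mdx", ".conf", ".log", ".json",
    ".csv", ".tsv", ".xml", ".yml", ".yaml", ".sql",
]
ACCEPTED_DOCUMENT_FILE_EXTENSIONS = [
    ".pdf", ".docx", ".pptx", ".xlsx", ".eml", ".epub", ".html",
]


def classify_extension(file_name: str) -> str:
    """Hypothetical helper: return 'plain', 'document', or 'unknown'."""
    # Lower-case the suffix so "Report.PDF" and "report.pdf" match alike.
    ext = Path(file_name).suffix.lower()
    if ext in ACCEPTED_PLAIN_TEXT_FILE_EXTENSIONS:
        return "plain"
    if ext in ACCEPTED_DOCUMENT_FILE_EXTENSIONS:
        return "document"
    return "unknown"


for name in ["notes.md", "Report.PDF", "archive.zip"]:
    print(name, "->", classify_extension(name))
```

Lower-casing the suffix is a choice made in this sketch; whether the real helpers normalize case the same way is an assumption.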

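Text Extraction Flow

Text extraction in Onyx is driven by the extract_file_text function in extract_file_text.py:

File type detection: determine the format from the file extension or MIME type
Strategy selection: pick the extraction method that fits the file type
Extraction: call the matching extraction function to get the text
Error handling and fallback: handle extraction failures and fall back to an alternative

Core code snippet: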

def extract_file_text(
    file: IO[Any],
    file_name: str,
    break_on_unprocessable: bool = True,
    extension: str | None = None,
) -> str:
    extension_to_function: dict[str, Callable[[IO[Any]], str]] = {
        ".pdf": pdf_to_text,
        ".docx": lambda f: docx_to_text_and_images(f, file_name)[0],
        ".pptx": lambda f: pptx_to_text(f, file_name),
        ".xlsx": lambda f: xlsx_to_text(f, file_name),
        ".eml": eml_to_text,
        ".epub": epub_to_text,
        ".html": parse_html_page_basic,
    }
    
    # Try the Unstructured API first when an API key is configured
    if get_unstructured_api_key():
        try:
            return unstructured_to_text(file, file_name)
        except Exception as unstructured_error:
            logger.error(
                f"Failed to process with Unstructured: {str(unstructured_error)}. Falling back to normal processing."
            )
    
    # Pick the extraction function that matches the file extension
    if extension is None:
        extension = get_file_ext(file_name)
    
    if is_accepted_file_ext(extension, OnyxExtensionType.Plain | OnyxExtensionType.Document):
        func = extension_to_function.get(extension, file_io_to_text)
        file.seek(0)
        return func(file)
    
    # If the file type is not recognized, try to treat it as plain text
    file.seek(0)
    if is_text_file(file):
        return file_io_to_text(file)
    
    raise ValueError("Unknown file extension or not recognized as text data")
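
The dispatch-and-fallback pattern in the excerpt can be tried in isolation. The following is a minimal sketch under stated assumptions: `extract_with_fallback`, `plain_text_extractor`, and `fake_html_extractor` are hypothetical names invented for this example, and the "remote" extractor is just a callable standing in for an external service such as the Unstructured API.

```python
from __future__ import annotations

import io
from typing import IO, Callable


def plain_text_extractor(file: IO[bytes]) -> str:
    """Default extractor: decode the bytes as UTF-8, dropping invalid bytes."""
    return file.read().decode("utf-8", errors="ignore")


def fake_html_extractor(file: IO[bytes]) -> str:
    """Stand-in for a real HTML parser; it only tags the output for the demo."""
    return "[html] " + plain_text_extractor(file)


def extract_with_fallback(
    file: IO[bytes],
    extension: str,
    remote_extractor: Callable[[IO[bytes]], str] | None = None,
) -> str:
    """Try the optional remote extractor first, then dispatch by extension."""
    handlers: dict[str, Callable[[IO[bytes]], str]] = {".html": fake_html_extractor}
    if remote_extractor is not None:
        try:
            return remote_extractor(file)
        except Exception as error:
            print(f"remote extraction failed ({error}); falling back")
    # The failed attempt may have consumed the stream, so rewind before retrying.
    file.seek(0)
    return handlers.get(extension, plain_text_extractor)(file)


def broken_remote(file: IO[bytes]) -> str:
    file.read()  # consume the stream, then fail
    raise RuntimeError("service unavailable")


print(extract_with_fallback(io.BytesIO(b"<p>hi</p>"), ".html", broken_remote))
```

The `file.seek(0)` call matters: without it, a remote attempt that read part of the stream would leave the local extractor with truncated input.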

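Advanced Feature: Image Extraction and Processing

Besides text, Onyx can handle images embedded in documents. For DOCX files, for example, it can extract the embedded images and hand them on for storage.

The implementation lives in the docx_to_text_and_images function: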

def docx_to_text_and_images(
    file: IO[Any],
    file_name: str = "",
    image_callback: Callable[[bytes, str], None] | None = None,
) -> tuple[str, Sequence[tuple[bytes, str]]]:
    # Extract the text content
    md = get_markitdown_converter()
    doc = md.convert(to_bytesio(file), stream_info=StreamInfo(mimetype=WORD_PROCESSING_MIME_TYPE))
    
    # Extract embedded images; without a callback, collect them all into a list
    if image_callback is None:
        return doc.markdown, list(extract_docx_images(to_bytesio(file)))
    
    # If a callback is provided, stream each image to it instead of collecting
    for img_file_bytes, img_file_name in extract_docx_images(to_bytesio(file)):
        image_callback(img_file_bytes, img_file_name)
    
    return doc.markdown, []
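
The difference between the two return paths is easiest to see with a toy image source. This is a sketch only: `iter_fake_images` is a hypothetical stand-in for `extract_docx_images`, and `collect_or_stream` mirrors just the callback logic of the excerpt, not the text conversion.

```python
from typing import Callable, Iterator, Optional


def iter_fake_images() -> Iterator[tuple[bytes, str]]:
    """Hypothetical image source yielding (image_bytes, file_name) pairs."""
    yield b"\x89PNG-1", "image1.png"
    yield b"\x89PNG-2", "image2.png"


def collect_or_stream(
    image_callback: Optional[Callable[[bytes, str], None]] = None,
) -> list[tuple[bytes, str]]:
    """Return all images when no callback is given, else stream them out."""
    if image_callback is None:
        # Every image is held in memory until the caller is done with the list.
        return list(iter_fake_images())
    for data, name in iter_fake_images():
        # Each image is handed off immediately and can be freed afterwards.
        image_callback(data, name)
    return []


seen: list[tuple[str, int]] = []
collect_or_stream(lambda data, name: seen.append((name, len(data))))
print(seen)
```

With a callback, only one image needs to be alive at a time, which is why the streaming path is preferable for documents with many large images.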

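Smart Chunking Implementation

Chunking splits a long document into short, coherent pieces, which is essential for retrieval and question answering later on. Onyx builds its chunking strategy on SentenceChunker so that chunk boundaries fall on sentence boundaries rather than in the middle of a sentence.

Chunking Strategy Overview

The Onyx chunking strategy weighs the following factors:

Semantic coherence: split on sentence boundaries and avoid cutting through a unit of meaning
Length control: keep each chunk within the configured token limit
Metadata retention: carry the necessary document metadata along with each chunk
Hierarchy: support hierarchical chunking that follows document structure such as sections and paragraphs

Core Chunking Code

Chunking is implemented mainly by the Chunker class in chunker.py: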

class Chunker:
    def __init__(
        self,
        tokenizer: BaseTokenizer,
        enable_multipass: bool = False,
        enable_large_chunks: bool = False,
        enable_contextual_rag: bool = False,
        blurb_size: int = BLURB_SIZE,
        include_metadata: bool = not SKIP_METADATA_IN_CHUNK,
        chunk_token_limit: int = DOC_EMBEDDING_CONTEXT_SIZE,
        chunk_overlap: int = CHUNK_OVERLAP,
        mini_chunk_size: int = MINI_CHUNK_SIZE,
        callback: IndexingHeartbeatInterface | None = None,
    ) -> None:
        self.include_metadata = include_metadata
        self.chunk_token_limit = chunk_token_limit
        self.enable_multipass = enable_multipass
        self.enable_large_chunks = enable_large_chunks
        self.enable_contextual_rag = enable_contextual_rag
        self.tokenizer = tokenizer
        
        # Build the sentence-level splitter, counting tokens with the given tokenizer
        def token_counter(text: str) -> int:
            return len(tokenizer.encode(text))
            
        self.chunk_splitter = SentenceChunker(
            tokenizer_or_token_counter=token_counter,
            chunk_size=chunk_token_limit,
            chunk_overlap=chunk_overlap,
            return_type="texts",
        )
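
To illustrate what sentence-aware splitting with a token budget does, here is a minimal pure-Python sketch. It is not the SentenceChunker implementation: the regex sentence splitter, the whitespace token counter, and the `overlap_sentences` parameter are all simplifying assumptions made for this example.

```python
import re


def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: count whitespace-separated words."""
    return len(text.split())


def sentence_chunks(text: str, chunk_size: int, overlap_sentences: int = 0) -> list[str]:
    """Greedily pack whole sentences into chunks of roughly chunk_size tokens.

    A sentence longer than chunk_size becomes its own oversized chunk, and
    carried-over overlap sentences can push a chunk past the limit.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and count_tokens(candidate) > chunk_size:
            # Close the current chunk at a sentence boundary.
            chunks.append(" ".join(current))
            # Carry the last few sentences over as overlap, if requested.
            current = current[-overlap_sentences:] if overlap_sentences else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "One two three. Four five. Six seven eight nine. Ten."
print(sentence_chunks(text, chunk_size=5))
```

Passing a token-counting function rather than a tokenizer object, as the excerpt does with `token_counter`, keeps the splitter independent of any particular tokenizer.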

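Chunking Flow in Detail

The Onyx chunking flow consists of these steps:

Document preprocessing: extract the title and metadata to prepare for chunking
Content chunking: produce initial chunks with SentenceChunker
Size adjustment: make sure every chunk stays within the configured range
Metadata attachment: add the required metadata to each chunk
Hierarchical chunking: optionally build large chunks to support multi-level retrieval

The core chunking logic is in the _chunk_document_with_sections method: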

def _chunk_document_with_sections(
    self,
    document: IndexingDocument,
    sections: list[Section],
    title_prefix: str,
    metadata_suffix_semantic: str,
    metadata_suffix_keyword: str,
    content_token_limit: int,
) -> list[DocAwareChunk]:
    chunks: list[DocAwareChunk] = []
    link_offsets: dict[int, str] = {}
    chunk_text = ""
    
    for section_idx, section in enumerate(sections):
        section_text = clean_text(str(section.text or ""))
        section_link_text = section.link or ""
        image_url = section.image_file_id
        
        # A section with an image always gets its own chunk
        if image_url:
            if chunk_text.strip():
                self._create_chunk(document, chunks, chunk_text, link_offsets, 
                                  title_prefix=title_prefix, 
                                  metadata_suffix_semantic=metadata_suffix_semantic,
                                  metadata_suffix_keyword=metadata_suffix_keyword)
                chunk_text = ""
                link_offsets = {}
            
            self._create_chunk(document, chunks, section_text, {0: section_link_text},
                              image_file_id=image_url, title_prefix=title_prefix,
                              metadata_suffix_semantic=metadata_suffix_semantic,
                              metadata_suffix_keyword=metadata_suffix_keyword)
            continue
        
        # Check whether adding this section would push the current chunk over the limit
        section_token_count = len(self.tokenizer.encode(section_text))
        current_token_count = len(self.tokenizer.encode(chunk_text))
        next_section_tokens = len(self.tokenizer.encode(SECTION_SEPARATOR)) + section_token_count
        
        if next_section_tokens + current_token_count <= content_token_limit:
            # It fits: append the section to the current chunk
            if chunk_text:
                chunk_text += SECTION_SEPARATOR
            chunk_text += section_text
            link_offsets[len(shared_precompare_cleanup(chunk_text)) - len(section_text)] = section_link_text
        else:
            # The current chunk is full: emit it and start a new one with this section
            self._create_chunk(document, chunks, chunk_text, link_offsets, 
                              title_prefix=title_prefix,
                              metadata_suffix_semantic=metadata_suffix_semantic,
                              metadata_suffix_keyword=metadata_suffix_keyword)
            link_offsets = {0: section_link_text}
            chunk_text = section_text
    
    # Emit the final chunk (or a single empty chunk if nothing was produced)
    if chunk_text.strip() or not chunks:
        self._create_chunk(document, chunks, chunk_text, link_offsets or {0: ""},
                          title_prefix=title_prefix,
                          metadata_suffix_semantic=metadata_suffix_semantic,
                          metadata_suffix_keyword=metadata_suffix_keyword)
    
    return chunks
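
Stripped of links, images, and metadata, the loop above is a greedy packing algorithm. The sketch below shows only that core idea; `pack_sections` is a hypothetical name, the separator value is assumed, and tokens are approximated by whitespace-separated words (so the separator itself counts as zero tokens here, unlike in the excerpt).

```python
SECTION_SEPARATOR = "\n\n"  # assumed value for this sketch


def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: count whitespace-separated words."""
    return len(text.split())


def pack_sections(sections: list[str], token_limit: int) -> list[str]:
    """Greedily merge consecutive sections while the token limit allows."""
    chunks: list[str] = []
    current = ""
    for section in sections:
        if count_tokens(current) + count_tokens(section) <= token_limit:
            # The section fits: join it onto the chunk being built.
            current = current + SECTION_SEPARATOR + section if current else section
        else:
            # The chunk is full: emit it and start over with this section.
            if current:
                chunks.append(current)
            current = section
    # Emit the tail; an empty input still yields one (empty) chunk,
    # mirroring the `or not chunks` condition in the excerpt.
    if current or not chunks:
        chunks.append(current)
    return chunks


print(pack_sections(["a b", "c d", "e f g"], token_limit=4))
```

Note that this sketch never splits a single oversized section; it simply emits it as its own chunk.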

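Multi-Level Chunking: Small and Large Chunks

Onyx supports a multi-level strategy: on top of the base chunks, it can build larger combined chunks so that retrieval can work at more than one granularity: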

def generate_large_chunks(chunks: list[DocAwareChunk], large_chunk_id: int) -> DocAwareChunk:
    """
    Combine several small chunks into one large chunk for "multipass" retrieval mode.
    """
    merged_chunk = DocAwareChunk(
        source_document=chunks[0].source_document,
        chunk_id=chunks[0].chunk_id,
        blurb=chunks[0].blurb,
        content=chunks[0].content,
        source_links=chunks[0].source_links or {},
        image_file_id=None,
        section_continuation=(chunks[0].chunk_id > 0),
        title_prefix=chunks[0].title_prefix,
        metadata_suffix_semantic=chunks[0].metadata_suffix_semantic,
        metadata_suffix_keyword=chunks[0].metadata_suffix_keyword,
        large_chunk_reference_ids=[chunk.chunk_id for chunk in chunks],
        mini_chunk_texts=None,
        large_chunk_id=large_chunk_id,
        chunk_context="",
        doc_summary="",
        contextual_rag_reserved_tokens=0,
    )
    
    offset = 0
    for i in range(1, len(chunks)):
        merged_chunk.content += SECTION_SEPARATOR + chunks[i].content
        offset += len(SECTION_SEPARATOR) + len(chunks[i-1].content)
        
        for link_offset, link_text in (chunks[i].source_links or {}).items():
            if merged_chunk.source_links is None:
                merged_chunk.source_links = {}
            merged_chunk.source_links[link_offset + offset] = link_text
    
    return merged_chunk
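
The subtle part of the merge is the link bookkeeping: each chunk stores links keyed by character offset within its own content, so those offsets must be shifted by the position at which the chunk starts in the merged text. A minimal sketch of just that step, using plain tuples instead of DocAwareChunk (the function name and separator value are assumptions for this example):

```python
SECTION_SEPARATOR = "\n\n"  # assumed value for this sketch


def merge_with_links(
    parts: list[tuple[str, dict[int, str]]],
) -> tuple[str, dict[int, str]]:
    """Concatenate (text, links) pairs and re-base link offsets on the merged text."""
    merged_text = ""
    merged_links: dict[int, str] = {}
    for text, links in parts:
        if merged_text:
            merged_text += SECTION_SEPARATOR
        # This part starts where the merged text currently ends.
        offset = len(merged_text)
        for local_offset, link in links.items():
            merged_links[local_offset + offset] = link
        merged_text += text
    return merged_text, merged_links


text, links = merge_with_links([
    ("abc", {0: "https://example.com/a"}),
    ("defg", {0: "https://example.com/b", 2: "https://example.com/c"}),
])
print(repr(text), links)
```

The running `offset` computed here equals the sum the excerpt accumulates with `len(SECTION_SEPARATOR) + len(chunks[i-1].content)`.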

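Practical Cases

Case 1: PDF Processing

Onyx processes a PDF document in these steps:

Decrypt the file (if the PDF is encrypted)
Extract the text content
Extract metadata (title, author, and so on)
Optionally extract images

The code is in the read_pdf_file function: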

def read_pdf_file(
    file: IO[Any],
    pdf_pass: str | None = None,
    extract_images: bool = False,
    image_callback: Callable[[bytes, str], None] | None = None,
) -> tuple[str, dict[str, Any], Sequence[tuple[bytes, str]]]:
    from pypdf import PdfReader
    
    metadata: dict[str, Any] = {}
    extracted_images: list[tuple[bytes, str]] = []
    
    try:
        pdf_reader = PdfReader(file)
        
        # Handle encrypted PDFs
        if pdf_reader.is_encrypted and pdf_pass is not None:
            decrypt_success = pdf_reader.decrypt(pdf_pass) != 0
            if not decrypt_success:
                return "", metadata, []
        
        # Extract metadata, keeping only non-empty string values
        if pdf_reader.metadata is not None:
            for key, value in pdf_reader.metadata.items():
                clean_key = key.lstrip("/")
                if isinstance(value, str) and value.strip():
                    metadata[clean_key] = value
        
        # Extract the text of every page
        text = TEXT_SECTION_SEPARATOR.join(page.extract_text() for page in pdf_reader.pages)
        
        # Extract images (if enabled)
        if extract_images:
            for page_num, page in enumerate(pdf_reader.pages):
                for image_file_object in page.images:
                    # Process the image (elided in the original excerpt)
                    pass
        
        return text, metadata, extracted_images
    
    except Exception as e:
        logger.exception("Failed to read PDF")
        return "", metadata, []
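
The metadata step can be exercised without a PDF at hand. PDF info dictionaries use keys with a leading slash, such as "/Title" and "/Author", which is why the excerpt strips it. The sketch below applies the same filter to a plain dict; `clean_pdf_metadata` is a hypothetical helper name for this example.

```python
from typing import Any


def clean_pdf_metadata(raw: dict[str, Any]) -> dict[str, str]:
    """Keep non-empty string values and strip the leading '/' from each key."""
    cleaned: dict[str, str] = {}
    for key, value in raw.items():
        # Non-string values and blank strings are dropped.
        if isinstance(value, str) and value.strip():
            cleaned[key.lstrip("/")] = value
    return cleaned


raw = {"/Title": "User Guide", "/Author": "   ", "/PageCount": 12}
print(clean_pdf_metadata(raw))
```

Dropping blank and non-string values keeps the metadata that is later attached to chunks small and free of entries that add no search value.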

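Case 2: Excel Spreadsheet Processing

When Onyx processes an Excel file, it converts each worksheet to comma-separated text, one line per row, which preserves the table structure:

The code is in the xlsx_to_text function: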

def xlsx_to_text(file: IO[Any], file_name: str = "") -> str:
    try:
        workbook = openpyxl.load_workbook(file, read_only=True)
    except Exception as e:
        logger.warning(f"Failed to extract text from {file_name or 'xlsx file'}: {e}")
        return ""
    
    text_content = []
    for sheet in workbook.worksheets:
        rows = []
        num_empty_consecutive_rows = 0
        
        for row in sheet.iter_rows(min_row=1, values_only=True):
            row_str = ",".join(str(cell or "") for cell in row)
            
            if any(cell is not None and str(cell).strip() for cell in row):
                rows.append(row_str)
                num_empty_consecutive_rows = 0
            else:
                num_empty_consecutive_rows += 1
                if num_empty_consecutive_rows > 100:
                    break
        
        sheet_str = "\n".join(rows)
        text_content.append(sheet_str)
    
    return TEXT_SECTION_SEPARATOR.join(text_content)
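
The row handling can be tried without openpyxl by feeding in rows as tuples, which is the shape `iter_rows(values_only=True)` produces. This is a sketch with a hypothetical function name and a configurable `max_empty_rows`; unlike `str(cell or "")` in the excerpt, it keeps falsy values such as 0, which is a deliberate difference in this example rather than a description of Onyx behavior.

```python
from typing import Any, Iterable, Sequence


def rows_to_text(rows: Iterable[Sequence[Any]], max_empty_rows: int = 100) -> str:
    """Render rows as comma-separated lines, skipping empty rows.

    Stops early once more than max_empty_rows empty rows occur in a row,
    which guards against sheets with a huge blank tail.
    """
    lines: list[str] = []
    empty_streak = 0
    for row in rows:
        if any(cell is not None and str(cell).strip() for cell in row):
            # Only None becomes an empty field; 0 and False are preserved.
            lines.append(",".join("" if cell is None else str(cell) for cell in row))
            empty_streak = 0
        else:
            empty_streak += 1
            if empty_streak > max_empty_rows:
                break
    return "\n".join(lines)


sheet = [("name", "qty", None), (None, None, None), ("bolt", 0, "x")]
print(rows_to_text(sheet))
```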

Performance Tuning and Best Practices

Tuning Chunk Size

Chunk sizes in Onyx are configurable; the defaults are defined in app_configs.py:

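# Chunk size configuration
BLURB_SIZE = 100  # Blurb size (tokens)
MINI_CHUNK_SIZE = 200  # Mini-chunk size (tokens)
DOC_EMBEDDING_CONTEXT_SIZE = 1000  # Document embedding context size (tokens)
LARGE_CHUNK_RATIO = 3  # Number of small chunks combined into one large chunk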

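Adjust these parameters to fit the use case:

Knowledge-base retrieval: prefer smaller chunks (200-300 tokens)
Document summarization: prefer larger chunks (1000+ tokens)
Mixed workloads: enable the multi-level chunking strategy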
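
When weighing these options, it helps to estimate how many chunks (and therefore embeddings) a document will produce. The sketch below uses a simple sliding-window model, which is an assumption: sentence-aware chunking rarely fills every chunk exactly, so real counts tend to be somewhat higher. The function name is hypothetical.

```python
import math


def estimate_chunk_count(total_tokens: int, chunk_size: int, overlap: int = 0) -> int:
    """Estimate chunk count for fixed-size windows that overlap by `overlap` tokens."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    if total_tokens <= 0:
        return 0
    if total_tokens <= chunk_size:
        return 1
    # After the first chunk, each further chunk adds (chunk_size - overlap) new tokens.
    step = chunk_size - overlap
    return 1 + math.ceil((total_tokens - chunk_size) / step)


for size in (200, 1000):
    print(size, estimate_chunk_count(10_000, size))
```

For a 10,000-token document this model gives 50 chunks at 200 tokens and 10 chunks at 1000 tokens, a fivefold difference in index size for the same content.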

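Memory Optimization

When processing large documents or large batches, the following measures help:

Use streaming so that a whole file is never loaded into memory at once
Pass an image callback so that large amounts of image data are not cached
Release resources explicitly once processing is finished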
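
As an illustration of the first point, the sketch below reads a binary stream in fixed-size blocks with a generator, so memory use is bounded by the block size rather than the file size. The helper names are hypothetical, and `io.BytesIO` stands in for a real file opened in binary mode.

```python
import io
from typing import IO, Iterator


def iter_blocks(file: IO[bytes], block_size: int = 64 * 1024) -> Iterator[bytes]:
    """Yield successive blocks of at most block_size bytes until EOF."""
    while True:
        block = file.read(block_size)
        if not block:
            break
        yield block


def count_newlines(file: IO[bytes], block_size: int = 64 * 1024) -> int:
    """Count newline bytes without holding the whole file in memory."""
    return sum(block.count(b"\n") for block in iter_blocks(file, block_size))


# A with-block closes the stream explicitly once processing is finished.
with io.BytesIO(b"line 1\nline 2\nline 3\n") as stream:
    print(count_newlines(stream, block_size=4))
```

Counting a single byte value works across block boundaries; logic that matches multi-byte patterns would need to carry state between blocks.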

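Summary and Outlook

The Onyx document processing module combines the Unstructured library with sentence-aware chunking to give the project (Onyx, formerly danswer) solid document understanding. Its main strengths are:

Multi-format support: about 20 common file extensions, covering the plain-text and document lists shown earlier
Smart chunking: sentence-boundary-based chunking that improves retrieval precision
Image extraction: extraction and processing of images embedded in documents
Extensibility: a modular design that makes it easy to add support for new file formats

According to this article, Onyx plans to add more advanced features in the future:

Multilingual support: better handling of non-English documents
OCR integration: text extraction from scanned documents
Structure preservation: keeping the original document structure to improve answer quality
Performance: faster processing and lower memory use for large documents

With these document processing capabilities, the project can give users more accurate and more complete answers, whether the source is Slack messages, GitHub code, or Confluence pages.

I hope this article helps you understand the document processing techniques in Onyx. For questions or suggestions, please open an Issue on the project.

Disclosure: parts of this article were generated with AI assistance (AIGC) and are for reference only.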