NVIDIA Ingest Installation and Configuration Guide

Article link: https://blog.youkuaiyun.com/gitblog_00895/article/details/147085041


nv-ingest: NVIDIA Ingest is an early access set of microservices for parsing hundreds of thousands of complex, messy unstructured PDFs and other enterprise documents into metadata and text to embed into retrieval systems. Project repository: https://gitcode.com/gh_mirrors/nv/nv-ingest

1. Project Overview

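NVIDIA Ingest is an open-source project that provides a set of microservices for parsing and processing complex unstructured PDFs and other enterprise documents. It converts document content into metadata and text that can be embedded into retrieval systems. NVIDIA Ingest can parse several file formats, including PDF, plain text, Word documents, PowerPoint documents, images, and audio files. The project is written primarily in Python.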

2. Key Technologies and Frameworks

The project relies on several key technologies, including but not limited to:

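PDF parsing: supports multiple PDF parsing backends, such as pdfium, Unstructured.io, and Adobe Content Extraction Services.
Optical character recognition (OCR): extracts text from images.
Embedding computation: computes embeddings of the extracted content for use by downstream applications.
Vector database: for example Milvus, used to store and retrieve embeddings.
Microservice architecture: deployed with container technologies such as Docker and Kubernetes.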
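To make the embedding and vector database items above more concrete, here is a minimal, library-free sketch of similarity search over embeddings. It is not how NVIDIA Ingest or Milvus implement retrieval; the three-dimensional vectors, the `store` dictionary, and the helper names `cosine_similarity` and `top_k` are all made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, store, k=2):
    """Return the k texts whose stored vectors are most similar to the query vector."""
    scored = [(cosine_similarity(query, vec), text) for text, vec in store.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Hypothetical toy "vector store": text snippets mapped to invented embeddings.
store = {
    "revenue table": [0.9, 0.1, 0.0],
    "org chart": [0.0, 0.8, 0.2],
    "quarterly revenue summary": [0.8, 0.2, 0.1],
}

print(top_k([1.0, 0.0, 0.0], store, k=2))
```

A real vector database replaces the linear scan in `top_k` with an index, but the idea of ranking stored vectors by similarity to a query vector is the same.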

3. Installation and Configuration Prerequisites

Before installing NVIDIA Ingest, make sure the following prerequisites are met:

Operating system: Linux, with Ubuntu 22.04 or later recommended.
Python environment: Conda, the Python environment and package manager, must be installed.
Python version: Python 3.10 is required.
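The operating system and Conda prerequisites can be checked with a short script before installing anything. The sketch below is one possible approach, assuming a Linux host that provides the standard `/etc/os-release` file; the helper names `parse_os_release` and `is_supported_ubuntu` are hypothetical. Because Ubuntu 22.04 is a recommendation rather than a hard requirement, a `False` result is only a warning sign.

```python
import shutil
from pathlib import Path


def parse_os_release(text):
    """Parse KEY=value lines from os-release content into a dict."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        info[key] = value.strip().strip('"')
    return info


def is_supported_ubuntu(info, minimum=(22, 4)):
    """True if the parsed os-release describes Ubuntu at or above the minimum version."""
    if info.get("ID") != "ubuntu":
        return False
    try:
        major, minor = (int(part) for part in info.get("VERSION_ID", "").split(".")[:2])
    except ValueError:
        # Missing or malformed VERSION_ID.
        return False
    return (major, minor) >= minimum


if __name__ == "__main__":
    os_release = Path("/etc/os-release")
    info = parse_os_release(os_release.read_text()) if os_release.exists() else {}
    print("Ubuntu 22.04 or later:", is_supported_ubuntu(info))
    print("conda on PATH:", shutil.which("conda") is not None)
```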

Installation Steps

Create a new Conda environment and install the required dependencies:

```
conda create -y --name nvingest python=3.10
conda activate nvingest
conda install -y -c rapidsai -c conda-forge -c nvidia nv_ingest=25.3.0 nv_ingest_client=25.3.0 nv_ingest_api=25.3.0
pip install opencv-python llama-index-embeddings-nvidia pymilvus==2.5.4 "pymilvus[bulk_writer,model]" milvus-lite nvidia-riva-client unstructured-client
```
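After the commands above finish, it can be useful to confirm from inside the activated `nvingest` environment that the interpreter and the main packages are in place. The following is a minimal sketch, not an official check: the module list is an assumption based on the imports used later in this guide plus `pymilvus` and `cv2` (the import name of `opencv-python`), and the helper names are hypothetical.

```python
import importlib.util
import sys

# Assumed top-level module names for the packages installed above.
EXPECTED_MODULES = ("nv_ingest", "nv_ingest_client", "pymilvus", "cv2")


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_is_3_10(version_info=sys.version_info):
    """True if the interpreter's major.minor version is exactly 3.10."""
    return tuple(version_info[:2]) == (3, 10)


if __name__ == "__main__":
    print("Python 3.10:", python_is_3_10())
    print("Missing modules:", missing_modules(EXPECTED_MODULES) or "none")
```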

Set the environment variables NVIDIA_BUILD_API_KEY and NVIDIA_API_KEY. If you do not have these keys, you can obtain them at build.nvidia.com:
```
export NVIDIA_BUILD_API_KEY=nvapi-...
export NVIDIA_API_KEY=nvapi-...
```
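A script can fail early with a clear message if either key is missing, rather than partway through an ingestion run. The sketch below shows one way to do that; the helper name `missing_keys` is hypothetical and only the two variable names come from this guide.

```python
import os

REQUIRED_KEYS = ("NVIDIA_BUILD_API_KEY", "NVIDIA_API_KEY")


def missing_keys(environ=os.environ, names=REQUIRED_KEYS):
    """Return the required variable names that are unset or blank in environ."""
    return [name for name in names if not environ.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All required API keys are set.")
```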

Run your document parsing script. The following is an example script:

```python
import time

from nv_ingest.util.pipeline.pipeline_runners import (
    PipelineCreationSchema,
    start_pipeline_subprocess,
)
from nv_ingest_client.client import Ingestor, NvIngestClient
from nv_ingest_client.message_clients.simple.simple_client import SimpleClient
from nv_ingest_client.util.process_json_files import ingest_json_results_to_blob

# Start the pipeline as a subprocess with the default configuration.
config = PipelineCreationSchema()
pipeline_process = start_pipeline_subprocess(config)

# Create a client that talks to the pipeline on localhost:7671.
client = NvIngestClient(
    message_client_allocator=SimpleClient,
    message_client_port=7671,
    message_client_hostname="localhost",
)

# Settings for the vector database upload step.
milvus_uri = "milvus.db"
collection_name = "test"
sparse = False

# Build the job: extract content from the file, embed it, and upload it to Milvus.
ingestor = (
    Ingestor(client=client)
    .files("data/multimodal_test.pdf")
    .extract(
        extract_text=True,
        extract_tables=True,
        extract_charts=True,
        extract_images=True,
        paddle_output_format="markdown",
        extract_infographics=True,
    )
    .embed()
    .vdb_upload(
        collection_name=collection_name,
        milvus_uri=milvus_uri,
        sparse=sparse,
        dense_dim=2048,
    )
)

print("Starting ingestion..")
t0 = time.time()
results = ingestor.ingest(show_progress=True)
t1 = time.time()

print(f"Time taken: {t1 - t0} seconds")
# Convert the first result set to a text blob and print it.
print(ingest_json_results_to_blob(results[0]))
```
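Beyond printing the text blob, a quick tally of what was extracted can help confirm that tables, charts, and images were picked up. The sketch below assumes that each entry in `results[0]` is a dictionary carrying a `document_type` key; that shape is an assumption about the output, not something this guide confirms, so inspect your own results first. The `sample` data and the helper name `count_by_document_type` are hypothetical.

```python
from collections import Counter


def count_by_document_type(document_results):
    """Count extracted elements per document_type (assumed key; 'unknown' if absent)."""
    return Counter(item.get("document_type", "unknown") for item in document_results)


# Hypothetical stand-in for results[0]; real output will contain more fields.
sample = [
    {"document_type": "text", "metadata": {"content": "Page 1 text"}},
    {"document_type": "structured", "metadata": {"content": "| a | b |"}},
    {"document_type": "text", "metadata": {"content": "Page 2 text"}},
]

for doc_type, count in sorted(count_by_document_type(sample).items()):
    print(f"{doc_type}: {count}")
```

With real output, the call would be `count_by_document_type(results[0])`, provided the assumed shape holds.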

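Follow the steps above to complete the installation and configuration of NVIDIA Ingest. If you run into problems, consult the project's official documentation or community support channels.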

Author's statement: parts of this article were generated with AI assistance (AIGC) and are provided for reference only.