Example Code for Building LangChain Applications: 53. Using Multi-modal LLMs to Handle Mixed Documents in RAG Applications

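Many documents contain a mixture of content types, including text and images.

Yet in most RAG applications, the information captured in images is lost.

With the emergence of multi-modal LLMs such as GPT-4V, it is worth considering how to make use of images in RAG.

This guide highlights how to:

Use Unstructured to parse images, text, and tables from documents (PDFs).
Use multi-modal embeddings (such as CLIP) to embed images and text.
Use VDMS as a vector store with multi-modal support.
Retrieve both images and text using similarity search.
Pass raw images and text chunks to a multi-modal LLM for answer synthesis.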
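
The similarity-search step in the list above can be illustrated with a toy example. This is a minimal sketch, not the code the guide uses: the three-dimensional vectors are made up and merely stand in for real CLIP embeddings, and the helper names `cosine_similarity` and `top_k` are hypothetical. The point is only that images and text placed in one shared embedding space can be ranked against the same query.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, items, k=2):
    """Return the k (label, modality) pairs most similar to query_vec, best first.

    items is a list of (label, modality, vector) tuples.
    """
    scored = [
        (cosine_similarity(query_vec, vec), label, modality)
        for label, modality, vec in items
    ]
    scored.sort(key=lambda entry: entry[0], reverse=True)
    return [(label, modality) for _, label, modality in scored[:k]]


# Made-up vectors standing in for real multi-modal embeddings (assumption).
toy_index = [
    ("photo_of_a_train.jpg", "image", [0.9, 0.1, 0.0]),
    ("paragraph about railroads", "text", [0.8, 0.2, 0.1]),
    ("paragraph about cooking", "text", [0.0, 0.1, 0.9]),
]

toy_query = [1.0, 0.0, 0.0]
results = top_k(toy_query, toy_index, k=2)
```

Here `results` contains one image and one text chunk, which is the situation the final answer-synthesis step has to handle.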

Packages

For unstructured, you will also need poppler and tesseract installed on your system (see each project's installation instructions).

# (newest versions required for multi-modal)
! pip install --quiet -U vdms langchain-experimental

# lock to 0.10.19 due to a persistent bug in more recent versions
! pip install --quiet pdf2image "unstructured[all-docs]==0.10.19" pillow pydantic lxml open_clip_torch
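
Because poppler and tesseract are system packages rather than Python packages, pip will not install them. A quick way to check for them is to look for their command-line tools on the PATH. This is a minimal sketch: the helper name `missing_tools` is hypothetical, and it assumes that poppler is exposed through its `pdftoppm` binary and tesseract through a `tesseract` binary, which is the usual layout but may differ on your system.

```python
import shutil


def missing_tools(tool_names):
    """Return the names from tool_names that are not found on PATH."""
    return [name for name in tool_names if shutil.which(name) is None]


# Assumption: pdftoppm ships with poppler; tesseract is the OCR engine's CLI.
REQUIRED_TOOLS = ["pdftoppm", "tesseract"]

if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("poppler and tesseract look available.")
```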

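Start the VDMS Server

Let's start a VDMS docker container on port 55559 instead of the default 55555. Note the port and the hostname: the vector store needs both, because it connects to the server through the VDMS Python client.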

! docker run --rm -d -p 55559:55555 --name vdms_rag_nb intellabs/vdms:latest

# Connect to VDMS Vector Store
from langchain_community.vectorstores.vdms import VDMS_Client

vdms_client = VDMS_Client(port=55559)

docker: Error response from daemon: Conflict. The container name
"/vdms_rag_nb" is already in use by container
"0c19ed281463ac10d7efe07eb815643e3e534ddf24844357039453ad2b0c27e8".
You have to remove (or rename) that container to be able to reuse that
name. See 'docker run --help'.
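
The Conflict message above means that a container named vdms_rag_nb already exists, typically from an earlier run of the notebook. Before starting another one, it can help to check whether something is already listening on the chosen port. This is a minimal sketch using only the standard library: the helper name `port_is_open` is hypothetical and is not part of VDMS or LangChain.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, `port_is_open("localhost", 55559)` checks the port used in this guide. A successful connection only shows that some process is listening there, not that the process is VDMS.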

# from dotenv import load_dotenv, find_dotenv
# load_dotenv(find_dotenv(), override=True);

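Data Loading

Partition PDF Text and Images

Let's look at an example PDF that contains interesting images.

Famous photographs from the Library of Congress:

https://www.loc.gov/lcm/pdf/LCM_2020_1112.pdf
We will use it as the example below.

We can use partition_pdf from Unstructured, as shown below, to extract the text and images.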

from pathlib import Path

import requests

# Folder with pdf and extracted images
datapath = Path("./multimodal_files").resolve()
datapath.mkdir(parents=True, exist_ok=True)

pdf_url = "https://www.loc.gov/lcm/pdf/LCM_2020_1112.pdf"
pdf_path = str(datapath / pdf_url.split("/")[-1])
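# Fail early on a bad download instead of saving an error page as a PDF
response = requests.get(pdf_url, timeout=60)
response.raise_for_status()
with open(pdf_path, "wb") as f:
    f.write(response.content)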

# Extract images, tables, and chunk text
from unstructured.partition.pdf import partition_pdf

raw_pdf_elements = partition_pdf(
    filename=pdf_path,
    extract_images_in_pdf=True,
    infer_table_structure=True,
    chunking_strategy="by_title",
    max_characters=4000,
    new_after_n_chars=3800,
    combine_text_under_n_chars=2000,
    image_output_dir_path=datapath,
)

datapath = str(datapath)
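
After partitioning, it is useful to confirm that images were actually written to the output folder. This is a minimal sketch: the helper name `list_images` is hypothetical, and it assumes the extracted images are saved as ordinary image files (for example .jpg) directly inside that folder.

```python
from pathlib import Path


def list_images(folder, suffixes=(".jpg", ".jpeg", ".png")):
    """Return sorted paths (as strings) of image files directly inside folder."""
    folder = Path(folder)
    return sorted(
        str(p)
        for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in suffixes
    )
```

Calling `list_images(datapath)` returns an empty list if no images were extracted, which usually points to a missing system dependency or a wrong output folder.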

# Categorize text elements by type
tables =