Example Code for Building LangChain Applications: 52. How to Implement Multi-modal Retrieval-Augmented Generation (RAG) with Chroma and Google

Setup

Install the required dependencies and create an API key for your Google services.

%pip install -U --quiet langchain langchain_community openai chromadb langchain-experimental
%pip install --quiet "unstructured[all-docs]" pypdf pillow pydantic lxml matplotlib tiktoken
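
The Vertex AI models used below authenticate through Google Cloud credentials rather than a raw API key. Here is a minimal sketch of one common setup, assuming application-default credentials; the file path and project ID are placeholders, not values from the original post:

# Authenticate to Google Cloud (assumed setup; adjust to your environment).
# Either run `gcloud auth application-default login` in a terminal, or point
# the client libraries at a service-account key file as below.
import os

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account.json"  # placeholder path
os.environ["GOOGLE_CLOUD_PROJECT"] = "your-gcp-project-id"  # placeholder project ID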

Data Loading

We use a zip file containing a subset of the images and the PDF extracted from the blog post. If you want to follow the full flow, please use the original example.

# First, download and extract the example data
import logging
import zipfile

import requests

logging.basicConfig(level=logging.INFO)

data_url = "https://storage.googleapis.com/benchmarks-artifacts/langchain-docs-benchmarking/cj.zip"
result = requests.get(data_url)
filename = "cj.zip"
with open(filename, "wb") as file:
    file.write(result.content)

with zipfile.ZipFile(filename, "r") as zip_ref:
    zip_ref.extractall()
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("./cj/cj.pdf")
docs = loader.load()
tables = []
texts = [d.page_content for d in docs]
len(texts)

21

Multi-Vector Retriever

Let's generate summaries of the texts and images and save them to a Chroma vector store.

from langchain.prompts import PromptTemplate
from langchain_community.chat_models import ChatVertexAI
from langchain_community.llms import VertexAI
from langchain_core.messages import AIMessage
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda


# Generate summaries of text elements
def generate_text_summaries(texts, tables, summarize_texts=False):
    """
    Summarize text elements
    texts: List of str
    tables: List of str
    summarize_texts: Bool to summarize texts
    """

    # Prompt
    prompt_text = """You are an assistant tasked with summarizing tables and text for retrieval. \
    These summaries will be embedded and used to retrieve the raw text or table elements. \
    Give a concise summary of the table or text that is well optimized for retrieval. Table or text: {element} """
    prompt = PromptTemplate.from_template(prompt_text)
    empty_response = RunnableLambda(
        lambda x: AIMessage(content="Error processing document")
    )
    # Text summary chain
    model = VertexAI(
        temperature=0, model_name="gemini-pro", max_output_tokens=1024
    ).with_fallbacks([empty_response])
    summarize_chain = {"element": lambda x: x} | prompt | model | StrOutputParser()

    # Initialize empty summaries
    text_summaries = []
    table_summaries = []

    # Apply to text if texts are provided and summarization is requested
    if texts and summarize_texts:
        text_summaries = summarize_chain.batch(texts, {"max_concurrency": 1})
    elif texts:
        text_summaries = texts

    # Apply to tables if tables are provided
    if tables:
        table_summaries = summarize_chain.batch(tables, {"max_concurrency": 1})

    return text_summaries, table_summaries


# Get text, table summaries
text_summaries, table_summaries = generate_text_summaries(
    texts, tables, summarize_texts=True
)
len(text_summaries)

21

import base64
import os

from langchain_core.messages import HumanMessage


def encode_image(image_path):
    """Getting the base64 string"""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


def image_summarize(img_base64, prompt):
    """Make image summary"""
    model = ChatVertexAI(model_name="gemini-pro-vision", max_output_tokens=1024)

    msg = model.invoke(
        [
            HumanMessage(
                content=[
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{img_base64}"},
                    },
                ]
            )
        ]
    )
    return msg.content


def generate_img_summaries(path):
    """
    Generate summaries and base64 encoded strings for images
    path: Path to list of .jpg files extracted by Unstructured
    """

    # Store base64 encoded images
    img_base64_list = []

    # Store image summaries
    image_summaries = []

    # Prompt
    prompt = """You are an assistant tasked with summarizing images for retrieval. \
    These summaries will be embedded and used to retrieve the raw image. \
    Give a concise summary of the image that is well optimized for retrieval."""

    # Apply to images
    for img_file in sorted(os.listdir(path)):
        if img_file.endswith(".jpg"):
            img_path = os.path.join(path, img_file)
            base64_image = encode_image(img_path)
            img_base64_list.append(base64_image)
            image_summaries.append(image_summarize(base64_image, prompt))

    return img_base64_list, image_summaries


# Image summaries
img_base64_list, image_summaries = generate_img_summaries("./cj")
len(image_summaries)

5

import uuid

from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_community.embeddings import VertexAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document


def create_multi_vector_retriever(
    vectorstore, text_summaries, texts, table_summaries, tables, image_summaries, images
):
    """
    Create retriever that indexes summaries, but returns raw images or texts
    """

    # Initialize the storage layer
    store = InMemoryStore()
    id_key = "doc_id"

    # Create the multi-vector retriever
    retriever = MultiVectorRetriever(
        vectorstore=vectorstore,
        docstore=store,
        id_key=id_key,
    )

    # Helper function to add documents to the vectorstore and docstore
    def add_documents(retriever, doc_summaries, doc_contents):
        doc_ids = [str(uuid.uuid4()) for _ in doc_contents]
        summary_docs = [
            Document(page_content=s, metadata={id_key: doc_ids[i]})
            for i, s in enumerate(doc_summaries)
        ]
        retriever.vectorstore.add_documents(summary_docs)
        retriever.docstore.mset(list(zip(doc_ids, doc_contents)))

    # Add texts, tables, and images
    # Check that text_summaries is not empty before adding
    if text_summaries:
        add_documents(retriever, text_summaries, texts)
    # Check that table_summaries is not empty before adding
    if table_summaries:
        add_documents(retriever, table_summaries, tables)
    # Check that image_summaries is not empty before adding
    if image_summaries:
        add_documents(retriever, image_summaries, images)

    return retriever


# The vectorstore to use to index the summaries
vectorstore = Chroma(
    collection_name="mm_rag_cj_blog",
    embedding_function=VertexAIEmbeddings(model_name="textembedding-gecko@latest"),
)

# Create retriever
retriever_multi_vector_img = create_multi_vector_retriever(
    vectorstore,
    text_summaries,
    texts,
    table_summaries,
    tables,
    image_summaries,
    img_base64_list,
)

Building the RAG Chain

With the retriever in place, let's build the RAG chain:

import io
import re

from IPython.display import HTML, display
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from PIL import Image


def plt_img_base64(img_base64):
    """Disply base64 encoded string as image"""
    # Create an HTML img tag with the base64 string as the source
    image_html = f'<img src="data:image/jpeg;base64,{img_base64}" />'
    # Display the image by rendering the HTML
    display(HTML(image_html))


def looks_like_base64(sb):
    """Check if the string looks like base64"""
    return re.match("^[A-Za-z0-9+/]+[=]{0,2}$", sb) is not None


def is_image_data(b64data):
    """
    Check if the base64 data is an image by looking at the start of the data
    """
    image_signatures = {
        b"\xff\xd8\xff": "jpg",
        b"\x89\x50\x4e\x47\x0d\x0a\x1a\x0a": "png",
        b"\x47\x49\x46\x38": "gif",
        b"\x52\x49\x46\x46": "webp",
    }
    try:
        header = base64.b64decode(b64data)[:8]  # Decode and get the first 8 bytes
        for sig, format in image_signatures.items():
            if header.startswith(sig):
                return True
        return False
    except Exception:
        return False


def resize_base64_image(base64_string, size=(128, 128)):
    """
    Resize an image encoded as a Base64 string
    """
    # Decode the Base64 string
    img_data = base64.b64decode(base64_string)
    img = Image.open(io.BytesIO(img_data))

    # Resize the image
    resized_img = img.resize(size, Image.LANCZOS)

    # Save the resized image to a bytes buffer
    buffered = io.BytesIO()
    resized_img.save(buffered, format=img.format)

    # Encode the resized image to Base64
    return base64.b64encode(buffered.getvalue()).decode("utf-8")


def split_image_text_types(docs):
    """
    Split base64-encoded images and texts
    """
    b64_images = []
    texts = []
    for doc in docs:
        # Check if the document is of type Document and extract page_content if so
        if isinstance(doc, Document):
            doc = doc.page_content
        if looks_like_base64(doc) and is_image_data(doc):
            doc = resize_base64_image(doc, size=(1300, 600))
            b64_images.append(doc)
        else:
            texts.append(doc)
    if len(b64_images) > 0:
        # If any images were retrieved, keep only the first image and drop the texts
        return {"images": b64_images[:1], "texts": []}
    return {"images": b64_images, "texts": texts}


def img_prompt_func(data_dict):
    """
    Join the context into a single string
    """
    formatted_texts = "\n".join(data_dict["context"]["texts"])
    messages = []

    # Adding the text for analysis
    text_message = {
        "type": "text",
        "text": (
            "You are financial analyst tasking with providing investment advice.\n"
            "You will be given a mixed of text, tables, and image(s) usually of charts or graphs.\n"
            "Use this information to provide investment advice related to the user question. \n"
            f"User-provided question: {data_dict['question']}\n\n"
            "Text and / or tables:\n"
            f"{formatted_texts}"
        ),
    }
    messages.append(text_message)
    # Adding image(s) to the messages if present
    if data_dict["context"]["images"]:
        for image in data_dict["context"]["images"]:
            image_message = {
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{image}"},
            }
            messages.append(image_message)
    return [HumanMessage(content=messages)]


def multi_modal_rag_chain(retriever):
    """
    Multi-modal RAG chain
    """

    # Multi-modal LLM
    model = ChatVertexAI(temperature=0, model_name="gemini-pro-vision", max_output_tokens=1024)

    # RAG pipeline
    chain = (
        {
            "context": retriever | RunnableLambda(split_image_text_types),
            "question": RunnablePassthrough(),
        }
        | RunnableLambda(img_prompt_func)
        | model
        | StrOutputParser()
    )

    return chain


# Create RAG chain
chain_multimodal_rag = multi_modal_rag_chain(retriever_multi_vector_img)

Let's check that we get images back as documents:

query = "What are the EV / NTM and NTM rev growth for MongoDB, Cloudflare, and Datadog?"
docs = retriever_multi_vector_img.invoke(query, limit=1)

# We get back 4 docs
len(docs)

4

plt_img_base64(docs[0])

[Image output: the retrieved base64 document rendered inline as a chart]

Let's run RAG on the same query:

chain_multimodal_rag.invoke(query)

' | Company | EV / NTM Rev | NTM Rev Growth |\n|---|---|---|\n| MongoDB | 14.6x | 17% |\n| Cloudflare | 13.4x | 28% |\n| Datadog | 13.1x | 19% |'

As we can see, the model was able to pick out the correct values needed to answer the question.

Summary

This article described how to implement multi-modal retrieval-augmented generation (RAG) with Chroma and Google. It covered the overall architecture, the key components, the code implementation, and a worked example, showing how to combine text and image data and use retrieval to improve the performance of the generation model. The code sections provide step-by-step implementations, annotated with comments to aid understanding.
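
Note that both the Chroma collection and the InMemoryStore docstore above live only in memory. As a minimal sketch of one way to keep the summary index between runs, Chroma accepts a persist_directory (the directory name here is an assumption; the docstore would still need to be rebuilt or replaced with a persistent key-value store):

# Hypothetical persistent variant of the vector store defined earlier
from langchain_community.embeddings import VertexAIEmbeddings
from langchain_community.vectorstores import Chroma

vectorstore = Chroma(
    collection_name="mm_rag_cj_blog",
    embedding_function=VertexAIEmbeddings(model_name="textembedding-gecko@latest"),
    persist_directory="./chroma_mm_rag",  # assumed local path; Chroma writes its index here
)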
