LangChain in Practice (20): Building Multimodal Applications - Combining Vision and Text

Latest recommended article published on 2025-12-14 10:48:30

License: CC 4.0 BY-SA

Article tags:

#langchain #LLM application development #RAG #LLM #Agents #AI

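This is the twentieth article in the "LangChain in Practice" series. It looks at how to combine multimodal models such as GPT-4-Vision with LangChain to handle tasks that mix images and text. By the end, you should understand the core techniques for building applications that can interpret and process visual information.

Preface

With the arrival of multimodal models such as GPT-4-Vision, AI systems are no longer limited to text; they can also interpret and analyze visual content such as images and video. Combining these abilities opens up new kinds of AI applications. This article shows how to integrate multimodal capabilities into the LangChain framework and build applications that can actually "see" and "understand".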

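Core Value and Use Cases of Multimodal AI

Why do we need multimodal AI?

Complementary information: images and text carry complementary information, and combining them yields a deeper understanding
Richer scenarios: real-world problems often involve several types of information
User experience: supports more natural interaction (for example, taking a photo and asking a question about it)
Broader applications: enables new use cases and extends what an application can do

Typical use cases

Image description and question answering: analyze image content and answer questions about it
Document understanding: process complex documents that contain both text and images
Visual search: retrieve information based on image content
Content moderation: analyze text and images together for safety review
Educational assistance: interpret charts, diagrams, and other teaching materials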

Environment Setup and Installation

First, install the required dependencies:

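# Install the core libraries
pip install langchain openai python-dotenv

# Install the image processing libraries
pip install pillow opencv-python

# Install the HTTP client (base64 is part of the Python standard library and needs no installation)
pip install requests

# Install the optional visualization library
pip install matplotlib

# Install the document processing libraries (for PDFs and other documents that contain images)
pip install pymupdf python-pptx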

Set the required environment variable:

export OPENAI_API_KEY="your-openai-api-key"
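
Before making any API calls, it can help to check that the key is actually visible to the Python process. The following is a minimal sketch, not part of the original article: `load_api_key` is a hypothetical helper name, and it only reads the variable exported above and raises a readable error when it is missing or empty.

```python
import os
from typing import Mapping, Optional


def load_api_key(env: Optional[Mapping[str, str]] = None,
                 name: str = "OPENAI_API_KEY") -> str:
    """Return the API key from the environment, or raise a clear error.

    `load_api_key` is a hypothetical helper for illustration only.
    """
    # Fall back to the real process environment when no mapping is given.
    source = os.environ if env is None else env
    key = source.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the examples")
    return key


# Demonstration with an explicit mapping, so the snippet runs without a real key.
print(load_api_key({"OPENAI_API_KEY": "your-openai-api-key"}))
```

Calling `load_api_key()` with no arguments reads from `os.environ`, which is the case that matters once the `export` line above has been run in the same shell.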

Multimodal Model Basics: Getting Started with GPT-4-Vision

1. Calling the GPT-4-Vision API Directly

First, let's look at how to call the GPT-4-Vision API directly:

import base64
import requests
from PIL import Image
import io
import os

class GPT4VisionClient:
    def __init__(self):
        self.api_key = os.getenv("OPENAI_API_KEY")
        self.api_url = "https://api.openai.com/v1/chat/completions"
        self.headers = {
            "Content-Type": "application/json",
            "Authorization": f"Bearer {self.api_key}"
        }
    
    def encode_image(self, image_path):
        """Encode an image file as a base64 string."""
        with open(image_path, "rb") as image_file:
            return base64.b64encode(image_file.read()).decode('utf-8')
    
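    def analyze_image(self, image_path, prompt):
        """Analyze a single image with GPT-4-Vision."""
        # Encode the image
        base64_image = self.encode_image(image_path)
        
        # Build the request payload
        payload = {
            # Preview model name from the original article; substitute a
            # currently available vision-capable model if this one is not served.
            "model": "gpt-4-vision-preview",
            "messages": [
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": prompt
                        },
                        {
                            "type": "image_url",
                            "image_url": {
                                # The data URL is labeled as JPEG regardless of the actual file type
                                "url": f"data:image/jpeg;base64,{base64_image}"
                            }
                        }
                    ]
                }
            ],
            "max_tokens": 1000
        }
        
        # Send the request and fail early on HTTP errors
        response = requests.post(self.api_url, headers=self.headers, json=payload, timeout=60)
        response.raise_for_status()
        response_data = response.json()
        
        return response_data["choices"][0]["message"]["content"]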
    
    def analyze_multiple_images(self, image_paths, prompt):
        """Analyze several images in a single request."""
        # Start with the text prompt, then append one image part per file
        content = [{"type": "text", "text": prompt}]
        
        for image_path in image_paths:
            base64_image = self.encode_image(image_path)
            content.append({
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/jpeg;base64,{base64_image}"
                }
            })
        
        payload = {
            "model": "gpt-4-vision-preview",
            "messages": [
                {
                    "role": "user",
                    "content": content
                }
            ],
            "max_tokens": 2000
        }
        
        # Send the request and fail early on HTTP errors
        response = requests.post(self.api_url, headers=self.headers, json=payload, timeout=60)
        response.raise_for_status()
        response_data = response.json()
        
        return response_data["choices"][0]["message"]["content"]

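# Usage example
vision_client = GPT4VisionClient()

# Analyze a single image
result = vision_client.analyze_image("path/to/image.jpg", "Describe the contents of this image")
print(result)

# Analyze multiple images
results = vision_client.analyze_multiple_images(
    ["image1.jpg", "image2.jpg"], 
    "Compare these two images and describe their similarities and differences"
)
print(results)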
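
The client above labels every image as `image/jpeg` and indexes straight into `choices`. As a hedged sketch of one way to tighten both points, the helpers below guess the MIME type from the file name and surface the API's `error` object when a response carries one. The function names (`image_to_data_url`, `build_user_content`, `extract_reply`) are illustrative assumptions and do not come from the article or from any library.

```python
import base64
import mimetypes
import os
import tempfile
from typing import Dict, List


def image_to_data_url(image_path: str, default_mime: str = "image/jpeg") -> str:
    """Encode a local image as a data URL, guessing the MIME type from the file name."""
    mime, _ = mimetypes.guess_type(image_path)
    if mime is None or not mime.startswith("image/"):
        mime = default_mime  # fall back when the extension is unknown
    with open(image_path, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode("utf-8")
    return f"data:{mime};base64,{encoded}"


def build_user_content(prompt: str, image_paths: List[str]) -> List[Dict]:
    """Build the `content` list for one user message: the text first, then each image."""
    content: List[Dict] = [{"type": "text", "text": prompt}]
    for path in image_paths:
        content.append({"type": "image_url", "image_url": {"url": image_to_data_url(path)}})
    return content


def extract_reply(response_data: Dict) -> str:
    """Return the assistant text, or raise if the response body reports an error."""
    if "error" in response_data:
        message = response_data["error"].get("message", "unknown error")
        raise RuntimeError(f"API error: {message}")
    return response_data["choices"][0]["message"]["content"]


# Demonstration on a throwaway file; the bytes are placeholders, not a real image.
with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as tmp:
    tmp.write(b"abc")
print(image_to_data_url(tmp.name))
os.remove(tmp.name)
```

The list returned by `build_user_content` has the same shape as the `content` value that `analyze_multiple_images` assembles by hand, so it could be dropped into that payload unchanged.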

2. Image Preprocessing Utilities

Images usually need some preprocessing before they are sent to the model:

import os

from PIL import Image, ImageEnhance, ImageFilter
import cv2
import numpy as np

class ImagePreprocessor:
    def __init__(self):
        pass
    
    def resize_image(self, image_path, max_size=1024):
        """Resize an image so its longer side is at most max_size, keeping the aspect ratio."""
        img = Image.open(image_path)
        
        # Compute the new dimensions
        width, height = img.size
        if max(width, height) > max_size:
            if width > height:
                new_width = max_size
                new_height = int(height * (max_size / width))
            else:
                new_height = max_size
                new_width = int(width * (max_size / height))
            
            img = img.resize((new_width, new_height), Image.Resampling.LANCZOS)
        
        # Save the resized image next to the original; splitext only touches
        # the extension, so dots elsewhere in the path are left alone
        root, ext = os.path.splitext(image_path)
        output_path = f"{root}_resized{ext}"
        img.save(output_path)
        return output_path
    
    def enhance_image(self, image_path, contrast=1.2, sharpness=1.1):
        """Enhance image quality by adjusting contrast and sharpness."""
        img = Image.open(image_path)
        
        # Increase contrast
        enhancer = ImageEnhance.Contrast(img)
        img = enhancer.enhance(contrast)
        
        # Increase sharpness
        enhancer = ImageEnhance.Sharpness(img)
        img = enhancer.enhance(sharpness)
        
        root, ext = os.path.splitext(image_path)
        output_path = f"{root}_enhanced{ext}"
        img.save(output_path)
        return output_path
    
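    def extract_text_from_image(self, image_path):
        """Extract text from an image using OCR."""
        try:
            import pytesseract
            img = Image.open(image_path)
            # 'chi_sim+eng' requires the Simplified Chinese and English Tesseract language data
            text = pytesseract.image_to_string(img, lang='chi_sim+eng')
            return text
        except ImportError:
            return "Please install pytesseract to use the OCR feature"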
    
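    def detect_objects(self, image_path):
        """Object detection (illustrative only: counts edge contours, not real objects)."""
        # Simple contour-based detection with OpenCV
        image = cv2.imread(image_path)
        if image is None:
            # cv2.imread returns None instead of raising when the file cannot be read
            raise FileNotFoundError(f"Could not read image: {image_path}")
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        
        # Run edge detection
        edges = cv2.Canny(gray, 100, 200)
        
        # Find contours
        contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        
        return f"Detected {len(contours)} object contours"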

# Usage example
preprocessor = ImagePreprocessor()
resized_image = preprocessor.resize_image("path/to/image.jpg")
enhanced_image = preprocessor.enhance_image(resized_image)
text_content = preprocessor.extract_text_from_image(enhanced_image)
object_info = preprocessor.detect_objects(enhanced_image)
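
The resizing arithmetic and the output-path naming are easy to get subtly wrong, so here is a small standalone sketch of both that can be checked without Pillow. `fit_within` and `with_suffix` are hypothetical helper names introduced for illustration. Unlike the `int()` truncation in `resize_image`, this version rounds to the nearest pixel and never returns a zero dimension; that is a design choice of the sketch, not something the article specifies.

```python
import os
from typing import Tuple


def fit_within(width: int, height: int, max_size: int = 1024) -> Tuple[int, int]:
    """Scale (width, height) so the longer side is at most max_size, keeping the aspect ratio."""
    longest = max(width, height)
    if longest <= max_size:
        return width, height  # already small enough, leave unchanged
    scale = max_size / longest
    # Round to the nearest pixel and keep each side at least 1 px.
    return max(1, round(width * scale)), max(1, round(height * scale))


def with_suffix(image_path: str, suffix: str) -> str:
    """Insert `suffix` before the file extension, leaving directory names untouched."""
    root, ext = os.path.splitext(image_path)
    return f"{root}{suffix}{ext}"


print(fit_within(4000, 3000))                      # (1024, 768)
print(fit_within(800, 600))                        # (800, 600)
print(with_suffix("./photos/cat.jpg", "_resized"))  # ./photos/cat_resized.jpg
```

A naive `str.replace(".", "_resized.")` would rewrite the leading `./` in the last example as well, which is why splitting off only the extension is the safer approach.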

Integrating Multimodal Capabilities into LangChain

1. Creating Multimodal Tools

First, we create some tools for handling images and multimodal content:

from langchain.tools import BaseTool
from pydantic import BaseModel, Field
from typing import Type, Optional, List
import os

class ImageAnalysisInput(BaseModel):
    image_path: str = Field(description="Path to the image file")
    question: str = Field(description="Question or instruction about the image")

class MultiImageAnalysisInput