AI Agent and Large Model Primer [7]: Building a Traditional RAG Application

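Preface

This installment explains how to build a traditional RAG application that augments your large-model application with retrieval, reducing the model's hallucinations and improving the accuracy and usefulness of what it generates. The focus is on understanding the code; nothing is run in this installment. The content is mainly theoretical, and a working demonstration requires the later installments.

If you are not yet familiar with RAG, refer to this article:

AI Agent and Large Model Primer [1]: What Is RAG? (CSDN blog)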

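Environment requirements

You need the Milvus vector database installed and running.

If you do not have it yet, follow this link to download and install it: Quick Start | Milvus Documentation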
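
The code later in this article reads the Milvus address from a `MILVUS_URI` environment variable. As a minimal sketch of how that lookup could be made more forgiving, the helper below falls back to a local address when the variable is unset. The helper name `resolve_milvus_uri` and the fallback `http://localhost:19530` (the default port of a standalone Milvus) are assumptions for illustration, not part of the original code.

```python
import os

# Assumed fallback: a standalone Milvus listening on its default port.
DEFAULT_MILVUS_URI = "http://localhost:19530"


def resolve_milvus_uri(env_var: str = "MILVUS_URI") -> str:
    """Return the Milvus URI from the environment, or a local fallback.

    Hypothetical helper; it is not part of llama_index or pymilvus.
    """
    uri = os.getenv(env_var, "").strip()
    return uri or DEFAULT_MILVUS_URI


if __name__ == "__main__":
    print("Connecting to Milvus at:", resolve_milvus_uri())
```

The returned value could then be passed as the `uri` argument wherever the article's code calls `os.getenv("MILVUS_URI")` directly.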

Creating the RAG class

Code reference

import os
from abc import ABC, abstractmethod

from llama_index.core import VectorStoreIndex, load_index_from_storage
from llama_index.core.indices.base import BaseIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.storage.storage_context import DEFAULT_PERSIST_DIR, StorageContext

from llama_index.vector_stores.milvus import MilvusVectorStore


class RAG(ABC):
    """
    Base class for the RAG architecture.
    Provides data loading, index creation (local and remote), and loading of existing indexes.
    Deriving from ABC is what makes @abstractmethod enforceable.
    """
    def __init__(self, files: list[str]):
        """
        Initialize the RAG class.
        :param files: list of files to process
        """
        self.files = files

    @abstractmethod
    async def load_data(self):
        """
        Load data: an async abstract method; concrete implementations return the loaded documents.
        """

    async def create_index_local(self, persist_dir=DEFAULT_PERSIST_DIR) -> BaseIndex:
        """
        Create a local index.
        :param persist_dir: local persistence directory
        :return: BaseIndex
        """
        # Load the data
        data = await self.load_data()
        # Create a sentence splitter
        node_parser = SentenceSplitter.from_defaults()
        # Split the documents into nodes
        nodes = node_parser.get_nodes_from_documents(data)
        # Build the vector store index
        index = VectorStoreIndex(nodes)
        # Persist the index to disk
        index.storage_context.persist(persist_dir=persist_dir)
        # Return the created index
        return index

    async def create_index(self, collection_name="default") -> BaseIndex:
        """
        Create a remote index.
        :param collection_name: name of the remote collection, defaults to "default"
        :return: BaseIndex
        """
        # Load the data
        data = await self.load_data()
        # Create a sentence splitter
        node_parser = SentenceSplitter.from_defaults()
        # Split the documents into nodes
        nodes = node_parser.get_nodes_from_documents(data)
        # Connect to the Milvus vector database, specifying the collection name
        # and the embedding dimension (dim must match the embedding model in use)
        vector_store = MilvusVectorStore(
            uri=os.getenv("MILVUS_URI"),
            collection_name=collection_name,
            dim=512,
            overwrite=False,
        )
        # Create a storage context with default settings and the given vector store
        storage_context = StorageContext.from_defaults(vector_store=vector_store)
        # Build the index from the nodes. from_vector_store() does not accept nodes,
        # so the VectorStoreIndex constructor is used here.
        index = VectorStoreIndex(nodes, storage_context=storage_context)
        return index

    @staticmethod
    async def load_index(collection_name="default") -> BaseIndex:
        """
        Load a remote index.
        :param collection_name: name of the remote collection, defaults to "default"
        :return: BaseIndex
        """
        # Connect to the Milvus vector database, specifying the collection name
        # and the embedding dimension
        vector_store = MilvusVectorStore(
            uri=os.getenv("MILVUS_URI"),
            collection_name=collection_name,
            dim=512,
            overwrite=False,
        )
        # Create a VectorStoreIndex on top of the existing vector store instead of
        # building a new store from scratch.
        #   vector_store: the store that already holds the indexed vectors
        #   returns: a VectorStoreIndex that uses that store as its data source
        index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
        return index

    @staticmethod
    async def load_index_local(persist_dir=DEFAULT_PERSIST_DIR) -> BaseIndex:
        """
        Load a local index.
        :param persist_dir: local persistence directory
        :return: BaseIndex
        """
        index = load_index_from_storage(StorageContext.from_defaults(persist_dir=persist_dir))
        return index
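
A note on the `@abstractmethod` decorator used in the base class: Python only enforces it when the class derives from `abc.ABC` (or otherwise uses `ABCMeta`). The sketch below uses two hypothetical classes, `LooseBase` and `StrictBase`, to show the difference; neither name comes from the article's code.

```python
from abc import ABC, abstractmethod


class LooseBase:
    """Hypothetical class: @abstractmethod without ABC is not enforced."""

    @abstractmethod
    def load_data(self):
        ...


class StrictBase(ABC):
    """Hypothetical class: deriving from ABC makes the check effective."""

    @abstractmethod
    def load_data(self):
        ...


def can_instantiate(cls) -> bool:
    """Return True if cls() succeeds, False if Python rejects it as abstract."""
    try:
        cls()
        return True
    except TypeError:
        return False


if __name__ == "__main__":
    print("LooseBase instantiable:", can_instantiate(LooseBase))
    print("StrictBase instantiable:", can_instantiate(StrictBase))
```

With the `ABC` base, forgetting to implement `load_data` in a subclass fails at construction time rather than later, when the index is being built.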

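We save this file as base_rag.py. It serves as the base class for building RAG and mainly covers how indexes are created from files. In a traditional RAG application, files are split into chunks, converted to vectors, and stored in a vector database. At query time, similar vectors are looked up in the database, the matching text is passed to the large model, and the model organizes it into an answer.

So this step creates the vector index and stores the vectors.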
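
The flow described above (split, embed, store, retrieve, hand to the model) can be sketched without any external service. The following is a toy illustration only: it swaps in a bag-of-words counter for a real embedding model and a plain Python list for Milvus, and every name in it (`split_sentences`, `embed`, `cosine`, `retrieve`, `build_prompt`) is hypothetical.

```python
import math
import re
from collections import Counter


def split_sentences(text: str) -> list[str]:
    """Split text into sentence chunks on ., !, ? (a stand-in for a real splitter)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts instead of a dense vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, store: list[tuple[str, Counter]], top_k: int = 1) -> list[str]:
    """Return the top_k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


def build_prompt(query: str, context: list[str]) -> str:
    """Assemble the text that would be sent to the large model."""
    return "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}"


if __name__ == "__main__":
    document = (
        "Milvus is a vector database. "
        "LlamaIndex builds indexes over documents. "
        "OCR turns images into text."
    )
    # "Store" step: keep each chunk next to its vector.
    store = [(chunk, embed(chunk)) for chunk in split_sentences(document)]
    question = "Which database stores vectors?"
    print(build_prompt(question, retrieve(question, store)))
```

In the real application each of these toy pieces is replaced by a library component: `SentenceSplitter` for chunking, an embedding model for vectors, and Milvus for storage and similarity search.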

Building the traditional RAG application

Next we build the traditional RAG application itself.

Example code

import asyncio
import os
from datetime import datetime

# SimpleDirectoryReader reads files from disk into Document objects
from llama_index.core import SimpleDirectoryReader

# The RAG base class, and the OCR helper that converts a file to text
# (the relative import means this module must be run as part of its package)
from .base_rag import RAG
from rag.utils import ocr_file_to_text_llm_kimi


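class TraditionalRAG(RAG):
    """
    Traditional retrieval-augmented generation; inherits from RAG and handles ordinary documents.
    """
    async def load_data(self):
        """
        Load data asynchronously.

        Iterates over the file list, uses OCR to convert each image or document to text,
        and saves the text to a temporary file. It then reads the temporary file into a
        Document object, appends it to the document list, and deletes the temporary file.
        :return: list of Document objects
        """
        # Initialize the document list
        docs = []
        for file in self.files:
            print("Processing file:", file)
            # Run OCR on the image or document
            contents = ocr_file_to_text_llm_kimi(file)
            # Name the temporary file after the current timestamp
            temp_file = datetime.now().strftime("%Y%m%d%H%M%S") + ".txt"
            # Write the OCR output to the temporary file
            with open(temp_file, "w", encoding="utf-8") as f:
                f.write(contents)
            # Read the temporary file with SimpleDirectoryReader
            doc = SimpleDirectoryReader(input_files=[temp_file]).load_data()[0]
            # Alternative (disabled): build the Document by hand and record the source path
            # doc = Document(text="\n\n".join([d.text for d in data[0:]]), metadata={"path": file})
            # print(doc)
            docs.append(doc)
            # Delete the temporary file
            os.remove(temp_file)
        # Return the document list
        return docs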
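# Run the following only when executed as the main program
if __name__ == "__main__":
    rag = TraditionalRAG(files=["../test_data/222.png"])
    # load_index_local() expects an index already persisted by create_index_local()
    asyncio.run(rag.load_index_local())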

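This code passes the files into the load_data method, hands each one to OCR for recognition, then writes the result to a temporary file and loads it as a document; the comments in the code explain each step.

OK, that completes the traditional RAG build. In later installments I will optimize this and stop writing it this way; this installment is meant to strengthen your understanding of RAG. The OCR used here will be explained and built in a later installment.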
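
The temporary-file step can also be written with the standard `tempfile` module, which picks a unique file name instead of relying on a timestamp. The sketch below is a variant under stated assumptions, not the author's code: `fake_ocr` is a hypothetical stand-in for `ocr_file_to_text_llm_kimi`, and `load_texts` returns plain strings rather than llama_index `Document` objects so that it runs without extra dependencies.

```python
import os
import tempfile


def fake_ocr(path: str) -> str:
    """Hypothetical stand-in for the OCR helper; returns canned text."""
    return f"recognized text from {os.path.basename(path)}"


def load_texts(files: list[str], ocr=fake_ocr) -> list[str]:
    """OCR each file, round-trip the text through a temp file, and return it."""
    texts = []
    for file in files:
        contents = ocr(file)
        # delete=False so the file can be reopened by name on every platform.
        with tempfile.NamedTemporaryFile(
            "w", suffix=".txt", encoding="utf-8", delete=False
        ) as tmp:
            tmp.write(contents)
            temp_path = tmp.name
        try:
            # In the real class, a document reader would load temp_path here.
            with open(temp_path, encoding="utf-8") as f:
                texts.append(f.read())
        finally:
            # Remove the temp file even if reading fails.
            os.remove(temp_path)
    return texts


if __name__ == "__main__":
    print(load_texts(["../test_data/222.png"]))
```

The `try`/`finally` is the main design difference from the timestamp version: the temporary file is cleaned up even when loading raises an exception.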