PureRAG: An API Documentation Retrieval and Question-Answering System Based on a Vector Database

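As part of the HarmonyOS intelligent programming assistant project, we built a subsystem called PureRAG: an API documentation retrieval and question-answering system based on Retrieval-Augmented Generation (RAG). This article describes PureRAG's design, its core components, and the key parts of its implementation.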

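1. PureRAG System Overview

PureRAG addresses one problem: letting developers find relevant information in the HarmonyOS API documentation quickly by asking natural-language questions. It follows the RAG architecture: document content is embedded and stored as vectors, semantic retrieval finds the document fragments most relevant to the user's query, and a large language model then generates an answer grounded in those fragments.

PureRAG consists of four core modules:

Database construction module: converts API documents into vectors and builds the vector database
Database extraction module: loads the vector database and extracts document data from it
Query engine module: processes user queries, retrieves relevant documents, and generates answers
Retriever module: provides a retrieval interface compatible with APIQASystem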

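2. Building the Vector Database

The vector database is the foundation of PureRAG. We convert each API document into a vector with the BGE-M3 embedding model (developed by BAAI and called here through an HTTP embeddings API) and store the results in the RAG database.

2.1 Database Structure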

class RAGDatabase:
    """In-memory container for the RAG database."""

    def __init__(self):
        """Create an empty database."""
        self.docs = []  # list of documents
        self.embeddings = []  # list of embedding vectors
        self.doc_ids = []  # list of document IDs

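The database has three main components:

docs: the document content, including the title, overview, and individual sections
embeddings: the vector representation of each document
doc_ids: the unique identifier of each document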
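A minimal sketch of how records might be appended so that the three lists stay aligned by position. The `add_document` helper and the sample record are assumptions for illustration; the article only shows the constructor.

```python
from typing import Any, Dict, List


class RAGDatabase:
    """Same three lists as above, plus a hypothetical add_document helper."""

    def __init__(self) -> None:
        self.docs: List[Dict[str, Any]] = []
        self.embeddings: List[List[float]] = []
        self.doc_ids: List[str] = []

    def add_document(self, doc: Dict[str, Any], embedding: List[float]) -> None:
        """Append one record; all three lists grow together."""
        if not embedding:
            raise ValueError("refusing to store a document without an embedding")
        self.docs.append(doc)
        self.embeddings.append(embedding)
        self.doc_ids.append(doc.get("doc_id", ""))


db = RAGDatabase()
db.add_document({"doc_id": "doc-1", "title": "Example API"}, [0.1, 0.2, 0.3])
```

Keeping the append logic in one place makes it harder for a document to end up at a different position than its vector, which matters because retrieval maps a similarity score back to a document by index.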

2.2 Document Vectorization

Vectorization is handled by the get_embedding function, which sends the text to the embeddings endpoint that serves the BGE-M3 model:

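import requests
from typing import List

# API_URL, API_KEY and MODEL are module-level settings defined elsewhere in the project.

def get_embedding(text: str) -> List[float]:
    """
    Get the embedding of a text from the BGE-M3 embeddings endpoint.

    Args:
        text: The text to embed.

    Returns:
        The embedding vector, or an empty list if the request fails.
    """
    headers = {
        "Content-Type": "application/json"
    }
    if API_KEY:
        headers["Authorization"] = f"Bearer {API_KEY}"

    payload = {
        "model": MODEL,
        "input": text
    }

    try:
        response = requests.post(
            f"{API_URL}/embeddings",
            headers=headers,
            json=payload,
            timeout=30  # avoid hanging forever on a stalled connection
        )
        response.raise_for_status()

        result = response.json()
        embedding = result.get("data", [{}])[0].get("embedding", [])

        return embedding
    except Exception as e:
        print(f"Failed to get embedding: {str(e)}")
        # On failure, return an empty vector; callers must check for it
        return []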
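Because get_embedding returns an empty list on failure, the build step needs to filter those results out before the vectors are stacked into an array. The following is a minimal sketch of such a loop, not the project's actual build code: `build_database`, `embed_fn`, the choice of title plus overview as the embedded text, and the `fake_embed` stand-in are all assumptions.

```python
from typing import Any, Callable, Dict, List, Tuple


def build_database(
    raw_docs: List[Dict[str, Any]],
    embed_fn: Callable[[str], List[float]],
) -> Tuple[List[Dict[str, Any]], List[List[float]], List[str]]:
    """Embed each document and keep only those whose embedding succeeded."""
    docs, embeddings, doc_ids = [], [], []
    for doc in raw_docs:
        # Assumption: title plus overview is the text that gets embedded.
        text = f"{doc.get('title', '')}\n{doc.get('overview', '')}"
        vector = embed_fn(text)
        if not vector:
            print(f"Skipping {doc.get('doc_id', '?')}: no embedding returned")
            continue
        docs.append(doc)
        embeddings.append(vector)
        doc_ids.append(doc.get("doc_id", ""))
    return docs, embeddings, doc_ids


def fake_embed(text: str) -> List[float]:
    """Stand-in for get_embedding: fails on text containing 'broken'."""
    return [] if "broken" in text else [float(len(text)), 1.0]


docs, embeddings, doc_ids = build_database(
    [
        {"doc_id": "a", "title": "Button", "overview": "A clickable component"},
        {"doc_id": "b", "title": "broken", "overview": ""},
    ],
    fake_embed,
)
```

In real use, `embed_fn` would be get_embedding itself; passing it in as a parameter keeps the loop testable without network access.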

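2.3 Saving the Database

Once built, the database is written to disk by the save method, which stores the document content, the vectors, and an index file: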

import json
import os
import pickle

import numpy as np

def save(self, db_path: str):
    """
    Save the database to disk.

    Args:
        db_path: Path to the database directory.
    """
    # Make sure the directory exists
    os.makedirs(db_path, exist_ok=True)

    # Save the documents
    docs_file = os.path.join(db_path, 'docs.json')
    with open(docs_file, 'w', encoding='utf-8') as f:
        json.dump(self.docs, f, ensure_ascii=False, indent=2)

    # Save the embeddings (assumes every vector has the same length)
    embeddings_file = os.path.join(db_path, 'embeddings.pkl')
    embeddings_array = np.array(self.embeddings)
    with open(embeddings_file, 'wb') as f:
        pickle.dump(embeddings_array, f)

    # Save the document IDs
    doc_ids_file = os.path.join(db_path, 'doc_ids.json')
    with open(doc_ids_file, 'w', encoding='utf-8') as f:
        json.dump(self.doc_ids, f, ensure_ascii=False)

    # Save the index file that records the other file names
    index_file = os.path.join(db_path, 'db_index.json')
    index_data = {
        'total_docs': len(self.docs),
        'docs_file': 'docs.json',
        'embeddings_file': 'embeddings.pkl',
        'doc_ids_file': 'doc_ids.json'
    }
    with open(index_file, 'w', encoding='utf-8') as f:
        json.dump(index_data, f, ensure_ascii=False, indent=2)

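3. Database Extraction

In the database extraction module, the RAGDatabaseExtractor class loads the vector database from disk and provides several lookup methods: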

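class RAGDatabaseExtractor:
    """Loads a saved RAG database and exposes lookup methods."""

    def __init__(self, db_path: str = DB_PATH):
        """
        Initialize the extractor and load the database.

        Args:
            db_path: Path to the database directory.
        """
        self.db_path = db_path
        self.docs = []
        self.embeddings = None
        self.doc_ids = []

        # Load the database from disk
        self.load()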
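The load method called in the constructor is not shown in the article. A minimal sketch of what it could look like, assuming it simply reverses save and reads the file names from db_index.json. The function name `load_database` and the round-trip demo are illustrative, not the project's code.

```python
import json
import os
import pickle
import tempfile
from typing import Any, Dict, List, Tuple

import numpy as np


def load_database(db_path: str) -> Tuple[List[Dict[str, Any]], np.ndarray, List[str]]:
    """Read docs, embeddings and doc IDs using the names in db_index.json."""
    with open(os.path.join(db_path, "db_index.json"), encoding="utf-8") as f:
        index = json.load(f)

    with open(os.path.join(db_path, index["docs_file"]), encoding="utf-8") as f:
        docs = json.load(f)
    # pickle can execute arbitrary code: only load files this project wrote.
    with open(os.path.join(db_path, index["embeddings_file"]), "rb") as f:
        embeddings = pickle.load(f)
    with open(os.path.join(db_path, index["doc_ids_file"]), encoding="utf-8") as f:
        doc_ids = json.load(f)

    if len(docs) != len(embeddings):
        raise ValueError("docs and embeddings are out of sync")
    return docs, embeddings, doc_ids


# Round-trip demo: write a tiny database in the layout used by save(), then load it.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "docs.json"), "w", encoding="utf-8") as f:
        json.dump([{"doc_id": "a", "title": "Button"}], f)
    with open(os.path.join(tmp, "embeddings.pkl"), "wb") as f:
        pickle.dump(np.array([[0.1, 0.2]]), f)
    with open(os.path.join(tmp, "doc_ids.json"), "w", encoding="utf-8") as f:
        json.dump(["a"], f)
    with open(os.path.join(tmp, "db_index.json"), "w", encoding="utf-8") as f:
        json.dump({"total_docs": 1, "docs_file": "docs.json",
                   "embeddings_file": "embeddings.pkl",
                   "doc_ids_file": "doc_ids.json"}, f)
    docs, embeddings, doc_ids = load_database(tmp)
```

The length check is a cheap way to catch a database where documents and vectors were saved out of step, since retrieval relies on their positions matching.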

The class provides several lookup methods, including the following.

Get a document by ID:

def get_doc_by_id(self, doc_id: str) -> Dict[str, Any]:
    """
    Get a document by its ID.

    Args:
        doc_id: The document ID.

    Returns:
        The document data, or an empty dict if no document has this ID.
    """
    for doc in self.docs:
        if doc.get('doc_id') == doc_id:
            return doc

    return {}
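get_doc_by_id scans the whole list on every call, which is fine for a small corpus. If lookups by ID become frequent, one option is a dictionary built once after loading. A minimal sketch; `build_id_index` and the sample records are hypothetical.

```python
from typing import Any, Dict, List


def build_id_index(docs: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Map doc_id -> document; documents without a doc_id are left out."""
    return {doc["doc_id"]: doc for doc in docs if doc.get("doc_id")}


docs = [
    {"doc_id": "a", "title": "Button"},
    {"doc_id": "b", "title": "Slider"},
]
id_index = build_id_index(docs)

# Same contract as get_doc_by_id: an empty dict when the ID is unknown.
found = id_index.get("b", {})
missing = id_index.get("zzz", {})
```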

Get a document by index:

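def get_doc_by_index(self, index: int) -> Dict[str, Any]:
    """
    Get a document by its position in the list.

    Args:
        index: The document index.

    Returns:
        The document data, or an empty dict if the index is out of range.
    """
    if 0 <= index < len(self.docs):
        return self.docs[index]

    return {}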

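4. Query Engine

The RAGQueryEngine class in the query engine module is the core of PureRAG. It processes the user's query, retrieves relevant documents, and generates the answer.

4.1 Retrieving Relevant Documents

Relevant documents are found by computing the similarity between the query vector and every document vector: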

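def retrieve_relevant_docs(self, query: str, top_k: int = 3) -> List[Dict[str, Any]]:
    """
    Retrieve the documents most relevant to a query.

    Args:
        query: The query text.
        top_k: Number of documents to return.

    Returns:
        The most relevant documents, best match first.
    """
    # Embed the query
    query_embedding = self.get_query_embedding(query)

    if not query_embedding or self.db.embeddings is None:
        print("Warning: could not embed the query, or the database has no embeddings")
        return []

    # Compute the similarity between the query and every document
    similarities = self.compute_similarity(query_embedding, self.db.embeddings)

    # Indices of the top_k highest scores, in descending order
    top_indices = np.argsort(similarities)[-top_k:][::-1]

    # Copy each document so the score is not written into the stored records
    relevant_docs = [dict(self.db.get_doc_by_index(i)) for i in top_indices]

    # Attach the similarity score
    for i, doc in enumerate(relevant_docs):
        doc['similarity'] = float(similarities[top_indices[i]])

    return relevant_docs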
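compute_similarity is not shown in the article. Cosine similarity is a common choice for embedding retrieval, so the sketch below assumes that metric; the function name `cosine_similarities` and the toy vectors are illustrative only.

```python
from typing import Sequence

import numpy as np


def cosine_similarities(query: Sequence[float], doc_matrix: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and each row of doc_matrix."""
    q = np.asarray(query, dtype=np.float64)
    d = np.asarray(doc_matrix, dtype=np.float64)
    q_norm = np.linalg.norm(q)
    d_norms = np.linalg.norm(d, axis=1)
    # Guard against division by zero for all-zero vectors.
    denom = np.maximum(q_norm * d_norms, 1e-12)
    return (d @ q) / denom


doc_matrix = np.array([
    [1.0, 0.0],   # same direction as the query
    [0.0, 1.0],   # orthogonal to the query
    [-1.0, 0.0],  # opposite direction
])
scores = cosine_similarities([2.0, 0.0], doc_matrix)
```

The three rows score 1, 0, and -1 respectively, which shows that the measure depends on direction only, not on the length of the query vector.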

4.2 Answer Generation

Answers are generated with the DeepSeek-R1 large language model:

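def generate_answer(self, query: str, relevant_docs: List[Dict[str, Any]]) -> str:
    """
    Generate an answer from the retrieved documents.

    Args:
        query: The query text.
        relevant_docs: The retrieved documents.

    Returns:
        The generated answer.
    """
    # Build the context from the retrieved documents
    context = ""
    for i, doc in enumerate(relevant_docs):
        doc_content = self.extract_doc_content(doc)
        context += f"Document {i+1}:\n{doc_content}\n\n"

    prompt = f"""Answer the user's question based on the API documentation below. If the documentation does not contain the relevant information, say plainly that you cannot answer.

Documentation:
{context}

User question: {query}

Give an accurate, concise answer and quote the relevant parts of the documentation where possible."""

    # Generate the answer with the DeepSeek-R1 model
    messages = [
        {"role": "user", "content": prompt}
    ]

    answer = self.client.chat_completion(messages)

    return answer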
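Concatenating every retrieved document can produce a prompt longer than the model accepts. One way to bound it is a simple character budget applied while the context is built. This is a sketch under assumptions: `build_context`, `max_chars`, and the default budget are hypothetical, and a real system would more likely count tokens than characters.

```python
from typing import List


def build_context(doc_texts: List[str], max_chars: int = 8000) -> str:
    """Join documents in rank order, truncating once the budget is used up.

    The blank lines between documents are not counted against the budget.
    """
    parts = []
    remaining = max_chars
    for i, text in enumerate(doc_texts, start=1):
        header = f"Document {i}:\n"
        available = remaining - len(header)
        if available <= 0:
            break  # no room left for another document
        body = text[:available]
        parts.append(header + body)
        remaining -= len(header) + len(body)
    return "\n\n".join(parts)


context = build_context(["a" * 50, "b" * 50, "c" * 50], max_chars=100)
```

Because documents arrive sorted by similarity, truncating from the end drops the least relevant material first.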

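4.3 Extracting Document Content

To give the language model a clearer view of each document, the extract_doc_content method flattens a document's structured content into text: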

def extract_doc_content(self, doc: Dict[str, Any]) -> str:
    """
    Extract a document's content as text, including tables and code blocks.

    Args:
        doc: The document data.

    Returns:
        The document content as a single string.
    """
    title = doc.get('title', '')
    overview = doc.get('overview', '')

    content = f"Title: {title}\n\nOverview: {overview}\n\n"

    # Append the content of each section
    for i, section in enumerate(doc.get('sections', [])):
        section_title = section.get('title', '')
        section_content = section.get('content', '')

        content += f"Section {i+1}: {section_title}\n{section_content}\n\n"

        # Append tables, kept as HTML
        if section.get('tables'):
            content += "Tables (HTML):\n"
            for table_html in section.get('tables', []):
                content += f"{table_html}\n\n"

        # Append code blocks, kept as HTML
        if section.get('code_blocks'):
            content += "Code examples (HTML):\n"
            for code_html in section.get('code_blocks', []):
                content += f"{code_html}\n\n"

    return content

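5. Retriever Interface

The PureRAGRetriever class in the retriever module exposes a retrieval interface compatible with APIQASystem, so that PureRAG can be plugged into the other components of the project: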

def retrieve(self, query: str, top_k: int = 3) -> List[Dict[str, Any]]:
    """
    Retrieve the documents most relevant to a query.

    Args:
        query: The query text.
        top_k: Number of documents to return.

    Returns:
        The most relevant documents, in the format expected by APIQASystem.
    """
    # Embed the query
    query_embedding = self.get_query_embedding(query)

    if not query_embedding or self.db.embeddings is None:
        print("Warning: could not embed the query, or the database has no embeddings")
        return []

    # Compute the similarity between the query and every document
    similarities = self.compute_similarity(query_embedding, self.db.embeddings)

    # Indices of the top_k highest scores, in descending order
    top_indices = np.argsort(similarities)[-top_k:][::-1]

    # Fetch the corresponding documents
    relevant_docs = [self.db.get_doc_by_index(i) for i in top_indices]

    # Convert to the APIQASystem-compatible format
    results = []
    for i, doc in enumerate(relevant_docs):
        # Flatten the document into text
        content = self.extract_doc_content(doc)

        # Build the compatible record
        result = {
            "label": doc.get("title", ""),
            "text": content,
            "score": float(similarities[top_indices[i]]),
            "node_type": "API文档",  # literal value kept as-is; it means "API document"
            "doc_id": doc.get("doc_id", "")
        }
        results.append(result)

    return results
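retrieve and retrieve_relevant_docs share the same ranking step. Note that the slice `[-top_k:]` returns every index when top_k is 0, so a shared helper that validates top_k is one way to tidy this up. A minimal sketch; `top_k_indices` is a hypothetical name, not part of the project.

```python
import numpy as np


def top_k_indices(similarities: np.ndarray, top_k: int) -> np.ndarray:
    """Indices of the top_k largest scores, highest score first."""
    if top_k <= 0:
        return np.array([], dtype=int)
    k = min(top_k, len(similarities))
    # argsort is ascending, so take the last k entries and reverse them.
    return np.argsort(similarities)[-k:][::-1]


scores = np.array([0.2, 0.9, 0.5, 0.7])
best_two = top_k_indices(scores, 2)
none = top_k_indices(scores, 0)
everything = top_k_indices(scores, 10)
```

For very large collections, `np.argpartition` avoids a full sort, but for a documentation corpus of modest size the plain argsort is simpler and fast enough.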

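6. System Workflow

The complete PureRAG workflow is as follows:

Database construction stage:
- Read the API documentation JSON files
- Convert the documents into vectors with the BGE-M3 model
- Save the document content and vectors to the database
Query processing stage:
- Receive the user's query
- Convert the query into a vector with the BGE-M3 model
- Compute the similarity between the query vector and the document vectors
- Retrieve the most relevant documents
- Extract the document content
- Generate the answer with the DeepSeek-R1 model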
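The two stages above can be sketched end to end without any network access. Everything in this example is a stand-in chosen for illustration: `toy_embed` replaces the BGE-M3 call with simple word counts over a fixed vocabulary, the three documents are invented, and the final generation step is left as a comment.

```python
import math
from typing import Dict, List, Tuple

VOCAB = ["button", "click", "slider", "value", "network", "request"]


def toy_embed(text: str) -> List[float]:
    """Stand-in for BGE-M3: counts occurrences of a few fixed words."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Construction stage: embed every document once and keep the vectors.
documents: Dict[str, str] = {
    "doc-button": "button component that reacts to a click",
    "doc-slider": "slider component for picking a value",
    "doc-http": "send a network request and read the response",
}
database: List[Tuple[str, List[float]]] = [
    (doc_id, toy_embed(text)) for doc_id, text in documents.items()
]


def query_top_k(query: str, top_k: int = 2) -> List[Tuple[str, float]]:
    """Query stage: embed the query, score every document, return the best."""
    query_vector = toy_embed(query)
    scored = [(doc_id, cosine(query_vector, vec)) for doc_id, vec in database]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


results = query_top_k("how do I handle a button click")
# The retrieved text would then be placed in the prompt sent to the language model.
```

The same embedding function is used for documents and queries, which is what makes their vectors comparable; swapping `toy_embed` for a real model changes the quality of the match but not the shape of the pipeline.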