AutoRAG Similarity Filtering: Optimizing the Cosine Similarity Threshold

[Free download link] AutoRAG: RAG AutoML Tool - Find optimal RAG pipeline for your own data. Project URL: https://gitcode.com/GitHub_Trending/au/AutoRAG

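Introduction: The Similarity Filtering Challenge in RAG Systems

In Retrieval Augmented Generation (RAG) systems, similarity filtering is a key step in ensuring retrieval quality. When a user query enters the system, the RAG pipeline retrieves relevant passages from the knowledge base, but not all retrieved results are equally relevant. Low-similarity passages not only fail to contribute useful information, they can also mislead the large language model (LLM) into generating inaccurate answers.

AutoRAG's Similarity Threshold Cutoff module addresses this problem. It computes the cosine similarity between the query and each retrieved passage and filters out low-relevance passages, so that only passages that clear the threshold move on to later stages of the pipeline.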

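Cosine Similarity: Mathematical Basis and Implementation

Formula

Cosine similarity measures how similar two vectors are by computing the cosine of the angle between them:

$$ \text{similarity} = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \times \|\mathbf{B}\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \times \sqrt{\sum_{i=1}^{n} B_i^2}} $$

where:

$\mathbf{A}$ and $\mathbf{B}$ are the embedding vectors of the query and the document, respectively
$\cdot$ denotes the vector dot product
$\|\cdot\|$ denotes the Euclidean norm of a vector

Implementation in AutoRAG

AutoRAG implements the cosine similarity computation with NumPy: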

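import numpy as np


def calculate_cosine_similarity(a, b):
    # Dot product divided by the product of the Euclidean norms.
    # Assumes neither vector is all zeros; otherwise the division is undefined.
    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    similarity = dot_product / (norm_a * norm_b)
    return similarity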
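
As a quick sanity check of the formula, the sketch below evaluates cosine similarity on a few hand-picked vectors. The helper name `safe_cosine_similarity` and its zero-vector guard are assumptions added for illustration; they are not taken from AutoRAG.

```python
import numpy as np


def safe_cosine_similarity(a, b):
    """Cosine similarity that returns 0.0 when either vector has zero norm.

    Hypothetical helper for illustration; the zero-norm guard is an
    assumption, not documented AutoRAG behavior.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0
    return float(np.dot(a, b) / denom)


parallel = safe_cosine_similarity([1.0, 2.0], [2.0, 4.0])    # same direction
orthogonal = safe_cosine_similarity([1.0, 0.0], [0.0, 1.0])  # perpendicular
opposite = safe_cosine_similarity([1.0, 0.0], [-1.0, 0.0])   # opposite direction
degenerate = safe_cosine_similarity([0.0, 0.0], [1.0, 1.0])  # zero vector
```

The first three cases land at (approximately) 1, 0, and -1, the extremes and midpoint of the cosine range. Scaling a vector does not change the result, which is why the measure suits embeddings of differing magnitude.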

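Workflow of Similarity Threshold Filtering

Process Flow Diagram

[Mermaid flowchart of the filtering process; diagram not preserved]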

Core Algorithm Logic

def _pure(self, queries, contents_list, scores_list, ids_list, threshold, batch=128):
    # Embed the queries and the passage contents
    query_embeddings, content_embeddings = embedding_query_content(
        queries, contents_list, self.embedding_model, batch
    )
    
    # Compute similarities and select the indices that pass the threshold
    remain_indices = list(map(
        lambda x: self.__row_pure(x[0], x[1], threshold),
        zip(query_embeddings, content_embeddings),
    ))
    
    # Build the filtered results from the surviving indices
    remain_content_list = list(map(lambda c, idx: [c[i] for i in idx], contents_list, remain_indices))
    remain_scores_list = list(map(lambda s, idx: [s[i] for i in idx], scores_list, remain_indices))
    remain_ids_list = list(map(lambda _id, idx: [_id[i] for i in idx], ids_list, remain_indices))
    
    return remain_content_list, remain_ids_list, remain_scores_list
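
The per-row step (`__row_pure`) is not shown above. A minimal sketch of what such a step could look like, on toy 2-D vectors, is given below. The function name `keep_indices`, the `>=` comparison, and the toy embeddings are all assumptions for illustration; the actual AutoRAG method may handle boundaries and empty results differently.

```python
import numpy as np


def keep_indices(query_embedding, content_embeddings, threshold):
    """Return indices of passages whose cosine similarity to the query is
    at least `threshold` (hypothetical stand-in for the per-row step)."""
    q = np.asarray(query_embedding, dtype=float)
    c = np.asarray(content_embeddings, dtype=float)
    # Cosine similarity of the query against every passage at once.
    sims = (c @ q) / (np.linalg.norm(c, axis=1) * np.linalg.norm(q))
    return [i for i, s in enumerate(sims) if s >= threshold]


# Toy 2-D "embeddings"; real ones would come from an embedding model.
query = [1.0, 0.0]
passages = ["on topic", "partly related", "off topic"]
passage_embeddings = [[0.9, 0.1], [0.6, 0.8], [0.0, 1.0]]

kept = keep_indices(query, passage_embeddings, threshold=0.8)
kept_passages = [passages[i] for i in kept]
```

With these toy vectors only the first passage clears a threshold of 0.8; lowering the threshold to 0.5 also keeps the second one.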

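Key Strategies for Threshold Optimization

1. Principles for Choosing a Threshold

Threshold range	Filtering effect	Suitable scenarios	Risk
0.9-1.0	Extremely strict	Specialized domains with high precision requirements	May filter out too many relevant passages
0.8-0.9	Strict	General business applications	Balances precision and recall
0.7-0.8	Moderate	Content-rich knowledge bases	May include some noise
0.6-0.7	Loose	Exploratory search	Includes many irrelevant results

These ranges are rough guides: score distributions differ between embedding models, so any threshold should be validated on your own data.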

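2. Adaptive Threshold Strategy

import numpy as np


# Adjust the threshold dynamically based on the score distribution
def adaptive_threshold_strategy(similarity_scores):
    mean_score = np.mean(similarity_scores)
    std_score = np.std(similarity_scores)

    # Set the threshold from summary statistics of the scores
    if std_score > 0.2:
        # Scores are widely spread: use a higher threshold to protect quality
        return min(0.85, mean_score + 0.5 * std_score)
    else:
        # Scores are tightly clustered: use a moderate threshold
        return max(0.7, mean_score - 0.2 * std_score)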
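
To see how the strategy behaves, the sketch below reproduces the function and applies it to three made-up score lists. The score values are hypothetical and chosen only to exercise each branch.

```python
import numpy as np


def adaptive_threshold_strategy(similarity_scores):
    """Same logic as the strategy above, repeated so this block runs alone."""
    mean_score = np.mean(similarity_scores)
    std_score = np.std(similarity_scores)
    if std_score > 0.2:
        return min(0.85, mean_score + 0.5 * std_score)
    return max(0.7, mean_score - 0.2 * std_score)


# Hypothetical score lists, not measurements from a real system.
spread_scores = [0.2, 0.5, 0.9, 0.95]    # widely spread
tight_scores = [0.80, 0.82, 0.84, 0.86]  # tightly clustered, high
low_scores = [0.50, 0.52, 0.54]          # tightly clustered, low

t_spread = adaptive_threshold_strategy(spread_scores)
t_tight = adaptive_threshold_strategy(tight_scores)
t_low = adaptive_threshold_strategy(low_scores)

# Count how many scores in each list would survive its own threshold.
survivors_low = sum(s >= t_low for s in low_scores)
```

Note the third case: the 0.7 floor sits above every score in the list, so nothing survives. A strategy like this likely needs a fallback (for example, always keeping the top passage) when scores are uniformly low.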

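3. Multi-Threshold Experiment Configuration

In AutoRAG's YAML configuration, the node strategy can be used to search for the best threshold automatically: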

node_lines:
  - node_line_name: filtering_pipeline
    nodes:
      - node_type: passage_filter
        strategy:
          metrics: [retrieval_f1, retrieval_precision, retrieval_recall]
          speed_threshold: 3
        modules:
          - module_type: similarity_threshold_cutoff
            threshold: 0.75
            embedding_model: openai
            batch: 64
          - module_type: similarity_threshold_cutoff
            threshold: 0.80
            embedding_model: openai
            batch: 64
          - module_type: similarity_threshold_cutoff
            threshold: 0.85
            embedding_model: openai
            batch: 64

Performance Optimization Tips

Batch Processing

# Batch the inputs to reduce the number of embedding-model calls
def embedding_query_content(queries, contents_list, embedding_model, batch_size):
    all_queries = []
    all_contents = []
    
    for query, contents in zip(queries, contents_list):
        all_queries.extend([query] * len(contents))
        all_contents.extend(contents)
    
    # Embed in batches
    query_embeddings = embedding_model.embed(all_queries, batch_size=batch_size)
    content_embeddings = embedding_model.embed(all_contents, batch_size=batch_size)
    
    return query_embeddings, content_embeddings
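
The effect of batching on the number of model calls can be illustrated with a stand-in embedder. In the sketch below, `embed_in_batches` and `fake_embed_batch` are hypothetical names, and the "embedding" is just the text length; no real embedding API is used.

```python
from typing import Callable, List

Vector = List[float]


def embed_in_batches(texts: List[str],
                     embed_batch: Callable[[List[str]], List[Vector]],
                     batch_size: int) -> List[Vector]:
    """Embed `texts` by calling `embed_batch` once per chunk of `batch_size`."""
    embeddings: List[Vector] = []
    for start in range(0, len(texts), batch_size):
        embeddings.extend(embed_batch(texts[start:start + batch_size]))
    return embeddings


calls: List[int] = []  # records the size of each batch sent to the "model"


def fake_embed_batch(batch: List[str]) -> List[Vector]:
    """Stand-in for a real embedding call: one 'request' per batch."""
    calls.append(len(batch))
    return [[float(len(text))] for text in batch]


texts = [f"passage {i}" for i in range(10)]
vectors = embed_in_batches(texts, fake_embed_batch, batch_size=4)
```

Ten texts with a batch size of 4 produce three calls instead of ten. One further saving worth considering: the flattening shown earlier repeats each query once per passage, so embedding each unique query once and reusing the vector would avoid redundant work.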

Memory Management

# Release GPU memory promptly when the module is destroyed
def __del__(self):
    del self.embedding_model
    empty_cuda_cache()  # release cached CUDA memory
    super().__del__()

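Practical Use Cases

Case 1: Technical Documentation Retrieval

Problem: in technical documentation retrieval, similarity scores are generally high (0.8-0.95), so fine-grained filtering is needed.

Solution:

modules:
  - module_type: similarity_threshold_cutoff
    threshold: 0.88  # higher threshold to protect technical accuracy
    embedding_model: text-embedding-3-large
    batch: 32  # smaller batches (batch size affects throughput and memory, not accuracy)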

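Case 2: Customer Support Q&A System

Problem: user questions are diverse, so recall and precision need to be balanced.

Solution:

modules:
  - module_type: similarity_threshold_cutoff
    threshold: 0.78  # moderate threshold
    embedding_model: all-MiniLM-L6-v2  # lightweight model
    batch: 128  # larger batches for throughput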

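Evaluation Metrics and Optimization Results

Key Evaluation Metrics

Metric	Formula	Optimization goal
Retrieval precision	$\frac{\text{relevant retrieved results}}{\text{total retrieved results}}$	Increase
Retrieval recall	$\frac{\text{relevant retrieved results}}{\text{total relevant documents}}$	Maintain
F1 score	$2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$	Optimize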
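
The metrics above can be computed directly for a set of candidate thresholds. The following is a minimal sketch on made-up data: the passage ids, scores, and relevance labels are hypothetical, and `precision_recall_f1` is an illustrative helper rather than AutoRAG's metric implementation.

```python
def precision_recall_f1(kept_ids, relevant_ids):
    """Precision, recall, and F1 of the kept passages against gold labels."""
    kept, relevant = set(kept_ids), set(relevant_ids)
    hits = len(kept & relevant)
    precision = hits / len(kept) if kept else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical similarity scores for five retrieved passages, plus gold labels.
scores = {"p1": 0.91, "p2": 0.84, "p3": 0.79, "p4": 0.72, "p5": 0.65}
relevant = {"p1", "p2", "p4"}

results = {}
for threshold in (0.75, 0.80, 0.85):
    kept = [pid for pid, s in scores.items() if s >= threshold]
    results[threshold] = precision_recall_f1(kept, relevant)

# Pick the threshold with the highest F1 score.
best_threshold = max(results, key=lambda t: results[t][2])
```

On this toy data, raising the threshold from 0.75 to 0.80 removes an irrelevant passage and improves F1, while 0.85 starts discarding relevant passages and F1 falls. Filtering can never raise recall, which is why the table lists recall as something to maintain rather than increase.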

Comparison of Optimization Results

[Mermaid chart comparing optimization results; diagram not preserved]

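Best Practice Recommendations

Data-driven threshold selection: choose the threshold from the actual score distribution of your data rather than a fixed value
Embedding model match: choose an embedding model suited to the domain, for example a model trained on technical text for technical documentation
Batch size tuning: adjust the batch size to the available hardware, using larger batches when GPU resources are plentiful
Multi-threshold experiments: configure several threshold modules in AutoRAG and let it find the best value automatically
Monitoring and adjustment: monitor filtering results continuously and adjust the threshold as business needs change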

Summary

Disclosure: parts of this article were generated with AI assistance (AIGC) and are provided for reference only.