pgvector Hybrid Search: Combining Vector Search with Full-Text Search

pgvector: Open-source vector similarity search for Postgres. Project URL: https://gitcode.com/GitHub_Trending/pg/pgvector

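Introduction: Challenges and Opportunities in Modern Search

Keyword-based text search alone often falls short of what users expect from information retrieval. Consider a user who enters "AI application case studies": a keyword engine may return many documents containing those terms without recognizing that the user wants concrete AI use cases rather than theoretical overviews. Hybrid search addresses this by combining semantic understanding with the precision of keyword matching.

pgvector is an open-source vector similarity search extension for PostgreSQL. Combined with PostgreSQL's built-in full-text search, it lets developers build search systems that capture semantics while retaining keyword precision.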

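Core Concepts of Hybrid Search

What Is Hybrid Search?

Hybrid search is a retrieval method that combines two or more search techniques, typically:

  • Vector search: ranks results by semantic similarity, so it can capture the meaning behind a query
  • Full-text search: traditional keyword matching, which provides exact text matches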

[Diagram: Mermaid source not preserved in this copy]

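Why Hybrid Search?

| Search type | Strengths | Limitations |
|---|---|---|
| Vector search | Semantic understanding, synonym handling, multilingual support | May miss important keywords |
| Full-text search | Exact matching, Boolean logic, phrase search | Cannot interpret semantic context |

By combining the strengths of both, hybrid search produces more complete and accurate results.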

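Integration Architecture: pgvector and Full-Text Search

System Architecture Design

[Diagram: system architecture; Mermaid source not preserved in this copy]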

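Hands-On: Building a Hybrid Search System

Environment Setup and Installation

First, make sure the pgvector extension is installed:

-- Create the extension (a no-op if it already exists)
CREATE EXTENSION IF NOT EXISTS vector;

-- Verify that the extension is installed
SELECT * FROM pg_extension WHERE extname = 'vector';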

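Table Design

Design an articles table that supports hybrid search:

CREATE TABLE articles (
    id BIGSERIAL PRIMARY KEY,
    title TEXT NOT NULL,
    content TEXT NOT NULL,
    -- Full-text search column
    search_content TSVECTOR,
    -- Embedding column; the dimension must match the embedding model
    -- (384 for the all-MiniLM-L6-v2 model used below; use 1536 for OpenAI-compatible embeddings)
    embedding VECTOR(384),
    -- Metadata columns
    category_id INTEGER,
    publish_date DATE,
    author TEXT,
    -- Foreign key (assumes a categories table already exists)
    CONSTRAINT fk_category FOREIGN KEY (category_id) REFERENCES categories(id)
);

-- GIN index for full-text search
CREATE INDEX idx_articles_search_content ON articles USING GIN(search_content);

-- HNSW index for vector search (cosine distance)
CREATE INDEX idx_articles_embedding ON articles USING hnsw (embedding vector_cosine_ops);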

Data Preprocessing and Index Construction

Building the Full-Text Search Index
-- Trigger function that keeps the tsvector column up to date
-- (title gets weight A, content weight B, author weight C)
CREATE OR REPLACE FUNCTION articles_search_content_trigger() RETURNS trigger AS $$
begin
    new.search_content := 
        setweight(to_tsvector('english', coalesce(new.title, '')), 'A') ||
        setweight(to_tsvector('english', coalesce(new.content, '')), 'B') ||
        setweight(to_tsvector('english', coalesce(new.author, '')), 'C');
    return new;
end
$$ LANGUAGE plpgsql;

-- Create the trigger
CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE 
ON articles FOR EACH ROW EXECUTE FUNCTION articles_search_content_trigger();

-- Backfill search_content for existing rows (the no-op update fires the trigger)
UPDATE articles SET title = title;

Processing Vector Data

Assuming we use Python to generate the embeddings:

import psycopg2
from sentence_transformers import SentenceTransformer

# Load the model. all-MiniLM-L6-v2 produces 384-dimensional vectors,
# so the embedding column must be declared with the same dimension.
model = SentenceTransformer('all-MiniLM-L6-v2')

def update_article_embeddings():
    conn = psycopg2.connect("dbname=your_db user=your_user")
    cur = conn.cursor()
    
    # Fetch the rows that still need an embedding
    cur.execute("SELECT id, content FROM articles WHERE embedding IS NULL")
    articles = cur.fetchall()
    
    for article_id, content in articles:
        # Generate the embedding vector
        embedding = model.encode(content)
        
        # Write it back; psycopg2 sends the list as an array,
        # which pgvector casts to vector on assignment
        cur.execute(
            "UPDATE articles SET embedding = %s WHERE id = %s",
            (embedding.tolist(), article_id)
        )
    
    conn.commit()
    cur.close()
    conn.close()
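
The queries in the following sections take the query embedding through `%s` placeholders. One way to bind it without extra adapters is to send the vector in pgvector's text form (`[x1,x2,...]`) and cast it with `::vector`. The helper below is a minimal sketch of that formatting step; the function name `to_pgvector_literal` is hypothetical and not part of pgvector or psycopg2.

```python
import math
from typing import Sequence


def to_pgvector_literal(values: Sequence[float]) -> str:
    """Format numbers in pgvector's text input form, e.g. "[0.1,2.0,-3.5]"."""
    if len(values) == 0:
        # pgvector rejects vectors with zero dimensions
        raise ValueError("a vector needs at least one dimension")
    floats = [float(v) for v in values]
    if not all(math.isfinite(v) for v in floats):
        # pgvector rejects NaN and infinite elements
        raise ValueError("vector elements must be finite")
    return "[" + ",".join(repr(v) for v in floats) + "]"


# Hypothetical usage with an open psycopg2 cursor `cur` and a query embedding `vec`:
#   cur.execute(
#       "SELECT id FROM articles ORDER BY embedding <=> %s::vector LIMIT 10",
#       (to_pgvector_literal(vec),),
#   )
```

When a query contains several `%s` placeholders for the same embedding, the same string has to be passed once per placeholder.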

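Implementing Hybrid Search Queries

Basic Hybrid Search Query
-- A simple hybrid search query
-- Every %s is bound to the same query embedding; append ::vector to each
-- placeholder if your driver sends the value as a plain array or string.
-- plainto_tsquery uses default_text_search_config here; pass 'english' explicitly
-- if your default differs from the configuration used in the trigger.
SELECT 
    id,
    title,
    content,
    -- Full-text relevance score
    ts_rank_cd(search_content, plainto_tsquery('artificial intelligence')) AS text_score,
    -- Vector similarity score (cosine similarity = 1 - cosine distance)
    1 - (embedding <=> %s) AS vector_score,
    -- Combined score (weighted sum)
    (0.4 * ts_rank_cd(search_content, plainto_tsquery('artificial intelligence')) + 
     0.6 * (1 - (embedding <=> %s))) AS combined_score
FROM articles
WHERE 
    -- Full-text condition
    search_content @@ plainto_tsquery('artificial intelligence') OR
    -- Vector condition (cosine distance threshold)
    (embedding <=> %s) < 0.3
ORDER BY combined_score DESC
LIMIT 10;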
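
To make the scoring in this query concrete, the sketch below reproduces it in plain Python: `cosine_distance` computes what the `<=>` operator returns under cosine distance, and `combined_score` applies the 0.4/0.6 weighting from the query. Both function names are illustrative, and the code assumes non-zero vectors of equal length.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance 1 - cos(a, b); ranges from 0 (same direction) to 2 (opposite)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Assumes neither vector is all zeros (the norms would be 0)
    return 1.0 - dot / (norm_a * norm_b)


def combined_score(text_score: float, distance: float,
                   w_text: float = 0.4, w_vector: float = 0.6) -> float:
    """Weighted sum of the full-text score and the cosine similarity (1 - distance)."""
    return w_text * text_score + w_vector * (1.0 - distance)


# Same direction -> distance 0; orthogonal -> 1; opposite -> 2
print(cosine_distance([1.0, 0.0], [1.0, 0.0]))
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))
```

Because the distance can reach 2, the similarity term can be negative, and `ts_rank_cd` is not bounded to a fixed range either, so the two components are not on a common scale unless they are normalized first.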

Advanced Hybrid Search Strategies

Strategy 1: Reciprocal Rank Fusion (RRF)

WITH text_results AS (
    SELECT 
        id,
        title,
        content,
        ts_rank_cd(search_content, plainto_tsquery('ai applications')) AS rank,
        ROW_NUMBER() OVER (ORDER BY ts_rank_cd(search_content, plainto_tsquery('ai applications')) DESC) AS text_rank
    FROM articles
    WHERE search_content @@ plainto_tsquery('ai applications')
    ORDER BY text_rank  -- without ORDER BY, LIMIT would keep an arbitrary 50 rows
    LIMIT 50
),
vector_results AS (
    SELECT 
        id,
        title,
        content,
        1 - (embedding <=> %s) AS similarity,
        ROW_NUMBER() OVER (ORDER BY embedding <=> %s) AS vector_rank
    FROM articles
    WHERE embedding <=> %s < 0.4
    ORDER BY vector_rank  -- without ORDER BY, LIMIT would keep an arbitrary 50 rows
    LIMIT 50
),
combined_results AS (
    SELECT 
        COALESCE(t.id, v.id) AS id,
        COALESCE(t.title, v.title) AS title,
        COALESCE(t.content, v.content) AS content,
        COALESCE(1.0 / (60 + t.text_rank), 0) AS text_rrf,
        COALESCE(1.0 / (60 + v.vector_rank), 0) AS vector_rrf,
        COALESCE(1.0 / (60 + t.text_rank), 0) + COALESCE(1.0 / (60 + v.vector_rank), 0) AS combined_rrf
    FROM text_results t
    FULL OUTER JOIN vector_results v ON t.id = v.id
)
SELECT 
    id,
    title,
    content,
    combined_rrf AS score
FROM combined_results
ORDER BY combined_rrf DESC
LIMIT 20;
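
The same fusion can be done in application code when the two result lists come from separate queries. The sketch below is one minimal way to write it, assuming each input is a list of document IDs ordered best-first; the function name `reciprocal_rank_fusion` is illustrative, and `k=60` mirrors the constant used in the SQL.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[Hashable]], k: int = 60
) -> List[Tuple[Hashable, float]]:
    """Fuse ranked ID lists: each list adds 1 / (k + rank) to a document's score."""
    scores: Dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; documents missing from a list simply get no term from it
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


text_ids = [1, 2, 3]     # hypothetical full-text result, best first
vector_ids = [3, 1, 4]   # hypothetical vector result, best first
fused = reciprocal_rank_fusion([text_ids, vector_ids])
print([doc_id for doc_id, _ in fused])  # -> [1, 3, 2, 4]
```

RRF uses only rank positions, so it sidesteps the problem of combining scores that live on different scales.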

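Strategy 2: Weighted Score Fusion

SELECT 
    id,
    title,
    content,
    -- Min-max normalized full-text score; NULLIF avoids division by zero and
    -- GREATEST clamps rows without a text match (and NULL results) to 0
    GREATEST((ts_rank_cd(search_content, plainto_tsquery('machine learning')) - min_text) / NULLIF(max_text - min_text, 0), 0) AS normalized_text_score,
    -- Cosine similarity (ranges from -1 to 1, not 0 to 1)
    1 - (embedding <=> %s) AS vector_score,
    -- Weighted combined score
    (0.3 * GREATEST((ts_rank_cd(search_content, plainto_tsquery('machine learning')) - min_text) / NULLIF(max_text - min_text, 0), 0) + 
     0.7 * (1 - (embedding <=> %s))) AS final_score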
FROM articles,
LATERAL (
    SELECT 
        MIN(ts_rank_cd(search_content, plainto_tsquery('machine learning'))) AS min_text,
        MAX(ts_rank_cd(search_content, plainto_tsquery('machine learning'))) AS max_text
    FROM articles
    WHERE search_content @@ plainto_tsquery('machine learning')
) AS stats
WHERE 
    search_content @@ plainto_tsquery('machine learning') OR
    embedding <=> %s < 0.35
ORDER BY final_score DESC
LIMIT 15;
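
Min-max normalization needs a guard for the case where every score is identical, since the range is then zero. The sketch below shows the normalization on its own, under the assumption that scores are rescaled in application code; `min_max_normalize` is an illustrative name, and mapping the zero-range case to 0.0 is one possible convention.

```python
from typing import List, Sequence


def min_max_normalize(scores: Sequence[float]) -> List[float]:
    """Rescale scores to [0, 1]; returns all zeros when every score is equal."""
    if not scores:
        return []
    lowest, highest = min(scores), max(scores)
    if highest == lowest:
        # Zero range: dividing would fail, so fall back to a constant
        return [0.0] * len(scores)
    return [(s - lowest) / (highest - lowest) for s in scores]


print(min_max_normalize([1.0, 2.0, 3.0]))  # -> [0.0, 0.5, 1.0]
print(min_max_normalize([5.0, 5.0]))       # -> [0.0, 0.0]
```

Min-max scaling is sensitive to outliers: a single very high rank compresses all other scores toward zero, which is one reason rank-based fusion is often preferred.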

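Performance Optimization Strategies

Index Optimization
-- No single index spans both the tsvector and the vector column: keep the GIN and
-- HNSW indexes separate, and add a B-tree index for common metadata filters
CREATE INDEX idx_articles_category_date ON articles (category_id, publish_date);

-- Tune HNSW build parameters for better recall (slower build); drop
-- idx_articles_embedding first to avoid maintaining two HNSW indexes on one column
CREATE INDEX idx_articles_embedding_optimized ON articles 
USING hnsw (embedding vector_cosine_ops) 
WITH (m = 16, ef_construction = 200);

-- Query-time parameters
SET hnsw.ef_search = 100;  -- HNSW: larger values improve recall at the cost of speed
SET ivfflat.probes = 20;   -- only relevant if an IVFFlat index is used instead of HNSW

Query Optimization Techniques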
-- Use CTEs to apply the metadata filters once, then run both searches on the filtered rows
WITH filtered_articles AS (
    SELECT id, title, content, search_content, embedding
    FROM articles
    WHERE 
        category_id = 5 AND
        publish_date > '2024-01-01'
),
text_search AS (
    SELECT id, ts_rank_cd(search_content, plainto_tsquery('deep learning')) AS score
    FROM filtered_articles
    WHERE search_content @@ plainto_tsquery('deep learning')
),
vector_search AS (
    SELECT id, 1 - (embedding <=> %s) AS score
    FROM filtered_articles
    WHERE embedding <=> %s < 0.4
)
SELECT 
    f.id,
    f.title,
    COALESCE(t.score, 0) AS text_score,
    COALESCE(v.score, 0) AS vector_score,
    (COALESCE(t.score, 0) * 0.4 + COALESCE(v.score, 0) * 0.6) AS combined_score
FROM filtered_articles f
LEFT JOIN text_search t ON f.id = t.id
LEFT JOIN vector_search v ON f.id = v.id
WHERE t.id IS NOT NULL OR v.id IS NOT NULL
ORDER BY combined_score DESC
LIMIT 20;

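Real-World Application Scenarios

Scenario 1: Technical Documentation Search
-- Search for technical documents related to "neural network optimization"
SELECT 
    id,
    title,
    -- Highlight the matched keywords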
    ts_headline('english', content, plainto_tsquery('neural network optimization'), 'StartSel=<mark>, StopSel=</mark>') AS highlighted_content,
    ts_rank_cd(search_content, plainto_tsquery('neural network optimization')) AS text_relevance,
    1 - (embedding <=> %s) AS semantic_similarity,
    (0.4 * ts_rank_cd(search_content, plainto_tsquery('neural network optimization')) + 
     0.6 * (1 - (embedding <=> %s))) AS overall_score
FROM technical_docs
WHERE 
    (search_content @@ plainto_tsquery('neural network optimization') AND
     ts_rank_cd(search_content, plainto_tsquery('neural network optimization')) > 0.1) OR
    embedding <=> %s < 0.3
ORDER BY overall_score DESC
LIMIT 10;
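
Scenario 2: E-Commerce Product Search
-- Search for products related to "eco-friendly smartphone"
-- (plainto_tsquery ANDs the terms itself and ignores operators such as &)
WITH text_products AS (
    SELECT 
        product_id,
        name,
        description,
        ts_rank_cd(search_content, plainto_tsquery('eco-friendly smartphone')) AS text_score,
        RANK() OVER (ORDER BY ts_rank_cd(search_content, plainto_tsquery('eco-friendly smartphone')) DESC) AS text_rank
    FROM products
    WHERE search_content @@ plainto_tsquery('eco-friendly smartphone')
    AND category = 'electronics'
    ORDER BY text_rank  -- without ORDER BY, LIMIT would keep an arbitrary 50 rows
    LIMIT 50
),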
),
semantic_products AS (
    SELECT 
        product_id,
        name,
        description,
        1 - (embedding <=> %s) AS semantic_score,
        RANK() OVER (ORDER BY embedding <=> %s) AS semantic_rank
    FROM products
    WHERE embedding <=> %s < 0.35
    AND category = 'electronics'
    ORDER BY semantic_rank  -- without ORDER BY, LIMIT would keep an arbitrary 50 rows
    LIMIT 50
),
combined AS (
    SELECT 
        COALESCE(t.product_id, s.product_id) AS product_id,
        COALESCE(t.name, s.name) AS name,
        COALESCE(t.description, s.description) AS description,
        COALESCE(1.0 / (50 + t.text_rank), 0) AS text_rrf,
        COALESCE(1.0 / (50 + s.semantic_rank), 0) AS semantic_rrf,
        COALESCE(1.0 / (50 + t.text_rank), 0) + COALESCE(1.0 / (50 + s.semantic_rank), 0) AS final_score
    FROM text_products t
    FULL OUTER JOIN semantic_products s ON t.product_id = s.product_id
)
SELECT 
    product_id,
    name,
    description,
    final_score
FROM combined
ORDER BY final_score DESC
LIMIT 20;
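
This query ranks with `RANK()` rather than `ROW_NUMBER()`, so tied scores share a rank and the following rank is skipped. The sketch below imitates that behavior in Python to show how ties feed into the RRF term with the constant 50 used here; `competition_ranks` is an illustrative name, not a library function.

```python
from typing import List, Sequence


def competition_ranks(scores: Sequence[float]) -> List[int]:
    """Rank scores in descending order like SQL RANK(): ties share a rank, gaps follow."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    current_rank = 0
    previous_score = None
    for position, index in enumerate(order, start=1):
        if previous_score is None or scores[index] != previous_score:
            current_rank = position  # a new score value starts a new rank
            previous_score = scores[index]
        ranks[index] = current_rank
    return ranks


scores = [0.9, 0.5, 0.9, 0.1]              # hypothetical relevance scores
ranks = competition_ranks(scores)
print(ranks)                               # -> [1, 3, 1, 4]
rrf_terms = [1.0 / (50 + r) for r in ranks]
```

With `RANK()`, both top-scoring products receive the same RRF contribution, whereas `ROW_NUMBER()` would break the tie arbitrarily.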

Performance Monitoring and Tuning

Query Performance Analysis

-- Profile a hybrid search query
EXPLAIN ANALYZE
SELECT 
    id,
    title,
    ts_rank_cd(search_content, plainto_tsquery('artificial intelligence')) AS text_score,
    1 - (embedding <=> %s) AS vector_score
FROM articles
WHERE 
    search_content @@ plainto_tsquery('artificial intelligence') OR
    embedding <=> %s < 0.3
ORDER BY (0.4 * ts_rank_cd(search_content, plainto_tsquery('artificial intelligence')) + 
          0.6 * (1 - (embedding <=> %s))) DESC
LIMIT 10;

-- Check index usage (pg_stat_user_indexes exposes relname and indexrelname)
SELECT 
    indexrelname,
    idx_scan,
    idx_tup_read,
    idx_tup_fetch
FROM pg_stat_user_indexes 
WHERE relname = 'articles';

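Resource Configuration

-- Adjust memory settings for the current session
SET work_mem = '64MB';
SET maintenance_work_mem = '1GB';
SET effective_cache_size = '4GB';

-- Monitor query statistics (requires the pg_stat_statements extension;
-- on PostgreSQL 12 and earlier the columns are named total_time and mean_time)
SELECT 
    query,
    calls,
    total_exec_time,
    mean_exec_time,
    rows
FROM pg_stat_statements
WHERE query LIKE '%embedding%' OR query LIKE '%ts_rank%'
ORDER BY total_exec_time DESC
LIMIT 10;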

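Best Practices and Troubleshooting

Best Practices

  1. Data preprocessing

    • Keep text data clean and strip irrelevant characters
    • Use one embedding model, dimension, and preprocessing pipeline for all vectors
    • Refresh the full-text and vector indexes regularly
  2. Weight tuning

    • Adjust the ratio of vector-search to full-text weights to fit business requirements
    • Use A/B testing to determine the best parameter configuration
  3. Performance monitoring

    • Analyze query performance regularly
    • Monitor index usage
    • Tune database configuration parameters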
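
Before running an A/B test, a coarse offline sweep over the weights can narrow down the candidates. The sketch below is one way to do that, assuming a small labeled set where each query has candidate documents with precomputed text and vector scores and one known relevant document. The data, the function names, and the choice of mean reciprocal rank (MRR) as the metric are all illustrative assumptions.

```python
from typing import List, Tuple

Candidate = Tuple[str, float, float]        # (doc_id, text_score, vector_score)
LabeledQuery = Tuple[List[Candidate], str]  # (candidates, relevant_doc_id)


def mean_reciprocal_rank(queries: List[LabeledQuery], w_text: float) -> float:
    """MRR of the relevant document when ranking by w_text*text + (1-w_text)*vector."""
    w_vector = 1.0 - w_text
    total = 0.0
    for candidates, relevant_id in queries:
        ranked = sorted(candidates,
                        key=lambda c: w_text * c[1] + w_vector * c[2],
                        reverse=True)
        ids = [c[0] for c in ranked]
        if relevant_id in ids:
            total += 1.0 / (ids.index(relevant_id) + 1)
    return total / len(queries)


def sweep_weights(queries: List[LabeledQuery], steps: int = 10) -> Tuple[float, float]:
    """Try w_text = 0, 1/steps, ..., 1 and return the first (w_text, MRR) with the best MRR."""
    results = [(i / steps, mean_reciprocal_rank(queries, i / steps))
               for i in range(steps + 1)]
    return max(results, key=lambda r: r[1])


# Hypothetical labeled data: in both queries the relevant document scores higher on vectors
labeled = [
    ([("a", 0.9, 0.1), ("b", 0.1, 0.9)], "b"),
    ([("c", 0.8, 0.2), ("d", 0.3, 0.7)], "d"),
]
best_w_text, best_mrr = sweep_weights(labeled)
```

An offline sweep like this only suggests candidate weights; user-facing metrics from an A/B test remain the deciding evidence.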

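Common Problems and Solutions

| Problem | Symptoms | Solution |
|---|---|---|
| Degraded query performance | Longer response times, high CPU usage | Tune index parameters, increase memory settings |
| Inaccurate search results | Relevant documents rank low | Adjust the weights, improve the preprocessing pipeline |
| Out of memory | Query failures, memory errors | Increase work_mem, restructure queries |
| Index bloat | Index size grows quickly | Rebuild indexes periodically, use partial indexes |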

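Conclusion

Combining pgvector with PostgreSQL full-text search gives developers a practical toolkit for building semantic-aware search systems. With a hybrid strategy, a system keeps the precision of keyword search while gaining semantic understanding, which yields more accurate and relevant results for users.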

Disclosure: parts of this article were generated with AI assistance (AIGC) and are provided for reference only.
