For the basic concepts behind vectors, see: Vector Database Study Notes (1): Basic Concepts (CSDN blog)
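I. Introduction to pgvector
pgvector is an open-source vector similarity search extension for PostgreSQL. It lets you store vectors alongside the rest of your data in the same database. Its features include:
- Exact and approximate nearest neighbor search
- Single-precision, half-precision, binary, and sparse vectors
- Multiple distance metrics: L2 distance, inner product, cosine distance, L1 distance, Hamming distance, and Jaccard distance
- Usable from any language that has a Postgres client
- All of PostgreSQL's core strengths, inherited as-is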
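II. Installation and Enabling
https://github.com/pgvector/pgvector
Download and extract the source archive, then build and install it:
cd pgvector
make
make install
Enable the extension (note that it is named vector, not pgvector):
CREATE EXTENSION vector;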
III. Basic Usage
1. Creating tables; inserting, updating, and deleting
Create a new table with a vector column
CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));
Add a vector column to an existing table
ALTER TABLE items ADD COLUMN embedding vector(3);
Insert vectors
INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]');
Load vectors in bulk using COPY
COPY items (embedding) FROM STDIN WITH (FORMAT BINARY);
Upsert vectors
INSERT INTO items (id, embedding) VALUES (1, '[1,2,3]'), (2, '[4,5,6]')
ON CONFLICT (id) DO UPDATE SET embedding = EXCLUDED.embedding;
Update vectors
UPDATE items SET embedding = '[1,2,3]' WHERE id = 1;
Delete vectors
DELETE FROM items WHERE id = 1;
2. Vector queries
Get the nearest neighbors to a vector
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
Supported distance functions are:
- <-> : L2 distance
- <#> : (negative) inner product
- <=> : cosine distance
- <+> : L1 distance
- <~> : Hamming distance (binary vectors)
- <%> : Jaccard distance (binary vectors)
Get the nearest neighbors to a row
SELECT * FROM items WHERE id != 1 ORDER BY embedding <-> (SELECT embedding FROM items WHERE id = 1) LIMIT 5;
Get rows within a certain distance
SELECT * FROM items WHERE embedding <-> '[3,1,2]' < 5;
Note: Combine with ORDER BY and LIMIT to use an index
3. Distance queries
Get the distance
SELECT embedding <-> '[3,1,2]' AS distance FROM items;
For inner product, multiply by -1 (since <#> returns the negative inner product)
SELECT (embedding <#> '[3,1,2]') * -1 AS inner_product FROM items;
For cosine similarity, use 1 - cosine distance
SELECT 1 - (embedding <=> '[3,1,2]') AS cosine_similarity FROM items;
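To make the operators above concrete, here is a minimal plain-Python sketch of what each one computes for ordinary float vectors, together with the two conversions just described. The function names are made up for illustration and are not part of pgvector; the sketch also assumes non-zero vectors of equal length.

```python
import math

def l2_distance(a, b):
    # Mirrors <-> : Euclidean (L2) distance.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def negative_inner_product(a, b):
    # Mirrors <#> : the inner product with its sign flipped.
    return -sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    # Mirrors <=> : 1 - cosine similarity (assumes neither vector is all zeros).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def l1_distance(a, b):
    # Mirrors <+> : sum of absolute differences.
    return sum(abs(x - y) for x, y in zip(a, b))

stored = [1.0, 2.0, 3.0]   # a row's embedding
query = [3.0, 1.0, 2.0]    # the query vector used in the examples above

print(l2_distance(stored, query))              # sqrt(6), about 2.449
print(l1_distance(stored, query))              # 4.0
print(-1 * negative_inner_product(stored, query))  # inner product: 11.0
print(1 - cosine_distance(stored, query))      # cosine similarity, about 0.786
```

Smaller values always mean "closer", which is why the inner product is negated: a plain ORDER BY ... ASC then ranks the largest inner products first.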
4. Aggregate queries
Average vectors
SELECT AVG(embedding) FROM items;
Average groups of vectors (assumes the table also has a category_id column)
SELECT category_id, AVG(embedding) FROM items GROUP BY category_id;
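Averaging vectors means taking the mean of each dimension separately. A minimal Python sketch of that arithmetic follows; the helper names and the sample rows (including their category_id values) are hypothetical.

```python
from collections import defaultdict

def average_vectors(vectors):
    # Element-wise mean; returns None for an empty input, as SQL AVG yields NULL.
    if not vectors:
        return None
    dims = len(vectors[0])
    if any(len(v) != dims for v in vectors):
        raise ValueError("all vectors must have the same number of dimensions")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]

def average_by_group(rows):
    # rows: iterable of (category_id, embedding) pairs, like GROUP BY category_id.
    groups = defaultdict(list)
    for category_id, embedding in rows:
        groups[category_id].append(embedding)
    return {cid: average_vectors(vs) for cid, vs in groups.items()}

rows = [(1, [1.0, 2.0, 3.0]), (1, [4.0, 5.0, 6.0]), (2, [0.0, 0.0, 2.0])]
print(average_vectors([emb for _, emb in rows[:2]]))  # [2.5, 3.5, 4.5]
print(average_by_group(rows))
```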
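IV. Core Indexes
pgvector supports two index types, corresponding to the two algorithms mentioned earlier: HNSW and IVFFlat. Neither is created automatically; without an index, pgvector performs exact nearest neighbor search.
1. HNSW index
An HNSW index builds a multilayer graph. Compared with IVFFlat, it offers better query performance (in terms of the speed-recall tradeoff), but it takes longer to build and uses more memory. Because it has no training step like IVFFlat, the index can also be created before the table contains any data.
Create an index for each distance function you want to use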
L2 distance
CREATE INDEX ON items USING hnsw (embedding vector_l2_ops);
Note: Use halfvec_l2_ops for halfvec and sparsevec_l2_ops for sparsevec (and similar with the other distance functions)
Inner product
CREATE INDEX ON items USING hnsw (embedding vector_ip_ops);
Cosine distance
CREATE INDEX ON items USING hnsw (embedding vector_cosine_ops);
L1 distance
CREATE INDEX ON items USING hnsw (embedding vector_l1_ops);
Hamming distance
CREATE INDEX ON items USING hnsw (embedding bit_hamming_ops);
Jaccard distance
CREATE INDEX ON items USING hnsw (embedding bit_jaccard_ops);
Supported types are:
- vector: up to 2,000 dimensions
- halfvec: up to 4,000 dimensions
- bit: up to 64,000 dimensions
- sparsevec: up to 1,000 non-zero elements
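2. IVFFlat index
An IVFFlat index divides the vectors into lists (clusters) and then searches only the lists closest to the query vector. Compared with HNSW, it builds faster and uses less memory, but its query performance (the speed-recall tradeoff) is somewhat worse.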
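The following toy sketch in Python illustrates the idea of partitioning vectors into lists and scanning only the nearest ones. It is not pgvector's implementation: the centroids are hand-picked here (a real index derives them from the rows already in the table), and all names are hypothetical.

```python
import math

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_lists(vectors, centroids):
    # Assign every vector to the list of its nearest centroid.
    lists = [[] for _ in centroids]
    for v in vectors:
        nearest = min(range(len(centroids)), key=lambda i: l2(v, centroids[i]))
        lists[nearest].append(v)
    return lists

def ivf_search(query, centroids, lists, probes, k):
    # Scan only the `probes` lists whose centroids are closest to the query.
    order = sorted(range(len(centroids)), key=lambda i: l2(query, centroids[i]))
    candidates = [v for i in order[:probes] for v in lists[i]]
    return sorted(candidates, key=lambda v: l2(query, v))[:k]

centroids = [(0.0, 0.0), (10.0, 10.0)]          # assumed, normally learned from data
vectors = [(1.0, 1.0), (4.0, 4.0), (7.0, 7.0), (9.0, 9.0)]
lists = build_lists(vectors, centroids)
query = (5.2, 5.2)

print(ivf_search(query, centroids, lists, probes=1, k=1))  # [(7.0, 7.0)]
print(ivf_search(query, centroids, lists, probes=2, k=1))  # [(4.0, 4.0)]
```

With one probe the search misses the true nearest neighbor (4, 4) because it sits in the other list; probing both lists finds it at the cost of scanning every vector. That is the speed-recall tradeoff in miniature.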
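Achieving good recall with an IVFFlat index depends on three things:
- Data preparation: create the index after the table already contains data
- Number of lists:
  - Up to 1 million rows: start with rows / 1000
  - More than 1 million rows: start with sqrt(rows)
- Query tuning: set an appropriate number of probes (the probes parameter)
  - Higher values improve recall; lower values speed up queries
  - A good starting value is sqrt(lists)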
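As a worked example of the starting values above, the small Python helper below turns a row count into a suggested number of lists and probes. The function names are hypothetical, and the results are only starting points to tune from, not guaranteed optima.

```python
import math

def suggested_lists(row_count):
    # rows / 1000 up to 1M rows, sqrt(rows) above that; never less than 1.
    if row_count <= 1_000_000:
        return max(1, row_count // 1000)
    return math.isqrt(row_count)

def suggested_probes(lists):
    # Starting value of sqrt(lists), rounded down; never less than 1.
    return max(1, math.isqrt(lists))

for rows in (200_000, 4_000_000):
    lists = suggested_lists(rows)
    print(rows, lists, suggested_probes(lists))
# 200000 200 14
# 4000000 2000 44
```

The lists value goes into WITH (lists = ...) when the index is created, while probes is a query-time setting (ivfflat.probes in pgvector) and can be changed without rebuilding the index.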
Create an index for each distance function you want to use
L2 distance
CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);
Note: Use halfvec_l2_ops for halfvec (and similar with the other distance functions)
Inner product
CREATE INDEX ON items USING ivfflat (embedding vector_ip_ops) WITH (lists = 100);
Cosine distance
CREATE INDEX ON items USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
Hamming distance
CREATE INDEX ON items USING ivfflat (embedding bit_hamming_ops) WITH (lists = 100);
Supported types are:
- vector: up to 2,000 dimensions
- halfvec: up to 4,000 dimensions
- bit: up to 64,000 dimensions
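V. Other Vector Types and Their Indexes
Half-precision vectors
- Store half-precision vectors:
CREATE TABLE items (id bigserial PRIMARY KEY, embedding halfvec(3));
- Half-precision indexing (an expression index that casts to halfvec, which also lets a vector column be indexed at half precision for a smaller index):
CREATE INDEX ON items USING hnsw ((embedding::halfvec(3)) halfvec_l2_ops);
- Nearest neighbor query (the cast must match the index expression for the index to be used):
SELECT * FROM items ORDER BY embedding::halfvec(3) <-> '[1,2,3]' LIMIT 5;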
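To get a feel for what half precision trades away, the sketch below uses Python's struct module to round a value through 16-bit and 32-bit floats. It assumes halfvec elements behave like IEEE 754 half-precision floats, which is what the struct format code "e" implements; the helper names are hypothetical.

```python
import struct

def to_half(value):
    # Round-trip through a 16-bit float (struct format "e").
    return struct.unpack("<e", struct.pack("<e", value))[0]

def to_single(value):
    # Round-trip through a 32-bit float (struct format "f").
    return struct.unpack("<f", struct.pack("<f", value))[0]

print(struct.calcsize("<e"), struct.calcsize("<f"))  # 2 4 (bytes per element)
print(abs(to_half(0.1) - 0.1))    # rounding error at half precision
print(abs(to_single(0.1) - 0.1))  # much smaller error at single precision
```

Each element takes half the space, at the price of a coarser approximation of the original value. Whether that loss matters for recall depends on the embeddings and is worth measuring on real data.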
Binary vectors
- Store binary vectors:
CREATE TABLE items (id bigserial PRIMARY KEY, embedding bit(3));
INSERT INTO items (embedding) VALUES ('000'), ('111');
- Hamming distance query:
SELECT * FROM items ORDER BY embedding <~> '101' LIMIT 5;
- Jaccard distance query:
SELECT * FROM items ORDER BY embedding <%> '101' LIMIT 5;
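The two binary distances can be worked out by hand on the rows inserted above. Here is a small Python sketch over '0'/'1' strings; the function names are hypothetical, and returning 1.0 when neither vector has any set bit is an assumption of this sketch.

```python
def hamming_distance(a, b):
    # Number of positions where the bits differ (what <~> measures).
    if len(a) != len(b):
        raise ValueError("bit strings must have the same length")
    return sum(x != y for x, y in zip(a, b))

def jaccard_distance(a, b):
    # 1 - |bits set in both| / |bits set in either| (what <%> measures).
    if len(a) != len(b):
        raise ValueError("bit strings must have the same length")
    both = sum(x == "1" and y == "1" for x, y in zip(a, b))
    either = sum(x == "1" or y == "1" for x, y in zip(a, b))
    if either == 0:
        return 1.0  # assumption: no set bits at all is treated as maximally distant
    return 1.0 - both / either

query = "101"
for stored in ("000", "111"):
    print(stored, hamming_distance(stored, query), jaccard_distance(stored, query))
```

For the query '101', the row '111' is closer under both measures: it differs in one bit (Hamming distance 1) and shares two of three set bits (Jaccard distance 1/3), while '000' differs in two bits and shares none.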
Sparse vectors
- Store sparse vectors (format: {index:value,...}/dimensions, with indices starting at 1):
CREATE TABLE items (id bigserial PRIMARY KEY, embedding sparsevec(5));
INSERT INTO items (embedding) VALUES ('{1:1,3:2,5:3}/5'), ('{1:4,3:5,5:6}/5');
- L2 distance query:
SELECT * FROM items ORDER BY embedding <-> '{1:3,3:1,5:2}/5' LIMIT 5;
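The sparse text format is easy to get wrong because its indices are 1-based. The following Python sketch converts between that text form and a dense list. The helper names are hypothetical and the parsing is deliberately minimal (no whitespace tolerance or range checks), so treat it as an illustration of the format rather than a full parser.

```python
def parse_sparsevec(text):
    # '{1:1,3:2,5:3}/5' -> [1.0, 0.0, 2.0, 0.0, 3.0]
    body, dims = text.rsplit("/", 1)
    if not (body.startswith("{") and body.endswith("}")):
        raise ValueError("expected {index:value,...}/dimensions")
    dense = [0.0] * int(dims)
    inner = body[1:-1]
    if inner:
        for pair in inner.split(","):
            index, value = pair.split(":")
            dense[int(index) - 1] = float(value)  # indices in the text start at 1
    return dense

def format_sparsevec(dense):
    # Keep only non-zero elements, numbering positions from 1.
    pairs = ",".join(f"{i}:{v:g}" for i, v in enumerate(dense, start=1) if v != 0)
    return "{" + pairs + "}/" + str(len(dense))

print(parse_sparsevec("{1:1,3:2,5:3}/5"))            # [1.0, 0.0, 2.0, 0.0, 3.0]
print(format_sparsevec([3.0, 0.0, 1.0, 0.0, 2.0]))   # {1:3,3:1,5:2}/5
```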
References
Some of this content came from AI-generated answers
https://github.com/pgvector/pgvector?tab=readme-ov-file#hnsw