[Faiss] Introduction and Examples: Index Types

License: CC 4.0 BY-SA

Original link: https://blog.youkuaiyun.com/kanbuqinghuanyizhang/article/details/80774609

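This article describes the index types in the Faiss vector search library, including exact search based on L2 distance and inner product, and approximate search methods such as Hierarchical Navigable Small World graph exploration and Locality-Sensitive Hashing. It also covers quantization-based indexes such as the Scalar Quantizer and the Product Quantizer, and how composite indexes and pre-processing can improve search efficiency.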

For a simple example, see https://blog.youkuaiyun.com/kanbuqinghuanyizhang/article/details/80774609

SIFT dataset download: http://corpus-texmex.irisa.fr/

The index types are described below.

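Faiss index types:

Exact Search for L2 # exhaustive exact search based on L2 distance
Exact Search for Inner Product # exhaustive exact search based on inner product
Hierarchical Navigable Small World graph exploration # graph-based index (HNSW)
Inverted file with exact post-verification # inverted file index
Locality-Sensitive Hashing (binary flat index) # locality-sensitive hashing
Scalar quantizer (SQ) in flat mode # scalar quantization index
Product quantizer (PQ) in flat mode # product quantization index
IVF and scalar quantizer # inverted file + scalar quantization
IVFADC (coarse quantizer+PQ on residuals) # inverted file + product quantization
IVFADC+R (same as IVFADC with re-ranking based on codes) # inverted file + product quantization + code-based re-ranking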

Index	Class name	index_factory	Main parameters	Bytes/vector	Exhaustive	Comments
Exact search for L2	IndexFlatL2	"Flat"	d	4*d	yes	brute-force
Exact search for inner product	IndexFlatIP	"Flat"	d	4*d	yes	equals cosine similarity when the vectors are normalized
Hierarchical Navigable Small World graph exploration	IndexHNSWFlat	"HNSWx,Flat"	d, M	4*d + 8*M	no	-
Inverted file	IndexIVFFlat	"IVFx,Flat"	quantizer, d, nlists, metric	4*d	no	a second index (the quantizer) is needed to assign vectors to inverted lists
Locality-Sensitive Hashing (binary flat index)	IndexLSH	-	d, nbits	nbits/8	yes	optimized by using random rotation instead of random projections
Scalar quantizer (SQ) in flat mode	IndexScalarQuantizer	"SQ8"	d	d	yes	4 bits per component is also available, at some cost in accuracy
Product quantizer (PQ) in flat mode	IndexPQ	"PQx"	d, M, nbits	M (if nbits=8)	yes	-
IVF and scalar quantizer	IndexIVFScalarQuantizer	"IVFx,SQ4" "IVFx,SQ8"	quantizer, d, nlists, qtype	d or d/2	no	two encodings: 4 or 8 bits per component
IVFADC (coarse quantizer+PQ on residuals)	IndexIVFPQ	"IVFx,PQy"	quantizer, d, nlists, M, nbits	M+4 or M+8	no	memory depends on the id type (int or long); currently only nbits <= 8 is supported
IVFADC+R (same as IVFADC with re-ranking based on codes)	IndexIVFPQR	"IVFx,PQy+z"	quantizer, d, nlists, M, nbits, M_refine, nbits_refine	M+M_refine+4 or M+M_refine+8	no	-
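
To make the bytes-per-vector column concrete, here is a minimal sketch that evaluates a few of the formulas. The dimension d = 128 (as in SIFT) and M = 16 are illustrative choices, and the helper bytes_per_vector is hypothetical, not part of Faiss. For PQ it generalizes the table's "M (if nbits=8)" to ceil(M * nbits / 8), which is an assumption of this sketch.

```python
# Storage cost per vector for some rows of the table.
# bytes_per_vector is a hypothetical helper, not a Faiss function.

def bytes_per_vector(kind: str, d: int, M: int = 0, nbits: int = 8) -> int:
    """Return the per-vector storage in bytes, following the table's formulas."""
    if kind == "Flat":   # raw float32 components
        return 4 * d
    if kind == "SQ8":    # one byte per component
        return d
    if kind == "PQ":     # M sub-quantizer codes of nbits each
        return (M * nbits + 7) // 8
    if kind == "LSH":    # here nbits is the total code length
        return (nbits + 7) // 8
    raise ValueError(f"unknown index kind: {kind}")

d = 128
flat = bytes_per_vector("Flat", d)
sq8 = bytes_per_vector("SQ8", d)
pq16 = bytes_per_vector("PQ", d, M=16, nbits=8)
lsh = bytes_per_vector("LSH", d, nbits=2 * d)
print(flat, sq8, pq16, lsh)
```

With these settings a PQ code is 32 times smaller than the raw float32 vector, which is why the quantized indexes matter for large datasets.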

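Cell-probe methods

A typical way to speed up search is to partition the dataset, at the cost of losing the guarantee of finding the true nearest neighbor. Faiss uses a partition-based method with multi-probing (a variant of best-bin KD-trees).

The feature space is partitioned into ncells cells
Each database vector is assigned to one of these cells (with k-means, to the centroid nearest in Euclidean distance), and the assignments are stored in an inverted file made of ncells inverted lists
At query time, the nprobe cells closest to the query are selected
The query is compared against all vectors stored in the inverted lists of those nprobe cells.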

This is what IndexIVFFlat does. It needs a second index, the quantizer, to assign vectors to inverted lists.
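
The four steps can be sketched in pure Python. This is an illustration of the cell-probe idea only, not how Faiss implements it: the centroids are fixed by hand instead of being trained with k-means, and the helper names (build_inverted_lists, search) are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_inverted_lists(centroids, vectors):
    """Assign each vector id to the inverted list of its nearest centroid."""
    lists = [[] for _ in centroids]
    for vid, v in enumerate(vectors):
        cell = min(range(len(centroids)), key=lambda c: sq_dist(v, centroids[c]))
        lists[cell].append(vid)
    return lists

def search(query, centroids, lists, vectors, k, nprobe):
    """Scan only the nprobe cells whose centroids are closest to the query."""
    cells = sorted(range(len(centroids)),
                   key=lambda c: sq_dist(query, centroids[c]))[:nprobe]
    candidates = [vid for c in cells for vid in lists[c]]
    candidates.sort(key=lambda vid: sq_dist(query, vectors[vid]))
    return candidates[:k]

random.seed(0)
vectors = [[random.random(), random.random()] for _ in range(200)]
# Fixed centroids for the sketch; a real index would train them.
centroids = [[0.25, 0.25], [0.25, 0.75], [0.75, 0.25], [0.75, 0.75]]
lists = build_inverted_lists(centroids, vectors)

query = [0.5, 0.5]
exact = search(query, centroids, lists, vectors, k=5, nprobe=len(centroids))
approx = search(query, centroids, lists, vectors, k=5, nprobe=1)
```

With nprobe equal to the number of cells the search is exhaustive. With nprobe = 1 only one inverted list is scanned, so neighbors that fall just across a cell boundary can be missed, which is the speed/recall trade-off that nprobe controls.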

IndexIVFKmeans and IndexIVFSphericalKmeans are not classes but functions that return an IndexIVFFlat object.

Note: for high-dimensional data, reaching good recall may require a large nprobe.

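Relationship with LSH

The best-known cell-probe method is probably the original LSH method, see E2LSH. However, this method and its variants have two major drawbacks:

They need a large number of hash functions (= number of partitions) to give acceptable results
The hash functions are hard to adapt to the input data, which in practice tends to give suboptimal results

LSH example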

n_bits = 2 * d
lsh = faiss.IndexLSH (d, n_bits)
lsh.train (x_train)
lsh.add (x_base)
D, I = lsh.search (x_query, k)

d is the dimension of the input vectors and n_bits is the number of bits used to store each vector.
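
To show what a binary code of n_bits means, here is a small sketch of the textbook sign-of-random-projection scheme. It is not the Faiss implementation (the table above notes that IndexLSH uses a random rotation instead of random projections), and the helper names are hypothetical.

```python
import random

def make_projections(d, nbits, seed=0):
    """Draw nbits random Gaussian directions in d dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(nbits)]

def encode(v, projections):
    """Pack one sign bit per projection into an integer code."""
    code = 0
    for bit, direction in enumerate(projections):
        if sum(x * w for x, w in zip(v, direction)) >= 0:
            code |= 1 << bit
    return code

def hamming(a, b):
    """Number of differing bits between two codes."""
    return bin(a ^ b).count("1")

d, nbits = 8, 16
projections = make_projections(d, nbits)
rng = random.Random(1)
base = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(100)]
codes = [encode(v, projections) for v in base]

# Search: rank database codes by Hamming distance to the query code.
query = base[42]
query_code = encode(query, projections)
ranked = sorted(range(len(codes)), key=lambda i: hamming(query_code, codes[i]))
```

Each code depends only on the direction of the vector, so scaling a vector leaves its code unchanged while negating it flips every bit.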

PQ example

m = 16                                   # number of subquantizers
n_bits = 8                               # bits allocated per subquantizer
pq = faiss.IndexPQ (d, m, n_bits)        # Create the index
pq.train (x_train)                       # Training
pq.add (x_base)                          # Populate the index
D, I = pq.search (x_query, k)            # Perform a search
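
What the sub-quantizers do can be sketched without Faiss. The example below is a toy illustration with hand-written codebooks (M = 2 sub-quantizers, 4 centroids each, so 2 bits per sub-quantizer) rather than trained ones; all function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def split(v, M):
    """Cut v into M contiguous sub-vectors of equal length."""
    step = len(v) // M
    return [v[i * step:(i + 1) * step] for i in range(M)]

def pq_encode(v, codebooks):
    """Replace each sub-vector by the index of its nearest centroid."""
    code = []
    for sub, centroids in zip(split(v, len(codebooks)), codebooks):
        code.append(min(range(len(centroids)),
                        key=lambda j: sq_dist(sub, centroids[j])))
    return code

def pq_decode(code, codebooks):
    """Rebuild an approximate vector by concatenating the chosen centroids."""
    out = []
    for j, centroids in zip(code, codebooks):
        out.extend(centroids[j])
    return out

def adc_distance(query, code, codebooks):
    """Asymmetric distance: the query stays uncompressed, the database side is a code."""
    tables = [[sq_dist(sub, c) for c in centroids]
              for sub, centroids in zip(split(query, len(codebooks)), codebooks)]
    return sum(table[j] for table, j in zip(tables, code))

# Hand-written codebooks for the sketch; a real index learns them in train().
corners = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
codebooks = [corners, corners]

v = [0.1, 0.9, 0.8, 0.2]
code = pq_encode(v, codebooks)          # one small integer per sub-vector
approx_v = pq_decode(code, codebooks)   # lossy reconstruction of v
query = [0.0, 0.5, 1.0, 0.5]
dist = adc_distance(query, code, codebooks)
```

The distance computed from the lookup tables equals the distance between the query and the reconstructed vector, so the search never has to decode the database.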

PQ with an inverted file: IndexIVFPQ

coarse_quantizer = faiss.IndexFlatL2 (d)
index = faiss.IndexIVFPQ (coarse_quantizer, d,
                          ncentroids, m, 8)
index.nprobe = 5

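Composite indexes

Cell-probe method with a PQ as coarse quantizer

See the paper: The inverted multi-index, Babenko & Lempitsky, CVPR'12. In Faiss this is available as MultiIndexQuantizer. No vectors need to be added to it, so when it is used with an IndexIVF, quantizer_trains_alone must be set.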

nbits_mi = 12  # c
M_mi = 2       # m
coarse_quantizer_mi = faiss.MultiIndexQuantizer(d, M_mi, nbits_mi)
ncentroids_mi = 2 ** (M_mi * nbits_mi)

index = faiss.IndexIVFFlat(coarse_quantizer_mi, d, ncentroids_mi)
index.nprobe = 2048
index.quantizer_trains_alone = True
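
The reason a multi-index yields so many cells from so few centroids can be shown with a toy sketch. It assumes nothing about how Faiss numbers the cells internally; sub_centroids and nearest_cell are hypothetical names, and the sub-spaces are 1-D for brevity.

```python
from itertools import product

# Values from the snippet above: 2 sub-quantizers of 12 bits each.
nbits_mi, M_mi = 12, 2
ncentroids_mi = 2 ** (M_mi * nbits_mi)  # cells obtained from only 2 * 4096 centroids

# Toy version: M = 2 sub-quantizers with 4 centroids (2 bits) each.
sub_centroids = [[0.0, 1.0, 2.0, 3.0], [0.0, 10.0, 20.0, 30.0]]
cells = list(product(*sub_centroids))   # every pair of sub-centroids is a cell

def nearest_cell(v):
    """Quantize each component independently; the cell is the pair of results."""
    return tuple(min(cs, key=lambda c: (x - c) ** 2)
                 for x, cs in zip(v, sub_centroids))

cell = nearest_cell([1.2, 19.0])
```

Because the cells are a Cartesian product, their number grows as 2^(M * nbits), which is why a large nprobe such as 2048 is still a small fraction of the cells.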

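Pre-filtering PQ codes with polysemous codes

Comparing codes with Hamming distances is about 6x faster than computing PQ distances. With a suitable reordering of the PQ centroids, the Hamming distance between PQ codes becomes correlated with the PQ distance. Setting a threshold on the Hamming distance at search time therefore avoids most of the expensive PQ comparisons.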

# IndexPQ
index = faiss.IndexPQ(d, 16, 8)
# before training
index.do_polysemous_training = True
index.train(...)

# before searching
index.search_type = faiss.IndexPQ.ST_polysemous
index.polysemous_ht = 54    # the Hamming threshold
index.search(...)

# IndexIVFPQ (ncentroids is the number of inverted lists, as in the earlier example)
index = faiss.IndexIVFPQ(coarse_quantizer, d, ncentroids, 16, 8)
# before training
index.do_polysemous_training = True
index.train(...)

# before searching
index.polysemous_ht = 54    # the Hamming threshold
index.search(...)

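Two things to note when setting the threshold:

The threshold lies between 0 and the number of bits per code (16*8 = 128 here)
The lower the threshold, the fewer codes are left for the full PQ distance computation; values below 1/2 * bits are recommended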
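
The effect of the threshold can be sketched as a two-stage filter. This is an illustration only: the codes are toy 8-bit integers, the "PQ distances" are made-up numbers standing in for the expensive computation, and the strict comparison against ht is an assumption of the sketch.

```python
def hamming(a, b):
    """Number of differing bits between two codes."""
    return bin(a ^ b).count("1")

def filtered_search(query_code, codes, ht, expensive_distance, k):
    """Run the expensive distance only on codes that pass the Hamming test."""
    survivors = [i for i, c in enumerate(codes) if hamming(query_code, c) < ht]
    survivors.sort(key=expensive_distance)
    return survivors[:k], len(survivors)

codes = [0b00000000, 0b00000001, 0b00001111, 0b11111111, 0b00000011]
pq_distances = [0.0, 0.5, 2.0, 9.0, 0.3]   # stand-in values for the sketch

top, n_compared = filtered_search(
    query_code=0b00000000,
    codes=codes,
    ht=3,
    expensive_distance=lambda i: pq_distances[i],
    k=2,
)
```

Here only three of the five codes reach the expensive stage. Lowering ht removes more candidates and speeds up the search, at the risk of discarding true neighbors.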

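Composite indexes can also stack multiple levels of PQ quantization.

Pre- and post-processing

To get a better index, you can remap vector ids, apply transformations to the data, and re-rank search results.

Faiss id mapping

By default, Faiss assigns sequential ids to the vectors as they are added. Some Index classes implement add_with_ids, which attaches 64-bit ids to the vectors; search then returns these ids rather than the original vectors.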

index = faiss.IndexFlatL2(xb.shape[1]) 
ids = np.arange(xb.shape[0])
index.add_with_ids(xb, ids)  # this will crash, because IndexFlatL2 does not support add_with_ids
index2 = faiss.IndexIDMap(index)
index2.add_with_ids(xb, ids) # works, the vectors are stored in the underlying index

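IndexIVF natively provides add_with_ids, so IndexIDMap is not needed there.

Pre-transforming the data

Transformation	Class name	Comments
random rotation	RandomRotationMatrix	useful to re-balance components of a vector before indexing in an IndexPQ or IndexLSH
remapping of dimensions	RemapDimensionsTransform	reduces or increases the vector dimension d by remapping components, to match the dimension an index prefers
PCA	PCAMatrix	dimensionality reduction
OPQ rotation	OPQMatrix	OPQ rotates the input vectors so that they are better suited to PQ encoding, see Optimized product quantization, Ge et al., CVPR'13

Transformations can be trained with train and applied to data with apply. They can be attached to an index through the IndexPreTransform class.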

# the IndexIVFPQ will be in 256D not 2048
coarse_quantizer = faiss.IndexFlatL2 (256)
sub_index = faiss.IndexIVFPQ (coarse_quantizer, 256, ncoarse, 16, 8)
# PCA 2048->256
# also does a random rotation after the reduction (the 4th argument)
pca_matrix = faiss.PCAMatrix (2048, 256, 0, True) 

#- the wrapping index
index = faiss.IndexPreTransform (pca_matrix, sub_index)

# will also train the PCA
index.train(...)
# PCA will be applied prior to addition
index.add(...)

IndexRefineFlat

Re-ranks search results using exact distances.

q = faiss.IndexPQ(d, M, nbits_per_index)
rq = faiss.IndexRefineFlat(q)
rq.train(xt)
rq.add(xb)
rq.k_factor = 4
D, I = rq.search(xq, 10)

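The true distances are computed for the 4*10 nearest neighbors returned by the IndexPQ, and the best 10 are kept. Note that IndexRefineFlat has to store the full vectors, so it uses a lot of memory.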
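
The re-ranking logic itself is short. The sketch below uses 1-D "vectors" and a crude rounding step as a stand-in for a lossy index; refine_search and the distance helpers are hypothetical names, not Faiss functions.

```python
def refine_search(query, k, k_factor, approx_dist, exact_dist, n):
    """Shortlist k * k_factor candidates cheaply, then re-rank them exactly."""
    shortlist = sorted(range(n), key=lambda i: approx_dist(query, i))[:k * k_factor]
    return sorted(shortlist, key=lambda i: exact_dist(query, i))[:k]

vectors = [0.1, 0.45, 0.55, 0.9, 0.52, 0.3]
# Stand-in for a lossy code: round every value to the nearest multiple of 0.5.
quantized = [round(v * 2) / 2 for v in vectors]

def approx_dist(q, i):
    return abs(q - quantized[i])

def exact_dist(q, i):
    return abs(q - vectors[i])

# Without refinement the quantized distances cannot separate ids 1, 2, 4 and 5.
coarse_only = refine_search(0.5, 1, 1, approx_dist, approx_dist, len(vectors))
# With k_factor = 4 the exact distances pick the true nearest neighbor.
refined = refine_search(0.5, 1, 4, approx_dist, exact_dist, len(vectors))
```

The exact pass needs the original values, which mirrors the memory cost noted above.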

IndexShards

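When the data is split across several indexes, the result sets have to be merged at query time. This is required for multi-GPU search and for parallel queries.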
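
The merge step can be sketched as follows, assuming each shard returns its hits sorted by increasing distance and that ids are already global (both are assumptions of this sketch; merge_shard_results is a hypothetical helper).

```python
import heapq
from itertools import islice

def merge_shard_results(shard_results, k):
    """Merge per-shard (distance, id) lists, each sorted ascending, into a global top-k."""
    return list(islice(heapq.merge(*shard_results), k))

shard0 = [(0.1, 3), (0.7, 1)]
shard1 = [(0.2, 12), (0.3, 10)]
top3 = merge_shard_results([shard0, shard1], k=3)
print(top3)
```

Each shard only needs to return its own k best results, since no vector outside a shard's top k can appear in the global top k.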

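Index I/O and cloning

All of these functions make deep copies, so there is no need to track ownership between objects.

I/O functions:

write_index(index, "large.index"): writes an index to a file
Index * index = read_index("large.index"): reads an index from a file

Cloning functions:

Index* index2 = clone_index(index): returns a deep copy of the index
Index *index_cpu_to_gpu = index_cpu_to_gpu(resource, dev_no, index): copies an index to a GPU
Index *index_gpu_to_cpu = index_gpu_to_cpu(index): copies an index from GPU to CPU
index_cpu_to_gpu_multiple: uses an IndexShards or IndexProxy to copy the index to several GPUs.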

index_factory

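index_factory builds an index from a string made of three parts: pre-processing, inverted file, and encoding.
Supported pre-processing:

PCA: PCA64 reduces the dimension to 64 with PCA (implemented by PCAMatrix); PCAR64 adds a random rotation after the PCA.
OPQ: OPQ16 rotates the vectors so that a PQ with 16 sub-vectors (16 bytes at 8 bits each) encodes them well (implemented by OPQMatrix). It is very effective for PQ indexes but makes training slower.

Supported inverted files:

IVF: IVF4096 splits the data into 4096 cells, using an IndexFlatL2 as coarse quantizer
IMI: IMI2x8 builds an inverted multi-index with 2x8 bits (MultiIndexQuantizer), giving 2^(2*8) inverted lists
IDMap: when no inverted file is used but add_with_ids is needed, IndexIDMap adds id support

Supported encodings:

Flat: stores the raw vectors, implemented by IndexFlat or IndexIVFFlat
PQ: PQ16 encodes each vector with 16 bytes, implemented by IndexPQ or IndexIVFPQ
PQ8+16: an 8-byte PQ, plus a 16-byte PQ on the first-level quantization error, implemented by IndexIVFPQR

For example:
index = index_factory(128, "OPQ16_64,IMI2x8,PQ8+16"): takes 128-dimensional vectors; pre-processes them with OPQ, where 16 is the number of blocks used by OPQ and 64 is the output dimension; builds 65536 (2^16) inverted lists with a multi-index; and encodes with an 8-byte PQ plus a 16-byte refinement used for re-ranking.

OPQ is a very effective pre-processing step, unless the raw data already has a block-wise structure, as SIFT does.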
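
To make the three-part structure of the string explicit, here is a small sketch that sorts the comma-separated components into the three groups by prefix. It is an illustrative classifier written for this article, not the parser Faiss uses, and describe_factory_string is a hypothetical name.

```python
def describe_factory_string(spec):
    """Group the comma-separated components of a factory string by role."""
    parts = {"preprocessing": [], "inverted": [], "encoding": []}
    for token in spec.split(","):
        if token.startswith(("PCA", "OPQ")):
            parts["preprocessing"].append(token)
        elif token.startswith(("IVF", "IMI", "IDMap")):
            parts["inverted"].append(token)
        else:
            parts["encoding"].append(token)
    return parts

example = describe_factory_string("OPQ16_64,IMI2x8,PQ8+16")
simple = describe_factory_string("PCAR64,IVF4096,Flat")

# IMI2x8: 2 sub-quantizers of 8 bits each.
n_lists = 2 ** (2 * 8)
print(example, n_lists)
```

Only the prefixes listed in this section are recognized; anything else is treated as an encoding.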

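Auto-tuning

Index parameters are of two kinds: build-time parameters, set when the index is created, and run-time parameters, which can be adjusted before each search. Auto-tuning applies to the run-time parameters.

Key	Class name	Run-time parameter	Comments
IVF, IMI2x	IndexIVF*	nprobe	controls the speed/accuracy trade-off
IMI2x*	IndexIVF	max_codes	balances the inverted lists
PQ*	IndexIVFPQ, IndexPQ	ht	Hamming threshold for polysemous
PQ+	IndexIVFPQR	k_factor	number of candidates to verify when re-ranking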

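AutoTuneCriterion: holds the ground truth and evaluates search results against it, for example in terms of recall
OperatingPoints: holds (performance, time, parameter set id) tuples; the goal is to find the optimal operating points, those for which no other point reaches better performance in less time
ParameterSpace: the number of parameter combinations grows exponentially, but the parameters share one property: higher values generally mean slower search and better accuracy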
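
The notion of an optimal operating point can be sketched as a Pareto filter over (performance, time, parameter set id) tuples. The numbers and parameter labels below are made up for illustration, and optimal_operating_points is a hypothetical helper, not the Faiss OperatingPoints class.

```python
def optimal_operating_points(points):
    """Keep the points that no faster (or equally fast) point matches or beats in performance."""
    best = []
    # Visit points from fastest to slowest; on equal time, best performance first.
    for perf, t, params_id in sorted(points, key=lambda p: (p[1], -p[0])):
        if not best or perf > best[-1][0]:
            best.append((perf, t, params_id))
    return best

points = [
    (0.50, 1.0, "nprobe=1"),
    (0.70, 2.0, "nprobe=4"),
    (0.65, 3.0, "nprobe=8,ht=20"),   # slower and worse than nprobe=4
    (0.90, 5.0, "nprobe=16"),
]
frontier = optimal_operating_points(points)
print(frontier)
```

The third point is dropped because another setting reaches better performance in less time; the remaining points form the speed/accuracy curve from which a setting is chosen.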

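faiss/tests/demo_sift1M.cpp contains an auto-tuning example. Auto-tuning relies on a complete and sufficiently large evaluation set and on a stable machine environment.

Special operations

Reconstructing vectors from an index, see test_index_composite.py
Supported by IndexFlat, IndexIVFFlat (call make_direct_map first), IndexIVFPQ (same), IndexPreTransform (provided the underlying transform supports it)
Removing elements from an index, with the remove_ids method
See test_index_composite.py; supported by IndexFlat, IndexIVFFlat, IndexIVFPQ, IDMap
Range search, with the range_search method
Returns all vectors within a given radius of the query. In Python it returns a tuple of 1D arrays lims, D, I; the results for query i are I[lims[i]:lims[i+1]], D[lims[i]:lims[i+1]]. Supported by IndexFlat, IndexIVFFlat
Merging and splitting indexes
merge_from merges another index into the current one, and copy_subset_to copies a subset of the current index into another index; supported by IndexIVF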
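
The lims/D/I layout returned by range_search is easiest to read on a concrete case. The values below are made up for three queries and use plain lists in place of arrays; results_for_query is a hypothetical helper.

```python
# Flattened results for 3 queries: query 0 has 2 hits, query 1 none, query 2 has 3.
lims = [0, 2, 2, 5]
D = [0.1, 0.4, 0.2, 0.3, 0.5]   # distances, all queries concatenated
I = [7, 3, 9, 1, 4]             # matching vector ids, same order as D

def results_for_query(i):
    """Slice out the ids and distances that belong to query i."""
    return I[lims[i]:lims[i + 1]], D[lims[i]:lims[i + 1]]

for q in range(len(lims) - 1):
    ids, dists = results_for_query(q)
    print(q, ids, dists)
```

Since each query can match a different number of vectors, lims has one more entry than there are queries and its last value is the total number of hits.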

https://www.cnblogs.com/houkai/p/9316155.html

https://www.cnblogs.com/houkai/p/9316172.html
