Faiss (8): Analysis of the IVFPQ Add Process

Latest recommended article published 2025-02-28 18:55:39

License: CC 4.0 BY-SA

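This article walks through how vectors are added to an IVF-PQ index. It covers the key steps (the quantizer finding the nearest centroid of each vector, computing residuals, encoding the vectors, and appending them to the inverted lists) and records the run time of adding 1 million vectors.

1. Overview

A trained index is still an empty shell: the vectors of the dataset must be added to it before it can be used for searching.

2. Process analysis

2.1 app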

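# Assumes numpy is imported as np and that index, d, nb, seg and enu are
# defined earlier in the script (per the text below: enu = 10, seg = 100000, d = 64).
for n, s in enumerate(range(4010,4010+enu*10,10)):
    np.random.seed(s)                                  # a different seed for each batch
    xb = np.random.random((seg,d)).astype('float32')   # seg random d-dimensional vectors
    xb[:,0] += (np.arange(seg)+n*seg) * 1.0/nb         # shift component 0 by (global position / nb)
    index.add(xb)                                      # add the batch to the trained index
    del xb                                             # release the batch before the next one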

The vectors are added in ten batches. Each batch adds 100000 64-dimensional float32 vectors to the trained index by calling index.add(xb).

2.2 faiss core

The add() function

void IndexIVF::add (idx_t n, const float * x)
{
    add_with_ids (n, x, nullptr);
}

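n: number of vectors in the batch being added, 100000 here
x: pointer to the first element of the vector data

IndexIVFPQ does not override add, so the add function defined in the parent class IndexIVF is used directly.

The add_with_ids() function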

void IndexIVFPQ::add_with_ids (idx_t n, const float * x, const idx_t *xids)
{
    add_core_o (n, x, xids, nullptr);
}

IndexIVFPQ does override this function, so the implementation in that class is the one called.

The add_core_o() function

void IndexIVFPQ::add_core_o (idx_t n, const float * x, const idx_t *xids,
                             float *residuals_2, const idx_t *precomputed_idx)
{

    idx_t bs = 32768;
    if (n > bs) {
        for (idx_t i0 = 0; i0 < n; i0 += bs) {
            idx_t i1 = std::min(i0 + bs, n);
            if (verbose) {
                printf("IndexIVFPQ::add_core_o: adding %ld:%ld / %ld\n",
                       i0, i1, n);
            }
            add_core_o (i1 - i0, x + i0 * d,
                        xids ? xids + i0 : nullptr,
                        residuals_2 ? residuals_2 + i0 * d : nullptr,
                        precomputed_idx ? precomputed_idx + i0 : nullptr);
        }
        return;
    }

    InterruptCallback::check();

    FAISS_THROW_IF_NOT (is_trained);
    double t0 = getmillisecs ();
    const idx_t * idx;
    ScopeDeleter<idx_t> del_idx;

    if (precomputed_idx) {
        idx = precomputed_idx;
    } else {
        idx_t * idx0 = new idx_t [n];
        del_idx.set (idx0);
        // store the id of the nearest centroid of each vector in x into idx0
        quantizer->assign (n, x, idx0);
        idx = idx0;
    }

    double t1 = getmillisecs ();
    uint8_t * xcodes = new uint8_t [n * code_size];
    ScopeDeleter<uint8_t> del_xcodes (xcodes);

    const float *to_encode = nullptr;
    ScopeDeleter<float> del_to_encode;

    if (by_residual) {
        // compute the residuals
        to_encode = compute_residuals (quantizer, n, x, idx);
        del_to_encode.set (to_encode);
    } else {
        to_encode = x;
    }
    // encode (quantize) the vectors
    pq.compute_codes (to_encode, xcodes, n);

    double t2 = getmillisecs ();
    // TODO: parallelize?
    size_t n_ignore = 0;
    for (size_t i = 0; i < n; i++) {
        idx_t key = idx[i];
        if (key < 0) {
            n_ignore ++;
            if (residuals_2)
                memset (residuals_2, 0, sizeof(*residuals_2) * d);
            continue;
        }
        idx_t id = xids ? xids[i] : ntotal + i;

        uint8_t *code = xcodes + i * code_size;
        size_t offset = invlists->add_entry (key, id, code);

        if (residuals_2) {
            float *res2 = residuals_2 + i * d;
            const float *xi = to_encode + i * d;
            // decode the vector
            pq.decode (code, res2);
            for (int j = 0; j < d; j++)
                res2[j] = xi[j] - res2[j];
        }

        if (maintain_direct_map)
            direct_map.push_back (key << 32 | offset);
    }


    double t3 = getmillisecs ();
    if(verbose) {
        char comment[100] = {0};
        if (n_ignore > 0)
            snprintf (comment, 100, "(%ld vectors ignored)", n_ignore);
        printf(" add_core times: %.3f %.3f %.3f %s\n",
               t1 - t0, t2 - t1, t3 - t2, comment);
    }
    ntotal += n;
}

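residuals_2: output buffer for the second-level residuals (what remains of each encoded vector after subtracting its PQ reconstruction); defaults to nullptr
precomputed_idx: precomputed inverted-list assignments; when it is given, the quantizer->assign call is skipped; defaults to nullptr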
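
The second-level residual written to residuals_2 can be illustrated with a small pure-Python sketch. It is not the Faiss implementation: the toy product quantizer below (one scalar codebook per dimension) and the names pq_encode, pq_decode and codebooks are assumptions made for illustration. It only mirrors the pq.decode call and the res2[j] = xi[j] - res2[j] loop in the code above.

```python
def pq_encode(codebooks, v):
    """Toy PQ with one scalar codebook per dimension (hypothetical, for illustration)."""
    return [min(range(len(book)), key=lambda j: (book[j] - x) ** 2)
            for book, x in zip(codebooks, v)]

def pq_decode(codebooks, code):
    """Rebuild the approximation of the vector from its code."""
    return [book[j] for book, j in zip(codebooks, code)]

# Two dimensions, two centroids per dimension.
codebooks = [[0.0, 1.0], [0.0, 0.5]]
to_encode = [0.75, 0.125]          # the vector handed to the PQ (x or its residual)

code = pq_encode(codebooks, to_encode)
decoded = pq_decode(codebooks, code)
# Second-level residual: what the PQ code failed to capture.
res2 = [x - y for x, y in zip(to_encode, decoded)]
print(code, decoded, res2)
```

The smaller the entries of res2, the closer the stored code is to the vector that was encoded.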

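bs sets the maximum number of vectors added in one pass to 32768. When n exceeds it, the input is split into chunks of at most bs vectors and add_core_o is called recursively on each chunk.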
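
The chunking can be reproduced with a few lines of Python. This is a sketch of the loop bounds only; chunk_ranges is a hypothetical helper name, not a Faiss function.

```python
def chunk_ranges(n, bs=32768):
    """Return the half-open (i0, i1) ranges that the chunked add would process."""
    if n <= bs:
        # Small inputs are handled in a single pass.
        return [(0, n)]
    return [(i0, min(i0 + bs, n)) for i0 in range(0, n, bs)]

ranges = chunk_ranges(100000)
for i0, i1 in ranges:
    print(f"adding {i0}:{i1} / 100000")
```

For a batch of 100000 vectors this gives three full chunks of 32768 vectors and a final chunk of 1696, which is the pattern that repeats in the run log in section 3.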

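Finally, three durations are recorded (getmillisecs returns milliseconds):

t1 - t0: time for the quantizer to find the nearest centroid of each vector;
t2 - t1: time to compute the residuals and encode the vectors;
t3 - t2: time to append the vectors to the inverted lists.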
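
To compare the three stages, a verbose output line can be split into its three durations. The following is a minimal sketch; the helper name parse_add_core_times and the stage labels are assumptions, and the sample line is the first "add_core times" line of the run log in section 3.

```python
import re

# Matches the three numbers printed by the verbose "add_core times" line.
TIMES_RE = re.compile(r"add_core times:\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)")

def parse_add_core_times(line):
    """Return the three stage durations in milliseconds, or None if no match."""
    m = TIMES_RE.search(line)
    if m is None:
        return None
    assign_ms, encode_ms, insert_ms = (float(g) for g in m.groups())
    return {"assign": assign_ms, "encode": encode_ms, "insert": insert_ms}

sample = " add_core times: 391.784 78.603 1.859 "
times = parse_add_core_times(sample)
total = sum(times.values())
shares = {stage: ms / total for stage, ms in times.items()}
print(times, shares)
```

On this sample line the coarse assignment (t1 - t0) accounts for more than 80% of the time, while appending to the inverted lists is negligible.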

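2.3 Code summary

The code flow shows that an IVFPQ index adds at most 32768 vectors to the index in one pass.

The key steps of adding vectors are:

quantizer->assign
Finds the key (the id of the nearest centroid) of each of the up to 32768 vectors and stores the keys in the array idx, one element per vector; vectors whose key is less than 0 are not added to the index.
compute_residuals
Calls quantizer->compute_residual to compute the residual of each vector and returns a pointer to the resulting n*d floats.
pq.compute_codes
Quantizes the vectors, that is, encodes the values of the vectors passed in (the residuals when by_residual is set) into PQ codes of code_size bytes each.
invlists->add_entry
Appends the encoded vector to the inverted lists. "Inverted" does not mean sorted in reverse order: the index keeps one list per key, and each id and code is appended to the list of its key.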
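
The four steps can be put together in a small pure-Python sketch. It is a toy model under stated assumptions, not the Faiss implementation: the function names, the tiny centroid table and the codebooks are all made up for illustration, residual encoding (by_residual) is assumed to be on, and the key < 0 case is not modeled.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(centroids, x):
    """Coarse quantizer: index of the nearest centroid (the inverted-list key)."""
    return min(range(len(centroids)), key=lambda k: sq_dist(centroids[k], x))

def compute_residual(centroids, x, key):
    """Vector minus the centroid it was assigned to."""
    return [xi - ci for xi, ci in zip(x, centroids[key])]

def pq_encode(codebooks, v):
    """Split v into len(codebooks) sub-vectors; store one centroid index per sub-vector."""
    dsub = len(v) // len(codebooks)
    code = []
    for m, book in enumerate(codebooks):
        sub = v[m * dsub:(m + 1) * dsub]
        code.append(min(range(len(book)), key=lambda j: sq_dist(book[j], sub)))
    return code

def add(centroids, codebooks, invlists, vectors, ntotal):
    """Assign, take residuals, encode, and append to the list selected by the key."""
    for i, x in enumerate(vectors):
        key = assign(centroids, x)
        residual = compute_residual(centroids, x, key)
        code = pq_encode(codebooks, residual)
        invlists[key].append((ntotal + i, code))   # sequential ids, as when xids is null
    return ntotal + len(vectors)

# Toy setup: 4-dimensional vectors, 2 coarse centroids, 2 sub-quantizers of 2 centroids each.
centroids = [[0.0, 0.0, 0.0, 0.0], [10.0, 10.0, 10.0, 10.0]]
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [-1.0, -1.0]],
]
invlists = [[] for _ in centroids]
vectors = [[1.0, 1.0, 0.0, 0.0], [9.0, 9.0, 9.0, 9.0], [0.0, 0.0, -1.0, -1.0]]

ntotal = add(centroids, codebooks, invlists, vectors, 0)
print(ntotal, invlists)
```

Each inverted list ends up holding the (id, code) pairs of the vectors whose nearest centroid is that list's key, in insertion order.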

3. Run log

IVF-PQ adding...
IndexIVFPQ::add_core_o: adding 0:32768 / 100000
 add_core times: 391.784 78.603 1.859 
IndexIVFPQ::add_core_o: adding 32768:65536 / 100000
 add_core times: 423.547 87.102 1.617 
IndexIVFPQ::add_core_o: adding 65536:98304 / 100000
 add_core times: 415.935 78.211 1.512 
IndexIVFPQ::add_core_o: adding 98304:100000 / 100000
 add_core times: 87.510 31.999 0.134 
IndexIVFPQ::add_core_o: adding 0:32768 / 100000
 add_core times: 486.040 71.350 1.506 
IndexIVFPQ::add_core_o: adding 32768:65536 / 100000
 add_core times: 526.438 61.646 1.508 
IndexIVFPQ::add_core_o: adding 65536:98304 / 100000
 add_core times: 409.483 80.746 1.461 
IndexIVFPQ::add_core_o: adding 98304:100000 / 100000
 add_core times: 129.774 37.277 0.093 
IndexIVFPQ::add_core_o: adding 0:32768 / 100000
 add_core times: 457.254 62.888 1.524 
IndexIVFPQ::add_core_o: adding 32768:65536 / 100000
 add_core times: 387.633 76.429 1.485 
IndexIVFPQ::add_core_o: adding 65536:98304 / 100000
 add_core times: 333.967 74.983 1.482 
IndexIVFPQ::add_core_o: adding 98304:100000 / 100000
 add_core times: 95.467 29.619 0.096 
IndexIVFPQ::add_core_o: adding 0:32768 / 100000
 add_core times: 524.694 85.533 1.521 
IndexIVFPQ::add_core_o: adding 32768:65536 / 100000
 add_core times: 479.922 81.707 1.527 
IndexIVFPQ::add_core_o: adding 65536:98304 / 100000
 add_core times: 485.752 78.159 1.451 
IndexIVFPQ::add_core_o: adding 98304:100000 / 100000
 add_core times: 100.194 43.996 0.089 
IndexIVFPQ::add_core_o: adding 0:32768 / 100000
 add_core times: 499.971 83.684 1.510 
IndexIVFPQ::add_core_o: adding 32768:65536 / 100000
 add_core times: 430.709 68.070 1.396 
IndexIVFPQ::add_core_o: adding 65536:98304 / 100000
 add_core times: 464.378 62.486 1.429 
IndexIVFPQ::add_core_o: adding 98304:100000 / 100000
 add_core times: 68.331 38.117 0.103 
IndexIVFPQ::add_core_o: adding 0:32768 / 100000
 add_core times: 378.162 98.596 1.546 
IndexIVFPQ::add_core_o: adding 32768:65536 / 100000
 add_core times: 530.907 78.745 1.497 
IndexIVFPQ::add_core_o: adding 65536:98304 / 100000
 add_core times: 676.605 76.913 1.553 
IndexIVFPQ::add_core_o: adding 98304:100000 / 100000
 add_core times: 157.492 31.997 0.121 
IndexIVFPQ::add_core_o: adding 0:32768 / 100000
 add_core times: 488.498 75.824 1.386 
IndexIVFPQ::add_core_o: adding 32768:65536 / 100000
 add_core times: 418.566 78.788 1.481 
IndexIVFPQ::add_core_o: adding 65536:98304 / 100000
 add_core times: 349.676 72.210 1.319 
IndexIVFPQ::add_core_o: adding 98304:100000 / 100000
 add_core times: 79.160 38.431 0.097 
IndexIVFPQ::add_core_o: adding 0:32768 / 100000
 add_core times: 360.658 93.961 1.895 
IndexIVFPQ::add_core_o: adding 32768:65536 / 100000
 add_core times: 348.438 77.360 1.424 
IndexIVFPQ::add_core_o: adding 65536:98304 / 100000
 add_core times: 588.783 47.309 1.398 
IndexIVFPQ::add_core_o: adding 98304:100000 / 100000
 add_core times: 51.267 33.624 0.115 
IndexIVFPQ::add_core_o: adding 0:32768 / 100000
 add_core times: 403.233 83.888 1.303 
IndexIVFPQ::add_core_o: adding 32768:65536 / 100000
 add_core times: 502.740 77.987 1.299 
IndexIVFPQ::add_core_o: adding 65536:98304 / 100000
 add_core times: 436.689 87.189 1.305 
IndexIVFPQ::add_core_o: adding 98304:100000 / 100000
 add_core times: 171.474 33.624 0.083 
IndexIVFPQ::add_core_o: adding 0:32768 / 100000
 add_core times: 559.980 67.325 1.337 
IndexIVFPQ::add_core_o: adding 32768:65536 / 100000
 add_core times: 431.339 103.965 1.462 
IndexIVFPQ::add_core_o: adding 65536:98304 / 100000
 add_core times: 473.232 85.807 1.676 
IndexIVFPQ::add_core_o: adding 98304:100000 / 100000
 add_core times: 197.662 31.670 0.129 
IVF-PQ add done! 18.32587718963623
IVF-PQ ntotal after adding:  1000000