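1️⃣ $k\text{-Means}$ clustering
- Meaning: an unsupervised learning method that partitions a dataset into $k$ clusters (one centroid per cluster), so that points in the same cluster lie close together and points in different clusters lie far apart
- Procedure: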
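A minimal sketch of the usual $k\text{-Means}$ iteration (Lloyd's algorithm): assign every point to its nearest centroid, move every centroid to the mean of its members, and repeat. The function names `squared_distance` and `kmeans`, the fixed iteration count, and the random initialization from data points are assumptions made for illustration. Plain Python is used for readability; a real implementation would normally rely on NumPy or a clustering library.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm; returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Initialization (an assumption): use k distinct data points as centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments


# Two well-separated groups of 2-D points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, assignments = kmeans(points, k=2)
print(centroids)
print(assignments)
```

The cluster labels themselves are arbitrary (they depend on the initialization); only the grouping is meaningful.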
2️⃣ The $\text{PQ}$ (product quantization) algorithm
- Given $k$ vectors of dimension $D$, each regrouped into consecutive sub-vectors (3 dimensions each in this example):
  $$\small\begin{cases}
  \textbf{v}_1=[x_{11},x_{12},x_{13},x_{14},...,x_{1D}]\\
  \textbf{v}_2=[x_{21},x_{22},x_{23},x_{24},...,x_{2D}]\\
  \,\,\,\,\,\,\,\,\,\,\,\,.........\\
  \textbf{v}_k=[x_{k1},x_{k2},x_{k3},x_{k4},...,x_{kD}]
  \end{cases}\xleftrightarrow{}
  \begin{cases}
  \textbf{v}_{1}=\{[x_{11},x_{12},x_{13}],[x_{14},x_{15},x_{16}],...,[x_{1(D-2)},x_{1(D-1)},x_{1D}]\}\\
  \textbf{v}_{2}=\{[x_{21},x_{22},x_{23}],[x_{24},x_{25},x_{26}],...,[x_{2(D-2)},x_{2(D-1)},x_{2D}]\}\\
  \,\,\,\,\,\,\,\,\,\,\,\,.........\\
  \textbf{v}_{k}=\{[x_{k1},x_{k2},x_{k3}],[x_{k4},x_{k5},x_{k6}],...,[x_{k(D-2)},x_{k(D-1)},x_{kD}]\}
  \end{cases}$$
- Split into subspaces: divide each $D$-dimensional vector into $M$ sub-vectors of $\cfrac{D}{M}$ dimensions, so that subspace $j$ collects the $j$-th sub-vector $\textbf{v}_{ij}$ of every vector $\textbf{v}_i$
  $$\small\text{Subspace }1\begin{cases}
  \textbf{v}_{11}=[x_{11},x_{12},x_{13}]\\
  \textbf{v}_{21}=[x_{21},x_{22},x_{23}]\\
  \,\,\,\,\,\,\,\,\,\,\,\,.........\\
  \textbf{v}_{k1}=[x_{k1},x_{k2},x_{k3}]
  \end{cases}\&\ \text{Subspace }2
  \begin{cases}
  \textbf{v}_{12}=[x_{14},x_{15},x_{16}]\\
  \textbf{v}_{22}=[x_{24},x_{25},x_{26}]\\
  \,\,\,\,\,\,\,\,\,\,\,\,.........\\
  \textbf{v}_{k2}=[x_{k4},x_{k5},x_{k6}]
  \end{cases}\&...\&\ \text{Subspace }M
  \begin{cases}
  \textbf{v}_{1M}=[x_{1(D-2)},x_{1(D-1)},x_{1D}]\\
  \textbf{v}_{2M}=[x_{2(D-2)},x_{2(D-1)},x_{2D}]\\
  \,\,\,\,\,\,\,\,\,\,\,\,.........\\
  \textbf{v}_{kM}=[x_{k(D-2)},x_{k(D-1)},x_{kD}]
  \end{cases}$$
- Generate the $\text{PQ}$ codes:
  $$\small\text{Subspace }1\begin{cases}
  \textbf{v}_{11}\xleftarrow{\text{replaces}}\text{Centroid}_{11}\\
  \textbf{v}_{21}\xleftarrow{\text{replaces}}\text{Centroid}_{21}\\
  \,\,\,\,\,\,\,\,\,\,\,\,.........\\
  \textbf{v}_{k1}\xleftarrow{\text{replaces}}\text{Centroid}_{k1}
  \end{cases}\&\ \text{Subspace }2
  \begin{cases}
  \textbf{v}_{12}\xleftarrow{\text{replaces}}\text{Centroid}_{12}\\
  \textbf{v}_{22}\xleftarrow{\text{replaces}}\text{Centroid}_{22}\\
  \,\,\,\,\,\,\,\,\,\,\,\,.........\\
  \textbf{v}_{k2}\xleftarrow{\text{replaces}}\text{Centroid}_{k2}
  \end{cases}\&...\&\ \text{Subspace }M
  \begin{cases}
  \textbf{v}_{1M}\xleftarrow{\text{replaces}}\text{Centroid}_{1M}\\
  \textbf{v}_{2M}\xleftarrow{\text{replaces}}\text{Centroid}_{2M}\\
  \,\,\,\,\,\,\,\,\,\,\,\,.........\\
  \textbf{v}_{kM}\xleftarrow{\text{replaces}}\text{Centroid}_{kM}
  \end{cases}$$
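  - Clustering: run $k\text{-Means}$ separately in each subspace (typically with $k\text{=}256$ centroids per subspace; this $k$ is the cluster count, not the number of vectors above) $\to$ every sub-vector $\textbf{v}_{ij}$ is assigned to one centroid of $\cfrac{D}{M}$ dimensions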
  - Encoding: the index of the centroid that each sub-vector $\textbf{v}_{ij}$ belongs to becomes its $\text{PQ}$ code and replaces the original sub-vector
- Generate the final compressed vectors $\to$
  $$\small\begin{cases}
  \widetilde{\textbf{v}_{1}}=\{\text{Centroid}_{11},\text{Centroid}_{12},...,\text{Centroid}_{1M}\}\\
  \widetilde{\textbf{v}_{2}}=\{\text{Centroid}_{21},\text{Centroid}_{22},...,\text{Centroid}_{2M}\}\\
  \,\,\,\,\,\,\,\,\,\,\,\,.........\\
  \widetilde{\textbf{v}_{k}}=\{\text{Centroid}_{k1},\text{Centroid}_{k2},...,\text{Centroid}_{kM}\}
  \end{cases}$$
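  - Storage stage: what is actually stored is the centroid indices, so each vector takes only $M$ entries (with 256 centroids per subspace, each index fits in one byte)
  - Usage stage: every centroid index is decompressed back into its centroid, so each vector returns to $M\text{×}\cfrac{D}{M}\text{=}D$ dimensions, as an approximation of the original vector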
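The steps above (split, cluster per subspace, encode as indices, decode back to centroids) can be sketched as follows. This is an illustrative sketch under stated assumptions, not a reference implementation: the helper names `split`, `train_codebooks`, `encode`, and `decode` are hypothetical, the embedded `kmeans` is a tiny Lloyd's iteration with random initialization, and the toy data uses 4 centroids per subspace instead of 256 to stay small. It also assumes that $D$ is divisible by $M$.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=10, seed=0):
    """Tiny Lloyd's k-Means; returns the list of k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        buckets = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            buckets[nearest].append(p)
        for j, members in enumerate(buckets):
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return centroids


def split(vector, M):
    """Cut a D-dimensional vector into M sub-vectors of D/M dimensions."""
    step = len(vector) // M
    return [vector[i * step:(i + 1) * step] for i in range(M)]


def train_codebooks(vectors, M, n_centroids):
    """Run k-Means once per subspace; returns one codebook per subspace."""
    subspaces = zip(*(split(v, M) for v in vectors))  # subspace j -> all v_ij
    return [kmeans(list(sub), n_centroids, seed=j)
            for j, sub in enumerate(subspaces)]


def encode(vector, codebooks):
    """PQ code: index of the nearest centroid in each subspace."""
    return [
        min(range(len(book)), key=lambda c: sq_dist(sub, book[c]))
        for sub, book in zip(split(vector, len(codebooks)), codebooks)
    ]


def decode(code, codebooks):
    """Concatenate the indexed centroids back into a D-dimensional vector."""
    return [x for idx, book in zip(code, codebooks) for x in book[idx]]


rng = random.Random(42)
D, M = 6, 2  # toy sizes: two subspaces of three dimensions each
vectors = [[rng.random() for _ in range(D)] for _ in range(50)]
codebooks = train_codebooks(vectors, M, n_centroids=4)
code = encode(vectors[0], codebooks)   # M small integers (storage stage)
approx = decode(code, codebooks)       # D floats (usage stage)
print(code, approx)
```

Only `code` needs to be stored per vector; the codebooks are shared by all vectors, which is where the compression comes from.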
3️⃣ How $\text{IVF+PQ}$ works
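- Offline indexing stage:
  - Build the $\text{IVF}$: use $k\text{-Means}$ to partition the original vector set into $n$ clusters (that is, $n$ centroids)
  - Compress within clusters: apply $\text{PQ}$ compression to each cluster, replacing every vector in the cluster with its centroid indices
- Online query stage:
  - $\text{IVF}$ part: compute the distance between the query $q$ and every cluster centroid, then select all vectors in the $n_{\text{probe}}$ nearest clusters
  - $\text{PQ}$ part: restore the selected vectors from their centroid indices $\to$ compute each one's distance to $q$ by summing over the subspaces, $\displaystyle\text{dist}(q, v_j) \approx \sum_{i=1}^M \text{dist}\left(q_i, c_{ji}\right)$, where $q_i$ is the $i$-th sub-vector of $q$ and $c_{ji}$ is the centroid that replaced the $i$-th sub-vector of $v_j$ (the sum equals the distance to the restored vector exactly when $\text{dist}$ is the squared Euclidean distance) $\to$ return the nearest neighbors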
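A small sketch of the online query stage over a hand-built index. Everything in it is hypothetical illustration data: the coarse centroids, codebooks, inverted lists, and the function name `search`. It makes two simplifying assumptions: one $\text{PQ}$ codebook set is shared by all clusters, and raw vectors (not residuals) are encoded. Instead of restoring each vector explicitly, it precomputes a table of $\text{dist}(q_i, c)$ for every centroid and sums table entries by code (commonly called asymmetric distance computation), which gives the same value as decoding first when squared Euclidean distance is used.

```python
def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical index with D = 4, M = 2, n = 3 coarse clusters.
coarse_centroids = [[0.5] * 4, [5.5] * 4, [9.5] * 4]
subspace_centroids = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0],
                      [6.0, 6.0], [9.0, 9.0], [10.0, 10.0]]
codebooks = [subspace_centroids, subspace_centroids]  # one per subspace
# Inverted lists built offline: cluster id -> [(vector id, PQ code)].
inverted_lists = {
    0: [(0, [0, 0]), (1, [0, 1]), (2, [1, 1])],
    1: [(3, [2, 2]), (4, [3, 2])],
    2: [(5, [4, 5])],
}


def search(q, n_probe, top_k):
    """Return the ids of the top_k approximate nearest neighbors of q."""
    # IVF part: rank clusters by distance from q to their centroids.
    order = sorted(range(len(coarse_centroids)),
                   key=lambda c: sq_dist(q, coarse_centroids[c]))
    probed = order[:n_probe]
    # PQ part: table[i][c] = dist(q_i, centroid c of subspace i).
    M = len(codebooks)
    step = len(q) // M
    table = [
        [sq_dist(q[i * step:(i + 1) * step], c) for c in codebooks[i]]
        for i in range(M)
    ]
    # dist(q, v_j) ≈ sum_i dist(q_i, c_ji), looked up through v_j's code.
    scored = [
        (sum(table[i][idx] for i, idx in enumerate(code)), vec_id)
        for cluster in probed
        for vec_id, code in inverted_lists[cluster]
    ]
    return [vec_id for _, vec_id in sorted(scored)[:top_k]]


q = [5.8, 6.1, 5.2, 4.9]
print(search(q, n_probe=2, top_k=2))  # [4, 3]
```

Raising `n_probe` scans more clusters and so trades query speed for recall; with `n_probe=1` only cluster 1 is scanned here.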