Given a vector $f$ in a Hilbert space and a set of vectors $D=\{x_1,\dots,x_n\}$ called a dictionary, with $\|x_i\|=1$, we want to approximate $f$ as closely as possible by a combination of elements of $D$ — equivalently, to decompose $f$ into a combination of elements of $D$.
Matching Pursuit
The idea of MP [1] is to build the approximation greedily and incrementally. When approximating with a single atom $x_i$, MP takes only the orthogonal projection of $f$ onto $x_i$, so $f$ splits as:

$$\begin{aligned} f &= \hat f + Rf \\ &= \langle f, x_i\rangle x_i + Rf \end{aligned}$$

where $\hat f$ is the approximation, $\langle\cdot,\cdot\rangle$ is the inner product, and $Rf$ is the residual, i.e. the error between $\hat f$ and $f$.
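As a quick sanity check (a minimal numpy sketch with made-up data, not from [1]), a single projection step leaves a residual orthogonal to the chosen atom, with the norms satisfying Pythagoras:

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.standard_normal(8)
x_i = rng.standard_normal(8)
x_i /= np.linalg.norm(x_i)      # dictionary atoms are unit-norm

f_hat = np.dot(f, x_i) * x_i    # <f, x_i> x_i, projection of f onto x_i
Rf = f - f_hat                  # residual

assert abs(np.dot(Rf, x_i)) < 1e-10                                 # Rf ⊥ x_i
assert abs(np.dot(f, x_i)**2 + np.dot(Rf, Rf) - np.dot(f, f)) < 1e-10  # Pythagoras
```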

For the best single-step approximation, one should pick the $x_i$ that maximises $|\langle f, x_i\rangle|$. As noted in [1], this condition sometimes has to be relaxed to
$$|\langle f, x_i\rangle| \geq \alpha \cdot \sup_j |\langle f, x_j\rangle|, \quad 0<\alpha<1$$
Having chosen the optimal $x_i$, the residual $Rf$ can in turn be decomposed over $D$:
$$Rf = \langle Rf, x_{i'}\rangle x_{i'} + R(Rf)$$
so the original approximation is refined to:
$$\begin{aligned} f &= \langle f,x_i\rangle x_i + Rf \\ &= \underbrace{\langle f,x_i\rangle x_i + \langle Rf,x_{i'}\rangle x_{i'}}_{\hat f} + R(Rf) \end{aligned}$$
By Pythagoras,
$$\|Rf\|^2 = |\langle Rf, x_{i'}\rangle|^2 + \|R(Rf)\|^2 \geq \|R(Rf)\|^2$$
i.e. the approximation error does not increase. This yields an iterative algorithm:
$$f \approx \hat f_m = \sum_{i=1}^m \langle R_{i-1}f, x_{\sigma_i}\rangle x_{\sigma_i}, \qquad f = \hat f_m + R_m f$$
where $x_{\sigma_i}$ is the atom selected at step $i$, $R_i f$ is the residual after step $i$, and $R_0 f = f$.
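The iteration above fits in a few lines of numpy (a hypothetical `mp` function for illustration, not the implementation from [1]); note the invariant $f = \hat f + Rf$ holds at every step:

```python
import numpy as np

def mp(D, f, m):
    """Matching Pursuit: greedily pick the atom with the largest |<Rf, x>|.
    D: [n, d] dictionary with unit-norm rows; f: [d]; m: number of steps."""
    Rf = f.copy()
    coef = np.zeros(D.shape[0])
    for _ in range(m):
        scores = D @ Rf                # <Rf, x_j> for every atom
        j = np.argmax(np.abs(scores))
        coef[j] += scores[j]           # atoms may be selected more than once
        Rf = Rf - scores[j] * D[j]     # subtract the projection onto x_j
    return coef, Rf

rng = np.random.default_rng(0)
D = rng.standard_normal((20, 5))
D /= np.linalg.norm(D, axis=1, keepdims=True)
f = rng.standard_normal(5)
coef, Rf = mp(D, f, 50)
assert np.linalg.norm(Rf) < np.linalg.norm(f)   # residual shrinks
assert np.allclose(D.T @ coef + Rf, f)          # f = f_hat + Rf at every step
```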
Let $V=\overline{\mathrm{span}\{x_1,\dots,x_n\}}$ be the closed linear span of the dictionary vectors. When $V=H$, $f$ is recoverable from $D$ in the limit:
$$\lim_{m\to+\infty}\hat f_m = f, \qquad \lim_{m\to+\infty} R_m f = 0$$
Note that in MP each $x_i$ may be selected multiple times.
Orthogonal Matching Pursuit
OMP [2] points out that MP only guarantees $R_i f \perp x_{\sigma_i}$, so convergence may take many iterations; with a limited iteration budget $m$, the MP result is not optimal.
OMP adds a stronger constraint on $R_i f$: it must be orthogonal to all selected atoms $x_{\sigma_j}\ (j=1,\dots,i)$ (full backward orthogonality).
Suppose the decomposition at step $k$ is:
$$f=\sum_{i=1}^k a^k_i x_{\sigma_i} + R_k f \tag{1}$$
where the $a^k_i$ are the combination coefficients at step $k$, and $R_k f$ satisfies full backward orthogonality: $R_k f \perp x_{\sigma_j},\ j=1,\dots,k$. At step $k+1$, select $x_{\sigma_{k+1}}$ as in MP; the new decomposition is:
$$f=\sum_{i=1}^{k+1} a^{k+1}_i x_{\sigma_i} + R_{k+1} f \tag{2}$$
with $R_{k+1} f$ again required to satisfy full backward orthogonality. To determine the coefficients $a^{k+1}_i$, first decompose $x_{\sigma_{k+1}}$ itself over $\{x_{\sigma_1},\dots,x_{\sigma_k}\}$:
$$x_{\sigma_{k+1}}=\sum_{i=1}^k b^k_i x_{\sigma_i} + \gamma_k \tag{3}$$
where $b^k_i$ and $\gamma_k$ are computed by some auxiliary procedure. $\gamma_k$ is the component of $x_{\sigma_{k+1}}$ that cannot be expressed by $\{x_{\sigma_1},\dots,x_{\sigma_k}\}$, hence $\gamma_k \perp x_{\sigma_j},\ j=1,\dots,k$. Substituting (3) into (2) and comparing coefficients with (1):
$$\begin{aligned} f &= \underbrace{\sum_{i=1}^k a^k_i x_{\sigma_i}}_{1} + \underbrace{R_k f}_{2} \\ &= \sum_{i=1}^{k} a^{k+1}_i x_{\sigma_i} + a^{k+1}_{k+1} x_{\sigma_{k+1}} + R_{k+1} f \\ &= \sum_{i=1}^{k} a^{k+1}_i x_{\sigma_i} + a^{k+1}_{k+1}\left[\sum_{j=1}^k b^k_j x_{\sigma_j} + \gamma_k\right] + R_{k+1} f \\ &= \underbrace{\sum_{i=1}^{k}\left[a^{k+1}_i + a^{k+1}_{k+1}\cdot b^k_i\right] x_{\sigma_i}}_{1} + \underbrace{a^{k+1}_{k+1}\gamma_k + R_{k+1} f}_{2} \end{aligned}$$
Therefore:
$$a^{k+1}_i = a^k_i - a^{k+1}_{k+1}\cdot b^k_i, \qquad R_{k+1} f = R_k f - a^{k+1}_{k+1}\cdot\gamma_k$$
It remains to determine $a^{k+1}_{k+1}$. From the constraint $R_{k+1} f \perp x_{\sigma_{k+1}}$:
$$\begin{aligned} 0 &= \langle R_{k+1} f, x_{\sigma_{k+1}}\rangle \\ &= \langle R_k f, x_{\sigma_{k+1}}\rangle - a^{k+1}_{k+1}\cdot\langle \gamma_k, x_{\sigma_{k+1}}\rangle \end{aligned}$$
so $a^{k+1}_{k+1} = \dfrac{\langle R_k f, x_{\sigma_{k+1}}\rangle}{\langle \gamma_k, x_{\sigma_{k+1}}\rangle}$.
This enforces $R_{k+1} f \perp x_{\sigma_{k+1}}$, and since $R_k f \perp x_{\sigma_j}$ and $\gamma_k \perp x_{\sigma_j}$, we also have $R_{k+1} f = R_k f - a^{k+1}_{k+1}\cdot\gamma_k \perp x_{\sigma_j},\ j=1,\dots,k$, so $R_{k+1} f$ again satisfies full backward orthogonality.
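Full backward orthogonality is exactly the normal-equations condition of a least-squares fit over the selected atoms, so in practice each OMP step can simply re-solve the coefficients with `lstsq` instead of running the recursion above explicitly (a minimal numpy sketch, not the exact recursion of [2]):

```python
import numpy as np

def omp(D, f, m):
    """Orthogonal Matching Pursuit.
    D: [n, d] dictionary with unit-norm rows; f: [d]; m: max # of atoms."""
    Rf = f.copy()
    active = []
    a = np.zeros(0)
    for _ in range(m):
        scores = D @ Rf
        j = int(np.argmax(np.abs(scores)))
        if j in active:          # Rf is already ⊥ to the active atoms
            break
        active.append(j)
        A = D[active].T          # [d, k], selected atoms as columns
        # least squares over the active set == full backward orthogonality
        a, *_ = np.linalg.lstsq(A, f, rcond=None)
        Rf = f - A @ a
    return active, a, Rf

rng = np.random.default_rng(1)
D = rng.standard_normal((20, 8))
D /= np.linalg.norm(D, axis=1, keepdims=True)
f = rng.standard_normal(8)
active, a, Rf = omp(D, f, 4)
# the residual is orthogonal to every selected atom
assert np.max(np.abs(D[np.array(active)] @ Rf)) < 1e-8
```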
Non-negative Orthogonal Matching Pursuit
NOMP [3] constrains the combination coefficients to be non-negative, which reportedly improves stability.
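The non-negative coefficient solve at the heart of each NOMP step is a standard NNLS problem; a tiny demo with made-up data (here the unconstrained optimum is already non-negative, so NNLS recovers it exactly):

```python
import numpy as np
from scipy.optimize import nnls

# toy data: columns of A are dictionary atoms, b is the target
rng = np.random.default_rng(0)
A = rng.random((8, 3))
b = A @ np.array([1.0, 0.0, 2.0])   # non-negative ground-truth weights

x, rnorm = nnls(A, b)               # solve min ||Ax - b|| s.t. x >= 0
assert np.all(x >= 0)
assert rnorm < 1e-8
assert np.allclose(x, [1.0, 0.0, 2.0], atol=1e-6)
```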
[4] is a reference implementation. Below is my rewritten version; I have cross-checked it against [4] and the results match.
# nomp.py
import numpy as np
import scipy.linalg
import scipy.optimize
from sklearn.preprocessing import normalize

"""Non-negative Orthogonal Matching Pursuit
ref:
- Stable and efficient representation learning with nonnegativity constraints
- Supplementary Materials: Compositional Zero-Shot Learning via Fine-Grained Dense Feature Composition
- https://github.com/hbdat/neurIPS20_CompositionZSL/blob/main/omp/omp.py#L43
"""

def to_pos(X):
    """X -> [max(0, X), max(0, -X)]
    Map vectors with negative entries to all-non-negative ones,
    using twice the dimensions to compensate for the negative part.
    """
    X_inv = - X
    X_pos = np.concatenate([
        np.where(X > 0, X, 0),
        np.where(X_inv > 0, X_inv, 0),
    ], axis=1)
    return X_pos

def cos(A, B=None):
    """pairwise cosine similarity between rows"""
    An = normalize(A, norm='l2', axis=1)
    if (B is None) or (B is A):
        return np.dot(An, An.T)
    Bn = normalize(B, norm='l2', axis=1)
    return np.dot(An, Bn.T)

def spherical_kmeans(X, k=-1):
    """select a subset for the dictionary using spherical k-means
    X: [n, d]
    k: # of clusters
    - https://github.com/jasonlaska/spherecluster
    """
    from spherecluster import SphericalKMeans
    n = X.shape[0]
    if (k <= 0) or (k > n):
        k = max(1, n // 2)
    skm = SphericalKMeans(n_clusters=k)
    skm.fit(X)
    return skm.cluster_centers_

def norm2(x):
    """RMS norm of a [d] vector"""
    return np.linalg.norm(x) / np.sqrt(len(x))

def NOMP(D, f, k=-1, non_neg=True, max_iter=200, tol_res=1e-3, tol_coef=1e-12, verbose=False):
    """Non-negative Orthogonal Matching Pursuit
    Input:
        D: [m, d], dictionary (assumed to have been l2-normalised)
        f: [d], feature to decompose
        k: max # of selected components from the dictionary,
            defaults to the whole dictionary size
        non_neg: enforce non-negative coefficients
        tol_res: residual tolerance,
            exit if ||Rf|| < tol_res * ||f|| (error is small enough)
        tol_coef: score threshold,
            exit if score_i < tol_coef * ||f|| for all i
    Output:
        active: [n] (n <= k), ids of the selected components in the dictionary
        coef: [m], composition weights (non-zero only on selected components)
        f_hat: [d], approximation of f
        Rf: [d], residual vector
    """
    assert (2 == D.ndim) and (1 == f.ndim)
    assert D.shape[1] == f.shape[0], "dimension mismatch"
    m = D.shape[0]
    if (k <= 0) or (k > m):
        k = m
    f_norm = norm2(f)
    if verbose:
        print("initial residual:", f_norm)
    # tol_res = tol_res * f_norm
    tol_coef = tol_coef * f_norm
    Rf = f  # (initial) residual vector
    D_col = D.T  # [d, m], atoms arranged as column vectors
    active = []  # ids of the selected components
    coef = np.zeros([m], dtype=np.float32)
    # degenerate solution: f is (numerically) zero
    if f_norm < tol_res:
        return active, coef, np.zeros_like(f), Rf
    if non_neg:
        get_candidate = lambda score: np.argmax(score)
        calc_coef = lambda _D, _f: scipy.optimize.nnls(_D, _f)[0]
    else:
        get_candidate = lambda score: np.argmax(np.abs(score))
        calc_coef = lambda _D, _f: scipy.linalg.lstsq(_D, _f)[0]
    for _it in range(max_iter):
        # cos() l2-normalises both sides; since D is already unit-norm this
        # only rescales all scores by ||Rf||, which does not change the argmax
        score = cos(Rf[np.newaxis, :], D)[0]  # [m]
        d_id = get_candidate(score)
        max_score = score[d_id]
        if max_score < tol_coef:
            break
        if d_id not in active:
            active.append(d_id)
        # re-solve the coefficients over all active atoms (OMP-style refit)
        _coef = calc_coef(D_col[:, active], f)
        coef[active] = _coef
        f_hat = D_col[:, active].dot(_coef).flatten()
        Rf = f - f_hat
        err_rate = norm2(Rf) / f_norm
        if err_rate < tol_res:
            break
        if len(active) >= k:
            break
    f_hat = D_col[:, active].dot(coef[active]).flatten()
    Rf = f - f_hat
    err = norm2(Rf)
    err_rate = err / f_norm
    if verbose:
        print("final residual: {:.4f}, error rate: {:.4f}".format(err, err_rate))
    return active, coef, f_hat, Rf
