The paper can be found at
http://arxiv.org/abs/1403.6095
1. Goal
In this paper, the authors are concerned with multi-group classification via linear discriminant analysis (LDA) and want to estimate the $G-1$ canonical vectors simultaneously rather than in a sequential fashion.
2. Method
2.1 Ratio-trace version of LDA
Suppose $\Sigma_b \in \mathbb{R}^{p \times p}$ and $\Sigma_w \in \mathbb{R}^{p \times p}$ are the population between-group and within-group covariance matrices. Note that the ratio-trace version of linear discriminant analysis deals with the following optimization problem:

$$\max_{V \in \mathbb{R}^{p \times (G-1)}} \operatorname{tr}\left[(V^T \Sigma_w V)^{-1} V^T \Sigma_b V\right],$$

which can be solved via the generalized eigendecomposition of $\Sigma_w^{-1} \Sigma_b$.
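As a quick sanity check (my own illustration on synthetic covariances, not code from the paper), the sketch below uses `scipy.linalg.eigh(Sb, Sw)` to solve the generalized eigenproblem $\Sigma_b v = \lambda \Sigma_w v$ and confirms that the top $G-1$ generalized eigenvectors achieve a higher ratio-trace value than a random projection:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
p, G = 5, 3

# synthetic positive definite within-group and rank-(G-1) between-group covariances
A = rng.standard_normal((p, p))
Sw = A @ A.T + p * np.eye(p)
M = rng.standard_normal((p, G - 1))
Sb = M @ M.T

# generalized eigendecomposition Sb v = lambda Sw v (eigenvalues in ascending order)
evals, evecs = eigh(Sb, Sw)
V = evecs[:, -(G - 1):]          # top G-1 generalized eigenvectors

def ratio_trace(V, Sw, Sb):
    """tr[(V' Sw V)^{-1} V' Sb V]"""
    return np.trace(np.linalg.solve(V.T @ Sw @ V, V.T @ Sb @ V))

V_rand = rng.standard_normal((p, G - 1))
print(ratio_trace(V, Sw, Sb) >= ratio_trace(V_rand, Sw, Sb))   # True
```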
Proof: Taking the derivative with respect to $V$ and setting it to zero (how to get this formula can be found at http://blog.youkuaiyun.com/comeyan/article/details/50514610), we have

$$\Sigma_b V = \Sigma_w V (V^T \Sigma_w V)^{-1} (V^T \Sigma_b V).$$
Note that:
- The third equivalence holds because we can simultaneously diagonalize $V^T \Sigma_w V$ and $V^T \Sigma_b V$ using a matrix $B$, which is to say ($\Theta$ is a diagonal matrix)
$$B^{-1} V^T \Sigma_w V B = I, \qquad B^{-1} V^T \Sigma_b V B = \Theta,$$
so
$$V^T \Sigma_w V = I, \qquad V^T \Sigma_b V = B \Theta B^{-1}.$$
How to diagonalize two matrices simultaneously can be found at http://blog.youkuaiyun.com/comeyan/article/details/50521034
- The fourth equivalence tells us that $VB$ contains the generalized eigenvectors of $\Sigma_w^{-1} \Sigma_b$.
Summary:
From the above analysis, we know that solving the ratio-trace version of LDA amounts to solving a generalized eigendecomposition problem, which can be done by diagonalizing the two matrices simultaneously.
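To make the constructive step concrete, here is a minimal sketch (my own illustration with synthetic matrices): whitening by $\Sigma_w^{-1/2}$ and then taking an ordinary symmetric eigendecomposition diagonalizes both matrices at once.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 4
A = rng.standard_normal((p, p))
Sw = A @ A.T + p * np.eye(p)     # positive definite within-group covariance
M = rng.standard_normal((p, 2))
Sb = M @ M.T                     # low-rank between-group covariance

# build Sw^{-1/2} from the eigendecomposition of Sw
d, U = np.linalg.eigh(Sw)
Sw_inv_half = U @ np.diag(d ** -0.5) @ U.T

# eigendecompose the whitened between-group matrix
lam, Q = np.linalg.eigh(Sw_inv_half @ Sb @ Sw_inv_half)
V = Sw_inv_half @ Q              # columns are generalized eigenvectors

print(np.allclose(V.T @ Sw @ V, np.eye(p)))       # True: identity
print(np.allclose(V.T @ Sb @ V, np.diag(lam)))    # True: diagonal
```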
2.2 Model
Since the ratio-trace version of LDA reduces to a generalized eigendecomposition of $\Sigma_w^{-1} \Sigma_b$, and eigenvectors are unique only up to normalization, the authors take advantage of the uniqueness of the eigenspace to define a scale-invariant classification rule.
Notations to be used:
- $G$ is the total number of groups.
- $\operatorname{rank}(\Sigma_b) = G - 1$.
- $\mu_i,\ i = 1, 2, \cdots, G$ is the mean of each group.
- $\pi_i,\ i = 1, 2, \cdots, G$ is the prior probability of each group.

The goal of this paper is to find the $G-1$ eigenvectors $\Phi$ corresponding to the non-zero eigenvalues of $\Sigma_w^{-1} \Sigma_b$.
Note that these $G-1$ vectors can be written in a closed form.
Proposition 1 (population version). The following decomposition holds: $\Sigma_b = \Delta \Delta^T$, where for $r = 1, 2, \cdots, G-1$ the $r$-th column of $\Delta$ is a prior-weighted orthogonal contrast of the group means (see the paper for the exact expression).
Proposition 2 (sample version). The following decomposition holds: $\hat{\Sigma}_b = D D^T$, where for $r = 1, 2, \cdots, G-1$ the $r$-th column of $D$ is the analogous contrast built from the sample group means (again, see the paper for the exact expression).
The proof of the sample version (using orthogonal contrasts of unbalanced data) can be found at
http://blog.youkuaiyun.com/COMEYAN/article/details/50521276
Orthogonal contrasts can be found in
Formulating $\Sigma_b$ as this low-rank decomposition is attractive because it has a closed form and an intuitive interpretation in terms of the differences between the group means.
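To make this concrete, the sketch below builds one standard such decomposition, weighted Helmert-type contrasts, on synthetic means and priors and verifies $\Sigma_b = \Delta \Delta^T$ numerically; this is my own illustration, and the paper's exact column normalization may differ.

```python
import numpy as np

rng = np.random.default_rng(2)
p, G = 6, 4
mu = rng.standard_normal((G, p))       # group means, one per row
pi = rng.dirichlet(np.ones(G))         # group prior probabilities

mu_bar = pi @ mu                       # overall mean
Sb = sum(pi[i] * np.outer(mu[i] - mu_bar, mu[i] - mu_bar) for i in range(G))

# column r: weighted contrast of the (r+1)-th mean against the
# prior-weighted average of the first r means
Delta = np.zeros((p, G - 1))
for r in range(1, G):
    f_r, f_r1 = pi[:r].sum(), pi[:r + 1].sum()
    avg = (pi[:r] @ mu[:r]) / f_r
    Delta[:, r - 1] = np.sqrt(pi[r] * f_r / f_r1) * (avg - mu[r])

print(np.allclose(Sb, Delta @ Delta.T))   # True
```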
The paper then gives the closed form of the generalized eigenvectors $\Phi$.
Proposition 3. Define $\Delta$ and $D$ as in the last two propositions. There exists a matrix $R \in \mathcal{O}_{G-1}$ linking the canonical vectors $\Phi$ to $\Sigma_w^{-1} \Delta$, so the two span the same column space. Moreover, if $\hat{\Sigma}_w$ is nonsingular, there exists a matrix $R \in \mathcal{O}_{G-1}$ such that the analogous relation holds between the sample canonical vectors and $\hat{\Sigma}_w^{-1} D$.
If we use the Mahalanobis distance for classification, the classification function is invariant to this orthogonal factor, so we can work with a simple projection matrix instead.
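A toy check of this invariance (my own illustration; the score below, $(x - \mu_g)^T V (V^T \Sigma_w V)^{-1} V^T (x - \mu_g)$, is a standard Mahalanobis-type discriminant score): replacing $V$ by $VR$ for an orthogonal $R$ leaves the score unchanged.

```python
import numpy as np

rng = np.random.default_rng(3)
p, q = 5, 2
A = rng.standard_normal((p, p))
Sw = A @ A.T + p * np.eye(p)
Phi = rng.standard_normal((p, q))            # stand-in canonical vectors
x, mu_g = rng.standard_normal(p), rng.standard_normal(p)

def maha_score(x, mu, V, Sw):
    """Mahalanobis-type distance of x to mu in the space spanned by V."""
    z = V.T @ (x - mu)
    return z @ np.linalg.solve(V.T @ Sw @ V, z)

Q, _ = np.linalg.qr(rng.standard_normal((q, q)))   # a random orthogonal matrix
print(np.allclose(maha_score(x, mu_g, Phi, Sw),
                  maha_score(x, mu_g, Phi @ Q, Sw)))   # True
```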
Now it is sufficient to estimate $\tilde{\Phi} = \Sigma_w^{-1} \Delta$, which can be defined as

$$\tilde{\Phi} = \arg\min_{V \in \mathbb{R}^{p \times (G-1)}} \left\{ \tfrac{1}{2} \operatorname{tr}(V^T \Sigma_w V) - \operatorname{tr}(\Delta^T V) \right\}.$$

Sample version (model step 1):

$$\hat{V} = \arg\min_{V} \left\{ \tfrac{1}{2} \operatorname{tr}(V^T \hat{\Sigma}_w V) - \operatorname{tr}(D^T V) \right\}.$$
To get a sparse solution, add a group-lasso penalty on the rows of $V$ (model step 2):

$$\hat{V} = \arg\min_{V} \left\{ \tfrac{1}{2} \operatorname{tr}(V^T \hat{\Sigma}_w V) - \operatorname{tr}(D^T V) + \lambda \sum_{j=1}^{p} \|v_j\|_2 \right\},$$

where $v_j$ is the $j$-th row of $V$, so an entire variable is kept or dropped across all $G-1$ canonical vectors at once.
But the objective function can be unbounded below when $\hat{\Sigma}_w$ is singular, so regularization of $\hat{\Sigma}_w$ is needed: $\tilde{\Sigma}_w = \hat{\Sigma}_w + \rho I$ (model step 3), giving

$$\hat{V} = \arg\min_{V} \left\{ \tfrac{1}{2} \operatorname{tr}(V^T \tilde{\Sigma}_w V) - \operatorname{tr}(D^T V) + \lambda \sum_{j=1}^{p} \|v_j\|_2 \right\}.$$
But this amounts to pushing $\hat{V}$ toward $\frac{1}{\rho} D$ when $\rho$ dominates, which is not what we want. So replacing $\hat{\Sigma}_w$ with the total covariance matrix $\hat{\Sigma}_t$, we have the final model (model step 4; a numerical sketch follows the list below):

$$\hat{V} = \arg\min_{V} \left\{ \tfrac{1}{2} \operatorname{tr}(V^T \hat{\Sigma}_t V) - \operatorname{tr}(D^T V) + \lambda \sum_{j=1}^{p} \|v_j\|_2 \right\}.$$
- the first term minimizes the within-group variability,
- the second term controls the level of the between-group variability,
- the third term induces sparsity.
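As a sketch (synthetic `St` and `D`, not the paper's code): with $\lambda = 0$ the gradient of the objective is $\hat{\Sigma}_t V - D$, so the unpenalized minimizer is $\hat{\Sigma}_t^{-1} D$, which the snippet below checks numerically.

```python
import numpy as np

rng = np.random.default_rng(4)
p, q = 6, 2
A = rng.standard_normal((p, p))
St = A @ A.T + p * np.eye(p)     # stand-in for the total covariance matrix
D = rng.standard_normal((p, q))  # stand-in for the between-group factor

def objective(V, St, D, lam):
    penalty = lam * np.linalg.norm(V, axis=1).sum()   # group lasso on rows
    return 0.5 * np.trace(V.T @ St @ V) - np.trace(D.T @ V) + penalty

# unpenalized minimizer: set the gradient St V - D to zero
V_star = np.linalg.solve(St, D)
V_other = V_star + 0.1 * rng.standard_normal((p, q))
print(objective(V_star, St, D, 0.0) < objective(V_other, St, D, 0.0))   # True
```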
3. Theory
The estimator has a model selection (variable selection) property, and the misclassification error of the resulting rule coincides with that of the population rule.
4. Algorithm
A block coordinate descent algorithm is used to compute the solution, as it takes advantage of warm starts when solving over a range of tuning parameters and is one of the fastest algorithms for smooth losses with separable regularizers.
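Here is a sketch of one plausible block coordinate descent implementation for the final model, cycling over the rows of $V$ with a closed-form group soft-thresholding update; this is my reconstruction from the objective above, not the authors' code.

```python
import numpy as np

def group_soft_threshold(r, lam, s_jj):
    """Minimizer of (s_jj/2)||v||^2 - r.v + lam*||v||_2 over v."""
    nrm = np.linalg.norm(r)
    if nrm <= lam:
        return np.zeros_like(r)
    return (1.0 - lam / nrm) * r / s_jj

def bcd_sparse_lda(S, D, lam, V0=None, n_iter=200, tol=1e-8):
    """Block coordinate descent for
    0.5 * tr(V'SV) - tr(D'V) + lam * sum_j ||row_j(V)||_2.
    Pass the solution at a nearby lam as V0 to warm-start a tuning path."""
    p, q = D.shape
    V = np.zeros((p, q)) if V0 is None else V0.copy()
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            # partial residual: remove row j's own contribution from S @ V
            r = D[j] - S[j] @ V + S[j, j] * V[j]
            v_new = group_soft_threshold(r, lam, S[j, j])
            max_change = max(max_change, np.max(np.abs(v_new - V[j])))
            V[j] = v_new
        if max_change < tol:
            break
    return V

# usage with the St and D from the previous sketch:
# V_hat = bcd_sparse_lda(St, D, lam=0.1)
```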