The paper can be found at
http://arxiv.org/abs/1403.6095
1. Goal
In this paper, the authors are concerned with multi-group classification via linear discriminant analysis (LDA) and want to estimate the $G-1$ canonical vectors simultaneously rather than in a sequential fashion.
2. Method
2.1 Ratio-trace version of LDA
Suppose $\Sigma_b \in \mathbb{R}^{p \times p}$ and $\Sigma_w \in \mathbb{R}^{p \times p}$ are the population between-group and within-group covariance matrices. Note that the ratio-trace version of linear discriminant analysis deals with the following optimization problem:

$$\max_{V \in \mathbb{R}^{p \times (G-1)}} \operatorname{tr}\left[(V^T \Sigma_w V)^{-1} V^T \Sigma_b V\right],$$

which can be solved via the generalized eigendecomposition of $\Sigma_w^{-1} \Sigma_b$.
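As a quick sanity check (my own illustration on synthetic covariances, not code from the paper), the sketch below uses `scipy.linalg.eigh(Sb, Sw)` to solve the generalized eigenproblem $\Sigma_b v = \lambda \Sigma_w v$ and confirms that the top $G-1$ generalized eigenvectors achieve a higher ratio-trace value than a random projection:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
p, G = 5, 3

# synthetic positive definite within-group and rank-(G-1) between-group covariances
A = rng.standard_normal((p, p))
Sw = A @ A.T + p * np.eye(p)
M = rng.standard_normal((p, G - 1))
Sb = M @ M.T

# generalized eigendecomposition Sb v = lambda Sw v (eigenvalues in ascending order)
evals, evecs = eigh(Sb, Sw)
V = evecs[:, -(G - 1):]          # top G-1 generalized eigenvectors

def ratio_trace(V, Sw, Sb):
    """tr[(V' Sw V)^{-1} V' Sb V]"""
    return np.trace(np.linalg.solve(V.T @ Sw @ V, V.T @ Sb @ V))

V_rand = rng.standard_normal((p, G - 1))
print(ratio_trace(V, Sw, Sb) >= ratio_trace(V_rand, Sw, Sb))   # True
```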
Proof: Taking the derivative with respect to $V$ and setting it to zero (how to get this formula can be found at http://blog.youkuaiyun.com/comeyan/article/details/50514610), we have

$$\Sigma_b V = \Sigma_w V (V^T \Sigma_w V)^{-1} (V^T \Sigma_b V).$$
Note that:
- The third equivalence holds because we can simultaneously diagonalize $V^T \Sigma_w V$ and $V^T \Sigma_b V$ using a matrix $B$, which is to say ($\Theta$ is a diagonal matrix)
$$B^{-1} V^T \Sigma_w V B = I, \qquad B^{-1} V^T \Sigma_b V B = \Theta,$$
so
$$V^T \Sigma_w V = I, \qquad V^T \Sigma_b V = B \Theta B^{-1}.$$
How to diagonalize two matrices simultaneously can be found at http://blog.youkuaiyun.com/comeyan/article/details/50521034
- The fourth equivalence tells us that $VB$ contains the generalized eigenvectors of $\Sigma_w^{-1} \Sigma_b$.
Summary:
From the above analysis, we know that solving the ratio-trace version of LDA amounts to solving a generalized eigendecomposition problem, which can be done by diagonalizing the two matrices simultaneously.
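To make the constructive step concrete, here is a minimal sketch (my own illustration with synthetic matrices): whitening by $\Sigma_w^{-1/2}$ and then taking an ordinary symmetric eigendecomposition diagonalizes both matrices at once.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 4
A = rng.standard_normal((p, p))
Sw = A @ A.T + p * np.eye(p)     # positive definite within-group covariance
M = rng.standard_normal((p, 2))
Sb = M @ M.T                     # low-rank between-group covariance

# build Sw^{-1/2} from the eigendecomposition of Sw
d, U = np.linalg.eigh(Sw)
Sw_inv_half = U @ np.diag(d ** -0.5) @ U.T

# eigendecompose the whitened between-group matrix
lam, Q = np.linalg.eigh(Sw_inv_half @ Sb @ Sw_inv_half)
V = Sw_inv_half @ Q              # columns are generalized eigenvectors

print(np.allclose(V.T @ Sw @ V, np.eye(p)))       # True: identity
print(np.allclose(V.T @ Sb @ V, np.diag(lam)))    # True: diagonal
```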
2.2 Model
Since the ratio-trace version of LDA reduces to a generalized eigendecomposition of $\Sigma_w^{-1} \Sigma_b$, and eigenvectors are unique only up to normalization, the authors take advantage of the uniqueness of the eigenspace to define a scale-invariant classification rule.
Notations to be used:
- $G$ is the total number of groups.
- $\operatorname{rank}(\Sigma_b) = G - 1$.
- $\mu_i,\ i = 1, 2, \cdots, G$ is the mean of each group.
- $\pi_i,\ i = 1, 2, \cdots, G$ is the prior probability of each group.

The goal of this paper is to find the $G-1$ eigenvectors $\Phi$ corresponding to the non-zero eigenvalues of $\Sigma_w^{-1} \Sigma_b$.
Note that these $G-1$ vectors can be written in a closed form.
Proposition 1 (population version). The following decomposition holds: $\Sigma_b = \Delta \Delta^T$, where for $r = 1, 2, \cdots, G-1$ the $r$-th column of $\Delta$ is a prior-weighted orthogonal contrast of the group means (see the paper for the exact expression).
Proposition 2 (sample version). The following decomposition holds: $\hat{\Sigma}_b = D D^T$, where for $r = 1, 2, \cdots, G-1$ the $r$-th column of $D$ is the analogous contrast built from the sample group means (again, see the paper for the exact expression).
The proof of the sample version (using orthogonal contrasts of unbalanced data) can be found at
http://blog.youkuaiyun.com/COMEYAN/article/details/50521276
Orthogonal contrasts can be found in
Formulating $\Sigma_b$ as this low-rank decomposition is attractive because it has a closed form and an intuitive interpretation in terms of the differences between the group means.
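To make this concrete, the sketch below builds one standard such decomposition, weighted Helmert-type contrasts, on synthetic means and priors and verifies $\Sigma_b = \Delta \Delta^T$ numerically; this is my own illustration, and the paper's exact column normalization may differ.

```python
import numpy as np

rng = np.random.default_rng(2)
p, G = 6, 4
mu = rng.standard_normal((G, p))       # group means, one per row
pi = rng.dirichlet(np.ones(G))         # group prior probabilities

mu_bar = pi @ mu                       # overall mean
Sb = sum(pi[i] * np.outer(mu[i] - mu_bar, mu[i] - mu_bar) for i in range(G))

# column r: weighted contrast of the (r+1)-th mean against the
# prior-weighted average of the first r means
Delta = np.zeros((p, G - 1))
for r in range(1, G):
    f_r, f_r1 = pi[:r].sum(), pi[:r + 1].sum()
    avg = (pi[:r] @ mu[:r]) / f_r
    Delta[:, r - 1] = np.sqrt(pi[r] * f_r / f_r1) * (avg - mu[r])

print(np.allclose(Sb, Delta @ Delta.T))   # True
```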
The paper then gives the closed form of the generalized eigenvectors $\Phi$.
Proposition 3. Define $\Delta$ and $D$ as in the last two propositions. There exists a matrix $R \in \mathcal{O}_{G-1}$ linking the canonical vectors $\Phi$ to $\Sigma_w^{-1} \Delta$, so the two span the same column space. Moreover, if $\hat{\Sigma}_w$ is nonsingular, there exists a matrix $R \in \mathcal{O}_{G-1}$ such that the analogous relation holds between the sample canonical vectors and $\hat{\Sigma}_w^{-1} D$.
If we use the Mahalanobis distance for classification, the classification function is invariant to this orthogonal factor, so we can work with a simple projection matrix instead.
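A toy check of this invariance (my own illustration; the score below, $(x - \mu_g)^T V (V^T \Sigma_w V)^{-1} V^T (x - \mu_g)$, is a standard Mahalanobis-type discriminant score): replacing $V$ by $VR$ for an orthogonal $R$ leaves the score unchanged.

```python
import numpy as np

rng = np.random.default_rng(3)
p, q = 5, 2
A = rng.standard_normal((p, p))
Sw = A @ A.T + p * np.eye(p)
Phi = rng.standard_normal((p, q))            # stand-in canonical vectors
x, mu_g = rng.standard_normal(p), rng.standard_normal(p)

def maha_score(x, mu, V, Sw):
    """Mahalanobis-type distance of x to mu in the space spanned by V."""
    z = V.T @ (x - mu)
    return z @ np.linalg.solve(V.T @ Sw @ V, z)

Q, _ = np.linalg.qr(rng.standard_normal((q, q)))   # a random orthogonal matrix
print(np.allclose(maha_score(x, mu_g, Phi, Sw),
                  maha_score(x, mu_g, Phi @ Q, Sw)))   # True
```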
Now it is sufficient to estimate $\tilde{\Phi} = \Sigma_w^{-1} \Delta$, which can be defined as

$$\tilde{\Phi} = \arg\min_{V \in \mathbb{R}^{p \times (G-1)}} \left\{ \tfrac{1}{2} \operatorname{tr}(V^T \Sigma_w V) - \operatorname{tr}(\Delta^T V) \right\}.$$

Sample version (model step 1):

$$\hat{V} = \arg\min_{V} \left\{ \tfrac{1}{2} \operatorname{tr}(V^T \hat{\Sigma}_w V) - \operatorname{tr}(D^T V) \right\}.$$
To get a sparse solution, add a group-lasso penalty on the rows of $V$ (model step 2):

$$\hat{V} = \arg\min_{V} \left\{ \tfrac{1}{2} \operatorname{tr}(V^T \hat{\Sigma}_w V) - \operatorname{tr}(D^T V) + \lambda \sum_{j=1}^{p} \|v_j\|_2 \right\},$$

where $v_j$ is the $j$-th row of $V$, so an entire variable is kept or dropped across all $G-1$ canonical vectors at once.
But the objective function can be unbounded below when $\hat{\Sigma}_w$ is singular, so regularization of $\hat{\Sigma}_w$ is needed: $\tilde{\Sigma}_w = \hat{\Sigma}_w + \rho I$ (model step 3), giving

$$\hat{V} = \arg\min_{V} \left\{ \tfrac{1}{2} \operatorname{tr}(V^T \tilde{\Sigma}_w V) - \operatorname{tr}(D^T V) + \lambda \sum_{j=1}^{p} \|v_j\|_2 \right\}.$$
But this amounts to pushing $\hat{V}$ toward $\frac{1}{\rho} D$ when $\rho$ dominates, which is not what we want. So replacing $\hat{\Sigma}_w$ with the total covariance matrix $\hat{\Sigma}_t$, we have the final model (model step 4; a numerical sketch follows the list below):

$$\hat{V} = \arg\min_{V} \left\{ \tfrac{1}{2} \operatorname{tr}(V^T \hat{\Sigma}_t V) - \operatorname{tr}(D^T V) + \lambda \sum_{j=1}^{p} \|v_j\|_2 \right\}.$$
- the first term minimizes the within-group variability,
- the second term controls the level of the between-group variability,
- the third term induces sparsity.
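As a sketch (synthetic `St` and `D`, not the paper's code): with $\lambda = 0$ the gradient of the objective is $\hat{\Sigma}_t V - D$, so the unpenalized minimizer is $\hat{\Sigma}_t^{-1} D$, which the snippet below checks numerically.

```python
import numpy as np

rng = np.random.default_rng(4)
p, q = 6, 2
A = rng.standard_normal((p, p))
St = A @ A.T + p * np.eye(p)     # stand-in for the total covariance matrix
D = rng.standard_normal((p, q))  # stand-in for the between-group factor

def objective(V, St, D, lam):
    penalty = lam * np.linalg.norm(V, axis=1).sum()   # group lasso on rows
    return 0.5 * np.trace(V.T @ St @ V) - np.trace(D.T @ V) + penalty

# unpenalized minimizer: set the gradient St V - D to zero
V_star = np.linalg.solve(St, D)
V_other = V_star + 0.1 * rng.standard_normal((p, q))
print(objective(V_star, St, D, 0.0) < objective(V_other, St, D, 0.0))   # True
```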
3. Theory
The estimator has a model selection (variable selection) property, and the misclassification error of the resulting rule coincides with that of the population rule.
4. Algorithm
A block coordinate descent algorithm is used to compute the solution, as it takes advantage of warm starts when solving over a range of tuning parameters and is one of the fastest algorithms for smooth losses with separable regularizers.
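Here is a sketch of one plausible block coordinate descent implementation for the final model, cycling over the rows of $V$ with a closed-form group soft-thresholding update; this is my reconstruction from the objective above, not the authors' code.

```python
import numpy as np

def group_soft_threshold(r, lam, s_jj):
    """Minimizer of (s_jj/2)||v||^2 - r.v + lam*||v||_2 over v."""
    nrm = np.linalg.norm(r)
    if nrm <= lam:
        return np.zeros_like(r)
    return (1.0 - lam / nrm) * r / s_jj

def bcd_sparse_lda(S, D, lam, V0=None, n_iter=200, tol=1e-8):
    """Block coordinate descent for
    0.5 * tr(V'SV) - tr(D'V) + lam * sum_j ||row_j(V)||_2.
    Pass the solution at a nearby lam as V0 to warm-start a tuning path."""
    p, q = D.shape
    V = np.zeros((p, q)) if V0 is None else V0.copy()
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            # partial residual: remove row j's own contribution from S @ V
            r = D[j] - S[j] @ V + S[j, j] * V[j]
            v_new = group_soft_threshold(r, lam, S[j, j])
            max_change = max(max_change, np.max(np.abs(v_new - V[j])))
            V[j] = v_new
        if max_change < tol:
            break
    return V

# usage with the St and D from the previous sketch:
# V_hat = bcd_sparse_lda(St, D, lam=0.1)
```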