原文地址:http://blog.youkuaiyun.com/aalbertini/article/details/6256656
主成分分析 ( Principal Component Analysis , PCA )是一种掌握事物主要矛盾的统计分析方法,它可以从多元事物中解析出主要影响因素,揭示事物的本质,简化复杂的问题。计算主成分的目的是将高维数据投影到较低维空间。给定 n 个变量的 m 个观察值,形成一个 n ′ m 的数据矩阵, n 通常比较大。对于一个由多个变量描述的复杂事物,人们难以认识,那么是否可以抓住事物主要方面进行重点分析呢?如果事物的主要方面刚好体现在几个主要变量上,我们只需要将这几个变量分离出来,进行详细分析。但是,在一般情况下,并不能直接找出这样的关键变量。这时我们可以用原有变量的线性组合来表示事物的主要方面, PCA 就是这样一种分析方法。
PCA 主要 用于数据降维,对于一系列例子的特征组成的多维向量,多维向量里的某些元素本身没有区分性,比如某个元素在所有的例子中都为1,或者与1差距不大,那么这个元素本身就没有区分性,用它做特征来区分,贡献会非常小。所以我们的目的是找那些变化大的元素,即方差大的那些维,而去除掉那些变化不大的维,从而使特征留下的都是“精品”,而且计算量也变小了。 对于一个k维的特征来说,相当于它的每一维特征与其他维都是正交的(相当于在多维坐标系中,坐标轴都是垂直的),那么我们可以变化这些维的坐标系,从而使这个特征在某些维上方差大,而在某些维上方差很小。例如,一个45度倾斜的椭圆,在第一坐标系,如果按照x,y坐标来投影,这些点的x和y的属性很难用于区分他们,因为他们在x,y轴上坐标变化的方差都差不多,我们无法根据这个点的某个x属性来判断这个点是哪个,而如果将坐标轴旋转,以椭圆长轴为x轴,则椭圆在长轴上的分布比较长,方差大,而在短轴上的分布短,方差小,所以可以考虑只保留这些点的长轴属性,来区分椭圆上的点,这样,区分性比x,y轴的方法要好!
所以我们的做法就是求得一个k维特征的投影矩阵,这个投影矩阵可以将特征从高维降到低维。投影矩阵也可以叫做变换矩阵。新的低维特征必须每个维都正交,特征向量都是正交的。通过求样本矩阵的协方差矩阵,然后求出协方差矩阵的特征向量,这些特征向量就可以构成这个投影矩阵了。特征向量的选择取决于协方差矩阵的特征值的大小。
举一个例子:
对于一个训练集,100个对象模板,特征是10维,那么它可以建立一个100*10的矩阵,作为样本。求这个样本的协方差矩阵,得到一个10*10的协方差矩阵,然后求出这个协方差矩阵的特征值和特征向量,应该有10个特征值和特征向量,我们根据特征值的大小,取前四个特征值所对应的特征向量,构成一个10*4的矩阵,这个矩阵就是我们要求的特征矩阵,100*10的样本矩阵乘以这个10*4的特征矩阵,就得到了一个100*4的新的降维之后的样本矩阵,每个特征的维数下降了。
当给定一个测试的特征集之后,比如1*10维的特征,乘以上面得到的10*4的特征矩阵,便可以得到一个1*4的特征,用这个特征去分类。
所以做PCA实际上是求得这个投影矩阵,用高维的特征乘以这个投影矩阵,便可以将高维特征的维数下降到指定的维数。
PCA 的目标是寻找 r ( r<n )个新变量,使它们反映事物的主要特征,压缩原有数据矩阵的规模。每个新变量是原有变量的线性组合,体现原有变量的综合效果,具有一定的实际含义。这 r 个新变量称为“主成分”,它们可以在很大程度上反映原来 n 个变量的影响,并且这些新变量是互不相关的,也是正交的。通过主成分分析,压缩数据空间,将多元数据的特征在低维空间里直观地表示出来。例如,将多个时间点、多个实验条件下的基因表达谱数据( N 维)表示为 3 维空间中的一个点,即将数据的维数从 RN 降到 R3 。
在进行基因表达数据分析时,一个重要问题是确定每个实验数据是否是独立的,如果每次实验数据之间不是独立的,则会影响基因表达数据分析结果的准确性。对于利用基因芯片所检测到的基因表达数据,如果用 PCA 方法进行分析,可以将各个基因作为变量,也可以将实验条件作为变量。当将基因作为变量时,通过分析确定一组“主要基因元素”,它们能够很好地说明基因的特征,解释实验现象;当将实验条件作为变量时,通过分析确定一组“主要实验因素”,它们能够很好地刻画实验条件的特征,解释基因的行为。下面着重考虑以实验条件作为变量的 PCA 分析方法。假设将数据的维数从 R N 降到 R 3 ,具体的 PCA 分析步骤如下:
(1) 第一步计算矩阵 X 的样本的协方差矩阵 S :
(2) 第二步计算协方差矩阵S的本征向量 e1,e2,…,eN的本征值 , i = 1,2,…,N 。本征值按大到小排序: ;
(3)第三步投影数据到本征矢张成的空间之中,这些本征矢相应的本征值为 。现在数据可以在三维空间中展示为云状的点集。
对于 PCA ,确定新变量的个数 r 是一个两难的问题。我们的目标是减小 r ,如果 r 小,则数据的维数低,便于分析,同时也降低了噪声,但可能丢失一些有用的信息。究竟如何确定 r 呢?这需要进一步分析每个主元素对信息的贡献。
令 代表第 i 个特征值,定义第 i 个主元素的贡献率为:
(8-45)
前 r 个主成分的累计贡献率为:
(8-46)
贡献率表示所定义的主成分在整个数据分析中承担的主要意义占多大的比重,当取前 r 个主成分来代替原来全部变量时,累计贡献率的大小反应了这种取代的可靠性,累计贡献率越大,可靠性越大;反之,则可靠性越小。一般要求累计贡献率达到 70% 以上。
经过 PCA 分析,一个多变量的复杂问题被简化为低维空间的简单问题。可以利用这种简化方法进行作图,形象地表示和分析复杂问题。在分析基因表达数据时,可以针对基因作图,也可以针对实验条件作图。前者称为 Q 分析,后者称为 R 分析。
1) 协方差矩阵
2) 特征值与特征向量
3) 贡献率
http://www.cad.zju.edu.cn/home/chenlu/pca.htm
- /*
- * PrincipalComponents.java
- * Copyright (C) 2000 University of Waikato, Hamilton, New Zealand
- *
- */
- package weka.attributeSelection;
- /**
- <!-- globalinfo-start -->
- * Performs a principal components analysis and transformation of the data. Use in conjunction with a Ranker search. Dimensionality reduction is accomplished by choosing enough eigenvectors to account for some percentage of the variance in the original data---default 0.95 (95%). Attribute noise can be filtered by transforming to the PC space, eliminating some of the worst eigenvectors, and then transforming back to the original space.
- * <p/>
- <!-- globalinfo-end -->
- *
- <!-- options-start -->
- * Valid options are: <p/>
- *
- * <pre> -D
- * Don't normalize input data.</pre>
- *
- * <pre> -R
- * Retain enough PC attributes to account
- * for this proportion of variance in the original data.
- * (default = 0.95)</pre>
- *
- * <pre> -O
- * Transform through the PC space and
- * back to the original space.</pre>
- *
- * <pre> -A
- * Maximum number of attributes to include in
- * transformed attribute names. (-1 = include all)</pre>
- *
- <!-- options-end -->
- *
- * @author Mark Hall (mhall@cs.waikato.ac.nz)
- * @author Gabi Schmidberger (gabi@cs.waikato.ac.nz)
- * @version $Revision: 5987 $
- */
- public class PrincipalComponents
- extends UnsupervisedAttributeEvaluator
- implements AttributeTransformer, OptionHandler {
- /** for serialization */
- static final long serialVersionUID = 3310137541055815078L;
- /** The data to transform analyse/transform */
- private Instances m_trainInstances;
- /** Keep a copy for the class attribute (if set) */
- private Instances m_trainHeader;
- /** The header for the transformed data format */
- private Instances m_transformedFormat;
- /** The header for data transformed back to the original space */
- private Instances m_originalSpaceFormat;
- /** Data has a class set */
- private boolean m_hasClass;
- /** Class index */
- private int m_classIndex;
- /** Number of attributes */
- private int m_numAttribs;
- /** Number of instances */
- private int m_numInstances;
- /** Correlation matrix for the original data */
- private double [][] m_correlation;
- /** Will hold the unordered linear transformations of the (normalized)
- original data */
- private double [][] m_eigenvectors;
- /** Eigenvalues for the corresponding eigenvectors */
- private double [] m_eigenvalues = null;
- /** Sorted eigenvalues */
- private int [] m_sortedEigens;
- /** sum of the eigenvalues */
- private double m_sumOfEigenValues = 0.0;
- /** Filters for original data */
- private ReplaceMissingValues m_replaceMissingFilter;
- private Normalize m_normalizeFilter;
- private NominalToBinary m_nominalToBinFilter;
- private Remove m_attributeFilter;
- /** used to remove the class column if a class column is set */
- private Remove m_attribFilter;
- /** The number of attributes in the pc transformed data */
- private int m_outputNumAtts = -1;
- /** normalize the input data? */
- private boolean m_normalize = true;
- /** the amount of varaince to cover in the original data when
- retaining the best n PC's */
- private double m_coverVariance = 0.95;
- /** transform the data through the pc space and back to the original
- space ? */
- private boolean m_transBackToOriginal = false;
- /** maximum number of attributes in the transformed attribute name */
- private int m_maxAttrsInName = 5;
- /** holds the transposed eigenvectors for converting back to the
- original space */
- private double [][] m_eTranspose;
- /**
- * Returns a string describing this attribute transformer
- * @return a description of the evaluator suitable for
- * displaying in the explorer/experimenter gui
- */
- public String globalInfo() {
- return "Performs a principal components analysis and transformation of "
- +"the data. Use in conjunction with a Ranker search. Dimensionality "
- +"reduction is accomplished by choosing enough eigenvectors to "
- +"account for some percentage of the variance in the original data---"
- +"default 0.95 (95%). Attribute noise can be filtered by transforming "
- +"to the PC space, eliminating some of the worst eigenvectors, and "
- +"then transforming back to the original space.";
- }
- /**
- * Returns an enumeration describing the available options. <p>
- *
- * @return an enumeration of all the available options.
- **/
- public Enumeration listOptions () {
- Vector newVector = new Vector(3);
- newVector.addElement(new Option("/tDon't normalize input data."
- , "D", 0, "-D"));
- newVector.addElement(new Option("/tRetain enough PC attributes to account "
- +"/n/tfor this proportion of variance in "
- +"the original data./n"
- + "/t(default = 0.95)",
- "R",1,"-R"));
- newVector.addElement(new Option("/tTransform through the PC space and "
- +"/n/tback to the original space."
- , "O", 0, "-O"));
- newVector.addElement(new Option("/tMaximum number of attributes to include in "
- + "/n/ttransformed attribute names. (-1 = include all)"
- , "A", 1, "-A"));
- return newVector.elements();
- }
- /**
- * Parses a given list of options. <p/>
- *
- <!-- options-start -->
- * Valid options are: <p/>
- *
- * <pre> -D
- * Don't normalize input data.</pre>
- *
- * <pre> -R
- * Retain enough PC attributes to account
- * for this proportion of variance in the original data.
- * (default = 0.95)</pre>
- *
- * <pre> -O
- * Transform through the PC space and
- * back to the original space.</pre>
- *
- * <pre> -A
- * Maximum number of attributes to include in
- * transformed attribute names. (-1 = include all)</pre>
- *
- <!-- options-end -->
- *
- * @param options the list of options as an array of strings
- * @throws Exception if an option is not supported
- */
- public void setOptions (String[] options)
- throws Exception {
- resetOptions();
- String optionString;
- optionString = Utils.getOption('R', options);
- if (optionString.length() != 0) {
- Double temp;
- temp = Double.valueOf(optionString);
- setVarianceCovered(temp.doubleValue());
- }
- optionString = Utils.getOption('A', options);
- if (optionString.length() != 0) {
- setMaximumAttributeNames(Integer.parseInt(optionString));
- }
- setNormalize(!Utils.getFlag('D', options));
- setTransformBackToOriginal(Utils.getFlag('O', options));
- }
- /**
- * Reset to defaults
- */
- private void resetOptions() {
- m_coverVariance = 0.95;
- m_normalize = true;
- m_sumOfEigenValues = 0.0;
- m_transBackToOriginal = false;
- }
- /**
- * Returns the tip text for this property
- * @return tip text for this property suitable for
- * displaying in the explorer/experimenter gui
- */
- public String normalizeTipText() {
- return "Normalize input data.";
- }
- /**
- * Set whether input data will be normalized.
- * @param n true if input data is to be normalized
- */
- public void setNormalize(boolean n) {
- m_normalize = n;
- }
- /**
- * Gets whether or not input data is to be normalized
- * @return true if input data is to be normalized
- */
- public boolean getNormalize() {
- return m_normalize;
- }
- /**
- * Returns the tip text for this property
- * @return tip text for this property suitable for
- * displaying in the explorer/experimenter gui
- */
- public String varianceCoveredTipText() {
- return "Retain enough PC attributes to account for this proportion of "
- +"variance.";
- }
- /**
- * Sets the amount of variance to account for when retaining
- * principal components
- * @param vc the proportion of total variance to account for
- */
- public void setVarianceCovered(double vc) {
- m_coverVariance = vc;
- }
- /**
- * Gets the proportion of total variance to account for when
- * retaining principal components
- * @return the proportion of variance to account for
- */
- public double getVarianceCovered() {
- return m_coverVariance;
- }
- /**
- * Returns the tip text for this property
- * @return tip text for this property suitable for
- * displaying in the explorer/experimenter gui
- */
- public String maximumAttributeNamesTipText() {
- return "The maximum number of attributes to include in transformed attribute names.";
- }
- /**
- * Sets maximum number of attributes to include in
- * transformed attribute names.
- * @param m the maximum number of attributes
- */
- public void setMaximumAttributeNames(int m) {
- m_maxAttrsInName = m;
- }
- /**
- * Gets maximum number of attributes to include in
- * transformed attribute names.
- * @return the maximum number of attributes
- */
- public int getMaximumAttributeNames() {
- return m_maxAttrsInName;
- }
- /**
- * Returns the tip text for this property
- * @return tip text for this property suitable for
- * displaying in the explorer/experimenter gui
- */
- public String transformBackToOriginalTipText() {
- return "Transform through the PC space and back to the original space. "
- +"If only the best n PCs are retained (by setting varianceCovered < 1) "
- +"then this option will give a dataset in the original space but with "
- +"less attribute noise.";
- }
- /**
- * Sets whether the data should be transformed back to the original
- * space
- * @param b true if the data should be transformed back to the
- * original space
- */
- public void setTransformBackToOriginal(boolean b) {
- m_transBackToOriginal = b;
- }
- /**
- * Gets whether the data is to be transformed back to the original
- * space.
- * @return true if the data is to be transformed back to the original space
- */
- public boolean getTransformBackToOriginal() {
- return m_transBackToOriginal;
- }
- /**
- * Gets the current settings of PrincipalComponents
- *
- * @return an array of strings suitable for passing to setOptions()
- */
- public String[] getOptions () {
- String[] options = new String[6];
- int current = 0;
- if (!getNormalize()) {
- options[current++] = "-D";
- }
- options[current++] = "-R";
- options[current++] = ""+getVarianceCovered();
- options[current++] = "-A";
- options[current++] = ""+getMaximumAttributeNames();
- if (getTransformBackToOriginal()) {
- options[current++] = "-O";
- }
- while (current < options.length) {
- options[current++] = "";
- }
- return options;
- }
- /**
- * Returns the capabilities of this evaluator.
- *
- * @return the capabilities of this evaluator
- * @see Capabilities
- */
- public Capabilities getCapabilities() {
- Capabilities result = super.getCapabilities();
- result.disableAll();
- // attributes
- result.enable(Capability.NOMINAL_ATTRIBUTES);
- result.enable(Capability.NUMERIC_ATTRIBUTES);
- result.enable(Capability.DATE_ATTRIBUTES);
- result.enable(Capability.MISSING_VALUES);
- // class
- result.enable(Capability.NOMINAL_CLASS);
- result.enable(Capability.NUMERIC_CLASS);
- result.enable(Capability.DATE_CLASS);
- result.enable(Capability.MISSING_CLASS_VALUES);
- result.enable(Capability.NO_CLASS);
- return result;
- }
- /**
- * Initializes principal components and performs the analysis
- * @param data the instances to analyse/transform
- * @throws Exception if analysis fails
- */
- public void buildEvaluator(Instances data) throws Exception {
- // can evaluator handle data?
- getCapabilities().testWithFail(data);
- buildAttributeConstructor(data);
- }
- private void buildAttributeConstructor (Instances data) throws Exception {
- m_eigenvalues = null;
- m_outputNumAtts = -1;
- m_attributeFilter = null;
- m_nominalToBinFilter = null;
- m_sumOfEigenValues = 0.0;
- m_trainInstances = new Instances(data);
- // make a copy of the training data so that we can get the class
- // column to append to the transformed data (if necessary)
- m_trainHeader = new Instances(m_trainInstances, 0);
- //处理缺失值、归一化、类别二值化、删除单值属性或者都为缺失值的属性
- //mormalize nominaltobin
- //TODO.... deleted
- m_numInstances = m_trainInstances.numInstances();
- m_numAttribs = m_trainInstances.numAttributes();
- //计算相关矩阵,得到m_numAttribs * m_numAttribs的方阵
- fillCorrelation();
- double [] d = new double[m_numAttribs];
- double [][] v = new double[m_numAttribs][m_numAttribs];
- Matrix corr = new Matrix(m_correlation);
- //得到特征向量和特征值
- corr.eigenvalueDecomposition(v, d);
- m_eigenvectors = (double [][])v.clone();
- m_eigenvalues = (double [])d.clone();
- // any eigenvalues less than 0 are not worth anything --- change to 0
- for (int i = 0; i < m_eigenvalues.length; i++) {
- if (m_eigenvalues[i] < 0) {
- m_eigenvalues[i] = 0.0;
- }
- }
- m_sortedEigens = Utils.sort(m_eigenvalues);
- m_sumOfEigenValues = Utils.sum(m_eigenvalues);
- }
- /**
- * Returns just the header for the transformed data (ie. an empty
- * set of instances. This is so that AttributeSelection can
- * determine the structure of the transformed data without actually
- * having to get all the transformed data through transformedData().
- * @return the header of the transformed data.
- * @throws Exception if the header of the transformed data can't
- * be determined.
- */
- public Instances transformedHeader() throws Exception {
- }
- /**
- * Gets the transformed training data.
- * @return the transformed training data
- * @throws Exception if transformed data can't be returned
- */
- public Instances transformedData(Instances data) throws Exception {
- }
- /**
- * Evaluates the merit of a transformed attribute. This is defined
- * to be 1 minus the cumulative variance explained. Merit can't
- * be meaningfully evaluated if the data is to be transformed back
- * to the original space.
- * @param att the attribute to be evaluated
- * @return the merit of a transformed attribute
- * @throws Exception if attribute can't be evaluated
- */
- public double evaluateAttribute(int att) throws Exception {
- if (m_eigenvalues == null) {
- throw new Exception("Principal components hasn't been built yet!");
- }
- if (m_transBackToOriginal) {
- return 1.0; // can't evaluate back in the original space!
- }
- // return 1-cumulative variance explained for this transformed att
- double cumulative = 0.0;
- for (int i = m_numAttribs - 1; i >= m_numAttribs - att - 1; i--) {
- cumulative += m_eigenvalues[m_sortedEigens[i]];
- }
- return 1.0 - cumulative / m_sumOfEigenValues;
- }
- /**
- * Fill the correlation matrix
- */
- private void fillCorrelation() {
- m_correlation = new double[m_numAttribs][m_numAttribs];
- double [] att1 = new double [m_numInstances];
- double [] att2 = new double [m_numInstances];
- double corr;
- for (int i = 0; i < m_numAttribs; i++) {
- for (int j = 0; j < m_numAttribs; j++) {
- if (i == j) {
- m_correlation[i][j] = 1.0;
- } else {
- for (int k = 0; k < m_numInstances; k++) {
- att1[k] = m_trainInstances.instance(k).value(i);
- att2[k] = m_trainInstances.instance(k).value(j);
- }
- corr = Utils.correlation(att1,att2,m_numInstances);
- m_correlation[i][j] = corr;
- m_correlation[j][i] = corr;
- }
- }
- }
- }
- /**
- * Main method for testing this class
- * @param argv should contain the command line arguments to the
- * evaluator/transformer (see AttributeSelection)
- */
- public static void main(String [] argv) {
- runEvaluator(new PrincipalComponents(), argv);
- }
- }