PCA (Principal Component Analysis) dimensionality reduction: principles and a Python implementation on the optdigits and point-cloud datasets

License: CC 4.0 BY-SA

Tags:

#PCA #PrincipalComponentAnalysis #python #optdigits #PointCloud


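This article explains the principles of PCA (principal component analysis), including how the input matrix is constructed, how the covariance matrix is computed, the eigendecomposition, and the dimensionality-reduction step itself. PCA is then implemented in Python, with the optdigits dataset and point-cloud data as examples, to show how high-dimensional data can be reduced to 2D or 3D for visualization and further analysis.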

PCA (Principal Component Analysis): principles

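*Suppose we have m samples, each with n features.
*We can then build the input matrix X with m rows and n columns.
*The goal of PCA dimensionality reduction is to represent each sample, originally described by n dimensions, with only k dimensions, where k<n.
*I can no longer find my old notes on the mathematical derivation, so I will not expand on it here; I will add it once I have a fresh understanding of my own.
*Recommended introductory material: the PCA section of Zhou Zhihua's (Nanjing University) "watermelon book", which covers most of the ideas in three pages.
In addition, for the maximum-variance interpretation of PCA, I recommend the article "主成分分析（Principal components analysis）-最大方差解释".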
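For readers who want the gist of the maximum-variance view before reading those references, the standard argument can be sketched as follows. This is not taken from the author's notes; the symbols S (covariance of the centered samples), w (a unit projection direction) and λ (a Lagrange multiplier) are introduced here only for illustration.

```latex
\begin{aligned}
S &= \frac{1}{m}\sum_{i=1}^{m} (x_i-\bar{x})(x_i-\bar{x})^{\top} \\
\max_{w}\;& w^{\top} S w \quad \text{subject to } w^{\top} w = 1 \\
\nabla_w\!\left[\, w^{\top} S w - \lambda\,(w^{\top} w - 1) \right] = 0
\;&\Longrightarrow\; S w = \lambda w, \qquad w^{\top} S w = \lambda
\end{aligned}
```

Under these assumptions the variance of the data projected onto w equals the eigenvalue λ, so the direction of largest variance is the eigenvector of S with the largest eigenvalue, and the first k principal directions are the eigenvectors of the k largest eigenvalues.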

PCA implementation steps

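1. Center X by subtracting each column's mean, then compute the covariance matrix covX = X.T.dot(X), whose shape is (n, n). Strictly speaking this is the scatter matrix; it differs from the sample covariance only by the constant factor 1/(m-1), which does not change the eigenvectors.
2. Eigendecompose covX to obtain the eigenvalues eigenValue and the eigenvectors eigenVector. eigenVector has shape (n, n), with one eigenvector per column.
3. Sort eigenValue in descending order, take the k largest eigenvalues, and extract their corresponding eigenvectors from eigenVector to get selectVec, of shape (n, k).
4. The reduced data matrix is X_lowdimension = X.dot(selectVec), of shape (m, k).
   When k is small (k=2 or k=3), you can plot these data to see what PCA actually did.
5. Reconstruct X from the reduced data matrix: X_reconstruct = X_lowdimension.dot(selectVec.T), of shape (m, n). Add the column means back to return to the original coordinates.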
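The steps above can be sketched in a few lines of NumPy. This is only a minimal illustration on synthetic data: the function name `pca_steps` and the demo array `X_demo` are made up here, and `np.linalg.eigh` is used on the assumption that covX is symmetric, which holds for X.T.dot(X).

```python
import numpy as np


def pca_steps(X, k):
    """Follow the listed steps on an (m, n) array X; returns (X_low, X_rec)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean                                    # centre each feature
    covX = Xc.T.dot(Xc)                              # step 1: scatter matrix, shape (n, n)
    eigenValue, eigenVector = np.linalg.eigh(covX)   # step 2: eigenvectors are the columns
    order = np.argsort(eigenValue)[::-1]             # step 3: indices in descending order
    selectVec = eigenVector[:, order[:k]]            # shape (n, k)
    X_low = Xc.dot(selectVec)                        # step 4: shape (m, k)
    X_rec = X_low.dot(selectVec.T) + mean            # step 5: shape (m, n)
    return X_low, X_rec


# Synthetic data whose features have very different spreads (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])

X_low, X_rec = pca_steps(X_demo, 2)
err_k2 = np.linalg.norm(X_demo - X_rec)
err_k5 = np.linalg.norm(X_demo - pca_steps(X_demo, 5)[1])
print(X_low.shape, X_rec.shape)
print(err_k2, err_k5)
```

Keeping all n components should reproduce the data up to rounding error, while keeping fewer components leaves a residual that reflects the discarded variance.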

Python implementation of the core PCA algorithm

import numpy as np


def meanX(dataX):
    """Return the per-feature (column) mean of dataX.

    dataX: NumPy array of shape (m, n); rows are samples, columns are features.
    """
    return np.mean(dataX, axis=0)  # axis=0 averages over the rows, one mean per column


def PCA(XMat, k):
    """Project XMat onto its first k principal components.

    Parameters:
        XMat: NumPy array of shape (m, n); rows are samples, columns are features.
        k: number of eigenvectors to keep (those with the k largest eigenvalues).
    Returns:
        X_afterPCA: the low-dimensional data, shape (m, k).
        recon_XMat: the data reconstructed from the k components and shifted
            back to the original coordinates, shape (m, n).
    """
    XMat = np.asarray(XMat, dtype=float)
    m, n = XMat.shape
    if k > n:
        raise ValueError("k must not exceed the number of features")
    average = meanX(XMat)  # column means, shape (n,)
    XMat_Centralization = XMat - average  # broadcasting subtracts the mean from every row, shape (m, n)
    # covX = np.cov(XMat_Centralization.T) gives the same eigenvectors, scaled by 1/(m-1)
    covX = XMat_Centralization.T.dot(XMat_Centralization)  # scatter matrix X.T * X, shape (n, n)
    # covX is symmetric, so eigh is used; it returns real eigenvalues of shape (n,)
    # and eigenvectors as the columns of an (n, n) array
    eigenVal, eigenVec = np.linalg.eigh(covX)
    index_EigenVal = np.argsort(-eigenVal)  # indices that sort eigenVal in descending order
    selectVec = eigenVec[:, index_EigenVal[:k]].T  # top-k eigenvectors as rows, shape (k, n)
    X_afterPCA = XMat_Centralization.dot(selectVec.T)  # n features become k features, shape (m, k)
    recon_XMat = X_afterPCA.dot(selectVec) + average  # reconstruction, shape (m, n)
    return X_afterPCA, recon_XMat
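The implementation uses X.T.dot(X) on centered data rather than np.cov. A small check on synthetic data, sketched below, suggests why the two are interchangeable for PCA: the matrices differ only by the factor m-1, so the eigenvectors agree up to sign. All variable names here are illustrative and not part of the original code.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))           # synthetic data: 50 samples, 4 features
Xc = X - X.mean(axis=0)                # centre each column

scatter = Xc.T @ Xc                    # what the PCA code above uses
cov = np.cov(Xc.T)                     # sample covariance, divides by m - 1

same_up_to_scale = np.allclose(scatter, (len(X) - 1) * cov)

vals_s, vecs_s = np.linalg.eigh(scatter)
vals_c, vecs_c = np.linalg.eigh(cov)
# Matching eigenvectors may differ in sign, so compare absolute inner products.
same_directions = np.allclose(np.abs(vecs_s.T @ vecs_c), np.eye(4), atol=1e-8)

print(same_up_to_scale, same_directions)
```

Only the eigenvalues change scale, so the ordering of the components and the projected data are unaffected by which matrix is used.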

Introduction to the optdigits dataset

Download: http://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

Title of Database: Optical Recognition of Handwritten Digits
Source:
E. Alpaydin, C. Kaynak
Department of Computer Engineering
Bogazici University, 80815 Istanbul Turkey
alpaydin@boun.edu.tr
July 1998
Past Usage:
C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their
Applications to Handwritten Digit Recognition,
MSc Thesis, Institute of Graduate Studies in Science and
Engineering, Bogazici University.

E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika,
to appear. ftp://ftp.icsi.berkeley.edu/pub/ai/ethem/kyb.ps.Z
Relevant Information:
We used preprocessing programs made available by NIST to extract
normalized bitmaps of handwritten digits from a preprinted form. From
a total of 43 people, 30 contributed to the training set and different
13 to the test set. 32x32 bitmaps are divided into nonoverlapping
blocks of 4x4 and the number of on pixels are counted in each block.
This generates an input matrix of 8x8 where each element is an
integer in the range 0..16. This reduces dimensionality and gives
invariance to small distortions.
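
The block-counting step described above can be illustrated with a short NumPy sketch. It assumes a 32x32 array of 0/1 pixels; the function name `block_count` and the random bitmap are hypothetical stand-ins, not the NIST preprocessing code.

```python
import numpy as np


def block_count(bitmap, block=4):
    """Count the 'on' pixels in each non-overlapping block x block tile."""
    bitmap = np.asarray(bitmap)
    h, w = bitmap.shape
    # Axis 0 and 2 index the tile, axis 1 and 3 index the pixel inside the tile.
    tiles = bitmap.reshape(h // block, block, w // block, block)
    return tiles.sum(axis=(1, 3))


rng = np.random.default_rng(0)
demo_bitmap = rng.integers(0, 2, size=(32, 32))  # synthetic 0/1 bitmap, not a real digit
features = block_count(demo_bitmap)
print(features.shape)
```

Each of the 64 resulting values lies between 0 and 16, which matches the attribute range given in the dataset description.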

For info on NIST preprocessing routines, see
M. D. Garris, J. L. Blue, G. T. Candela, D. L. Dimmick, J. Geist,
P. J. Grother, S. A. Janet, and C. L. Wilson, NIST Form-Based
Handprint Recognition System, NISTIR 5469, 1994.
Number of Instances
optdigits.tra Training 3823
optdigits.tes Testing 1797

The way we used the dataset was to use half of training for
actual training, one-fourth for validation and one-fourth
for writer-dependent testing. The test set was used for
writer-independent testing and is the actual quality measure.
Number of Attributes
64 input+1 class attribute
For Each Attribute:
All input attributes are integers in the range 0..16.
The last attribute is the class code 0..9
Missing Attribute Values
None
Class Distribution
Class: No of examples in training set
0: 376
1: 389
2: 380
3: 389
4: 387
5: 376
6: 377
7: 387
8: 380
9: 382

Class: No of examples in testing set
0: 178
1: 182
2: 177
3: 183
4: 181
5: 182
6: 181
7: 179
8: 174
9: 180

Accuracy on the testing set with k-nn
using Euclidean distance as the metric

k = 1 : 98.00
k = 2 : 97.38
k = 3 : 97.83
k = 4 : 97.61
k = 5 : 97.89
k = 6 : 97.77
k = 7 : 97.66
k = 8 : 97.66
k = 9 : 97.72
k = 10 : 97.55
k = 11 : 97.89

Case 1: reducing the optdigits.tra dataset with PCA

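(1) Extract all rows whose digit is '3' from optdigits.tra
(2) Apply PCA to them, reducing the data to 2D
(3) Plot the resulting 2D points as a scatter plot
(4) Lay a grid over the scatter plane and find the data points nearest to the grid nodes, 5×5 = 25 points in total
(5) Draw the original data corresponding to these points as grayscale images
(6) Compare the patterns that these 25 '3's exhibit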
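
A minimal sketch of steps (1), (2), (4) and (5) is given below, under several assumptions that are not from the original post: the data are read as comma-separated rows of 64 features plus a class label, the grid spans the min-max range of the projected points, and the helper names `load_digit`, `pca_2d` and `nearest_to_grid` are made up for illustration. Synthetic rows stand in for optdigits.tra so the block runs without the file.

```python
import io
import numpy as np


def load_digit(source, digit):
    """Load rows of 64 features + class label and keep the features of one digit."""
    data = np.loadtxt(source, delimiter=",")
    return data[data[:, -1] == digit, :-1]


def pca_2d(X):
    """Project centred data onto the two eigenvectors with the largest eigenvalues."""
    Xc = X - X.mean(axis=0)
    _, vecs = np.linalg.eigh(Xc.T @ Xc)   # eigenvalues come back in ascending order
    return Xc @ vecs[:, ::-1][:, :2]      # reverse the columns, keep the first two


def nearest_to_grid(points, grid_size=5):
    """Return the grid nodes and, for each node, the index of the nearest point."""
    xs = np.linspace(points[:, 0].min(), points[:, 0].max(), grid_size)
    ys = np.linspace(points[:, 1].min(), points[:, 1].max(), grid_size)
    nodes = np.array([(x, y) for y in ys for x in xs])
    dists = np.linalg.norm(points[None, :, :] - nodes[:, None, :], axis=2)
    return nodes, dists.argmin(axis=1)


# Synthetic stand-in for optdigits.tra: 300 rows, values 0..16, labels 0..9.
rng = np.random.default_rng(0)
fake = np.hstack([rng.integers(0, 17, size=(300, 64)),
                  rng.integers(0, 10, size=(300, 1))])
text = "\n".join(",".join(str(v) for v in row) for row in fake)

threes = load_digit(io.StringIO(text), 3)   # step (1)
points = pca_2d(threes)                     # step (2)
nodes, idx = nearest_to_grid(points)        # step (4)
images = threes[idx].reshape(-1, 8, 8)      # step (5): 25 images of 8x8
print(points.shape, images.shape)
```

With the real file, passing its path to `load_digit` in place of the StringIO object should work the same way. For steps (3) and (5), `points` can be drawn with matplotlib's `plt.scatter` and each entry of `images` with `plt.imshow(..., cmap="gray")`. Note that with this simple nearest-neighbour rule one data point may be chosen for more than one grid node.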