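PCA (Principal Component Analysis) dimensionality reduction: principles and a Python implementation on the optdigits and point cloud datasets

This article walks through the principles of PCA (principal component analysis): constructing the input matrix, computing the covariance matrix, eigendecomposition, and the dimensionality reduction itself. It implements PCA in Python and uses the optdigits dataset and point cloud data as examples to show how high-dimensional data can be reduced to 2D or 3D for visualization and further analysis.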

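Principles of PCA (Principal Component Analysis)

*Suppose we have m samples, each with n features
*We can then build the input matrix X with m rows and n columns
*The goal of PCA is to represent each sample, originally described by n dimensions, with k dimensions, where k<n
*I cannot find my old notes on the mathematical derivation, so I will not expand on it here; I will add it once I have a fresh understanding of my own
*Recommended introductory material: the treatment of PCA in the "watermelon book" (Machine Learning by Zhou Zhihua of Nanjing University), which covers most of the ideas in three pages
Beyond that, for the maximum-variance interpretation of PCA, I recommend the article "主成分分析(Principal components analysis)-最大方差解释" (Principal Components Analysis: The Maximum-Variance Interpretation)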

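PCA implementation steps

  1. Center X by subtracting each column's mean, then compute covX = X.T.dot(X); the shape of covX is (n,n). This equals the covariance matrix up to a constant factor, which does not change the eigenvectors.
  2. Eigendecompose covX to obtain the eigenvalues eigenValue and the eigenvectors eigenVector. The shape of eigenVector is (n,n), with one eigenvector per column.
  3. Sort eigenValue in descending order, take the first k eigenvalues, and extract their eigenvectors from eigenVector to obtain selectVec, of shape (n,k).
  4. The reduced data matrix is X_lowdimension=X.dot(selectVec), of shape (m,k). When k is small (k=2 or k=3), you can plot the data to see what PCA actually did.
  5. Reconstruct X from the reduced data matrix: X_reconstruct=X_lowdimension.dot(selectVec.T), of shape (m,n). Since X was centered in step 1, add the column means back to return to the original coordinates.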
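The steps above can be written compactly in matrix notation. This is only a restatement under the assumption that the data are centered before the decomposition; the symbols \mu, \bar{X}, C, V, W, Z and \hat{X} are introduced here for illustration and do not appear in the original text.

```latex
\begin{aligned}
\mu &= \tfrac{1}{m} X^{\top}\mathbf{1}_m, \qquad \bar{X} = X - \mathbf{1}_m \mu^{\top} && \text{(centering, } \bar{X}\in\mathbb{R}^{m\times n}\text{)}\\
C &= \bar{X}^{\top}\bar{X} \in \mathbb{R}^{n\times n} && \text{(step 1)}\\
C &= V \Lambda V^{\top}, \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n && \text{(steps 2 and 3)}\\
W &= [\,v_1, \dots, v_k\,] \in \mathbb{R}^{n\times k} && \text{(step 3, selectVec)}\\
Z &= \bar{X} W \in \mathbb{R}^{m\times k} && \text{(step 4, X\_lowdimension)}\\
\hat{X} &= Z W^{\top} + \mathbf{1}_m \mu^{\top} \in \mathbb{R}^{m\times n} && \text{(step 5, X\_reconstruct)}
\end{aligned}
```

When k = n the matrix W is orthogonal, so \hat{X} = X and nothing is lost; for k < n the reconstruction drops the variance carried by the discarded eigenvalues.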

Core PCA algorithm in Python

import numpy as np


def meanX(dataX):
    """Return the per-feature (column-wise) mean of dataX.

    dataX: NumPy array whose rows are samples and whose columns are features.
    """
    return np.mean(dataX, axis=0)  # axis=0 averages down each column


def PCA(XMat, k):
    """Project XMat onto its first k principal components.

    Parameters:
        - XMat: NumPy array of shape (m, n); rows are samples, columns are features
        - k: number of eigenvectors to keep (those with the k largest eigenvalues)
    Returns:
        - X_afterPCA: the low-dimensional matrix, shape (m, k)
        - recon_XMat: XMat reconstructed from X_afterPCA in the original coordinates, shape (m, n)
    """
    XMat = np.asarray(XMat, dtype=float)
    m, n = XMat.shape
    if k > n:
        raise ValueError("k must not exceed the number of features")
    average = meanX(XMat)  # per-feature mean, shape (n,)
    XMat_Centralization = XMat - average  # broadcasting subtracts the mean from every row, shape (m, n)
    # Scatter matrix X^T X, shape (n, n). np.cov(XMat_Centralization.T) differs only by
    # the factor 1/(m-1), which leaves the eigenvectors unchanged.
    covX = XMat_Centralization.T.dot(XMat_Centralization)
    # covX is symmetric, so eigh returns real eigenvalues, shape (n,), and
    # orthonormal eigenvectors as the columns of eigenVec, shape (n, n)
    eigenVal, eigenVec = np.linalg.eigh(covX)
    index_EigenVal = np.argsort(-eigenVal)  # indices that sort eigenVal in descending order
    selectVec = eigenVec[:, index_EigenVal[:k]].T  # top-k eigenvectors as rows, shape (k, n)
    X_afterPCA = XMat_Centralization.dot(selectVec.T)  # n features become k features per sample, shape (m, k)
    recon_XMat = X_afterPCA.dot(selectVec) + average  # map back and restore the mean, shape (m, n)
    return X_afterPCA, recon_XMat
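One way to sanity-check an eigendecomposition-based PCA is to compare it with the singular value decomposition of the centered data, which should give the same principal directions up to sign. The sketch below is self-contained rather than a call into the function above; `pca_via_eig`, `pca_via_svd` and the synthetic `X_demo` are illustrative names and data, not part of the original code.

```python
import numpy as np


def pca_via_eig(X, k):
    """Top-k principal directions (as rows) from the eigendecomposition of Xc^T Xc."""
    Xc = X - X.mean(axis=0)
    vals, vecs = np.linalg.eigh(Xc.T @ Xc)  # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:k]      # indices of the k largest eigenvalues
    return vecs[:, order].T, vals[order]


def pca_via_svd(X, k):
    """Top-k principal directions (as rows) from the SVD of the centered data."""
    Xc = X - X.mean(axis=0)
    _, s, vt = np.linalg.svd(Xc, full_matrices=False)
    return vt[:k], s[:k] ** 2  # squared singular values equal the eigenvalues of Xc^T Xc


rng = np.random.default_rng(0)
# Synthetic data with a clearly different spread along each axis
X_demo = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

dirs_eig, vals_eig = pca_via_eig(X_demo, 2)
dirs_svd, vals_svd = pca_via_svd(X_demo, 2)

# Eigenvectors are only defined up to sign, so compare absolute cosines
cosines = np.abs(np.sum(dirs_eig * dirs_svd, axis=1))
print(np.allclose(cosines, 1.0), np.allclose(vals_eig, vals_svd))
```

The SVD route avoids forming the (n, n) matrix explicitly, which can matter numerically when n is large; for the 64-feature and 3-feature data used in this article either route is adequate.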

About the optdigits dataset

Download: http://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

  1. Title of Database: Optical Recognition of Handwritten Digits

  2. Source:
    E. Alpaydin, C. Kaynak
    Department of Computer Engineering
    Bogazici University, 80815 Istanbul Turkey
    alpaydin@boun.edu.tr
    July 1998

  3. Past Usage:
    C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their
    Applications to Handwritten Digit Recognition,
    MSc Thesis, Institute of Graduate Studies in Science and
    Engineering, Bogazici University.

    E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika,
    to appear. ftp://ftp.icsi.berkeley.edu/pub/ai/ethem/kyb.ps.Z

  4. Relevant Information:
    We used preprocessing programs made available by NIST to extract
    normalized bitmaps of handwritten digits from a preprinted form. From
    a total of 43 people, 30 contributed to the training set and different
    13 to the test set. 32x32 bitmaps are divided into nonoverlapping
    blocks of 4x4 and the number of on pixels are counted in each block.
    This generates an input matrix of 8x8 where each element is an
    integer in the range 0…16. This reduces dimensionality and gives
    invariance to small distortions.

    For info on NIST preprocessing routines, see
    M. D. Garris, J. L. Blue, G. T. Candela, D. L. Dimmick, J. Geist,
    P. J. Grother, S. A. Janet, and C. L. Wilson, NIST Form-Based
    Handprint Recognition System, NISTIR 5469, 1994.

  5. Number of Instances
    optdigits.tra Training 3823
    optdigits.tes Testing 1797

    The way we used the dataset was to use half of training for
    actual training, one-fourth for validation and one-fourth
    for writer-dependent testing. The test set was used for
    writer-independent testing and is the actual quality measure.

  6. Number of Attributes
    64 input+1 class attribute

  7. For Each Attribute:
    All input attributes are integers in the range 0…16.
    The last attribute is the class code 0…9

  8. Missing Attribute Values
    None

  9. Class Distribution
    Class: No of examples in training set
    0: 376
    1: 389
    2: 380
    3: 389
    4: 387
    5: 376
    6: 377
    7: 387
    8: 380
    9: 382

    Class: No of examples in testing set
    0: 178
    1: 182
    2: 177
    3: 183
    4: 181
    5: 182
    6: 181
    7: 179
    8: 174
    9: 180

Accuracy on the testing set with k-nn
using Euclidean distance as the metric

k = 1 : 98.00
k = 2 : 97.38
k = 3 : 97.83
k = 4 : 97.61
k = 5 : 97.89
k = 6 : 97.77
k = 7 : 97.66
k = 8 : 97.66
k = 9 : 97.72
k = 10 : 97.55
k = 11 : 97.89
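The attribute layout described above (64 integer inputs followed by one class code per instance) can be parsed as in the sketch below. It assumes the file is comma-separated with one instance per line; `load_optdigits` is a hypothetical helper name, and the two rows fed to it are made-up stand-ins for the real file.

```python
import io

import numpy as np


def load_optdigits(source):
    """Split an optdigits file into features (m, 64) and labels (m,).

    `source` is a path or file-like object holding comma-separated integers,
    65 per line: 64 block counts followed by the class code.
    """
    data = np.loadtxt(source, delimiter=",", dtype=int, ndmin=2)
    if data.shape[1] != 65:
        raise ValueError("expected 65 columns (64 inputs + 1 class)")
    return data[:, :64], data[:, 64]


# Two made-up rows in the same layout, standing in for optdigits.tra
fake_file = io.StringIO(
    ",".join(["0"] * 64) + ",3\n" + ",".join(["16"] * 64) + ",7\n"
)
features, labels = load_optdigits(fake_file)
threes = features[labels == 3]  # keep only the rows whose class code is 3
print(features.shape, labels.tolist(), threes.shape)
```

With the real file, `load_optdigits("optdigits.tra")` would be the corresponding call, and each 64-element row can be reshaped to (8, 8) to view it as an image.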

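Case 1: reducing the optdigits.tra dataset with PCA

(1) Extract every row of optdigits.tra whose digit is '3'
(2) Apply PCA to these rows to reduce them to 2D
(3) Plot the resulting 2D points on the plane
(4) Lay a 5×5 grid over the scatter plot and find the data point nearest to each of the 25 grid nodes
(5) Draw the original data of these points as grayscale images
(6) Compare the patterns shown by these 25 '3's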
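Step (4) can be sketched as a nearest-neighbor lookup from each grid node to the projected points. The original does not say how the grid is placed, so this sketch assumes it spans the bounding box of the 2D points; `nearest_to_grid` is a hypothetical helper and `demo_points` is random data standing in for the PCA output.

```python
import numpy as np


def nearest_to_grid(points_2d, grid_size=5):
    """Return indices of the points closest to a grid_size x grid_size lattice.

    The lattice spans the bounding box of points_2d. The result has shape
    (grid_size, grid_size) and each entry indexes into points_2d.
    """
    points_2d = np.asarray(points_2d, dtype=float)
    xs = np.linspace(points_2d[:, 0].min(), points_2d[:, 0].max(), grid_size)
    ys = np.linspace(points_2d[:, 1].min(), points_2d[:, 1].max(), grid_size)
    gx, gy = np.meshgrid(xs, ys)
    grid = np.column_stack([gx.ravel(), gy.ravel()])  # (grid_size**2, 2)
    # Squared Euclidean distance from every grid node to every point
    d2 = ((grid[:, None, :] - points_2d[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1).reshape(grid_size, grid_size)


rng = np.random.default_rng(0)
demo_points = rng.normal(size=(300, 2))  # stand-in for the 2D PCA output
picked = nearest_to_grid(demo_points)
print(picked.shape)
```

Each index in `picked` selects a row of the original 64-feature data, which can be reshaped to (8, 8) for step (5). In sparse regions the same point may be the nearest one for more than one grid node.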

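Case 2: reducing point cloud data with PCA

In 3D vision, a point cloud is a dataset made up of a large number of points in 3D space; each point consists of three coordinates x, y, z.
A point cloud scanned by a structured-light system:
[Figure: point cloud of a shoe]
We apply PCA to the shoe point cloud above, with the following result:
[Figure: result of PCA on the shoe point cloud]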
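Since the scanned shoe data is not included here, the following sketch uses a synthetic cloud instead: a thin, tilted rectangular patch, chosen as an assumption so that the expected result is known. It illustrates what PCA does to 3D points: the first two principal axes span the plane in which the cloud is most spread out, and the third carries almost no variance.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for a scanned cloud: a flat 4 x 2 patch with slight thickness
flat = np.column_stack([
    rng.uniform(-2.0, 2.0, 500),
    rng.uniform(-1.0, 1.0, 500),
    rng.normal(0.0, 0.01, 500),
])
# Rotate the patch by 30 degrees about the x axis, then shift it away from the origin
theta = np.deg2rad(30.0)
rot = np.array([
    [1.0, 0.0, 0.0],
    [0.0, np.cos(theta), -np.sin(theta)],
    [0.0, np.sin(theta), np.cos(theta)],
])
cloud = flat @ rot.T + np.array([10.0, -5.0, 2.0])

centered = cloud - cloud.mean(axis=0)
eig_vals, eig_vecs = np.linalg.eigh(centered.T @ centered)  # ascending eigenvalues
order = np.argsort(eig_vals)[::-1]
axes = eig_vecs[:, order]           # columns: principal axes, largest variance first
cloud_2d = centered @ axes[:, :2]   # coordinates within the best-fit plane
explained = eig_vals[order] / eig_vals.sum()
print(cloud_2d.shape, explained.round(4))
```

The last entry of `explained` is close to zero, which is why dropping the third axis loses little for a flat object; for a volumetric shape such as a shoe the third component would carry a larger share of the variance.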

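### Computing point cloud normals with PCA

In computer graphics and geometry processing, using PCA (principal component analysis) to compute point cloud normals is a common and effective approach. PCA finds the best-fit plane of a local region of the point cloud, and that plane determines the normal at the points in the region.

#### Overview of the computation

For each target point, select a number of nearest neighbors within its neighborhood to form a subset. Center the coordinates of all points in the subset, that is, take their mean position as the new origin. Then build the covariance matrix and find the unit-length eigenvector belonging to its smallest eigenvalue. This eigenvector is the surface normal direction, because the two directions of largest variance span the best-fit plane[^1].

#### Implementation details

In code, the `pca_estimate_normals` function from the CGAL library carries out this procedure directly. The call below is written against the named-parameter interface of CGAL 5, where the function takes a concurrency tag and writes each normal through a normal property map:

```cpp
#include <CGAL/Exact_predicates_inexact_constructions_kernel.h>
#include <CGAL/pca_estimate_normals.h>
#include <CGAL/property_map.h>
#include <utility>
#include <vector>

// Point type, normal type, and a (point, normal) pair that stores the result
typedef CGAL::Exact_predicates_inexact_constructions_kernel Kernel;
typedef Kernel::Point_3 Point;
typedef Kernel::Vector_3 Vector;
typedef std::pair<Point, Vector> PointVectorPair;

void estimateNormals(std::vector<PointVectorPair>& points) {
    const unsigned int k_neighbors = 10; // use the k nearest neighbors
    CGAL::pca_estimate_normals<CGAL::Sequential_tag>(
        points, k_neighbors,
        CGAL::parameters::point_map(CGAL::First_of_pair_property_map<PointVectorPair>())
            .normal_map(CGAL::Second_of_pair_property_map<PointVectorPair>()));
}
```

This code shows how to call `CGAL::pca_estimate_normals` to do the actual normal estimation. The helper function `estimateNormals` takes a list of discrete points in 3D space, each paired with a slot for its normal, and specifies how many of the nearest neighbors of each point are used in the PCA. The function returns nothing: the estimated normals are written into the second element of each pair, so the input becomes a point set that carries the estimated normal information[^4]. The normals produced this way are not consistently oriented.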
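The same procedure can be sketched in plain NumPy, in keeping with the Python code earlier in the article. For each point the normal is taken as the eigenvector with the smallest eigenvalue of the local scatter matrix, that is, the direction of least variance among its neighbors. `estimate_normals` is a hypothetical helper, the neighbor search is brute force and suited to small clouds only, and the test data is a synthetic plane.

```python
import numpy as np


def estimate_normals(points, k=10):
    """Estimate a unit normal per point from the PCA of its k nearest neighbors."""
    points = np.asarray(points, dtype=float)
    # Brute-force pairwise squared distances: O(N^2) memory, small clouds only
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    neighbor_idx = np.argsort(d2, axis=1)[:, :k]  # each point's own index is included
    normals = np.empty_like(points)
    for i, idx in enumerate(neighbor_idx):
        local = points[idx] - points[idx].mean(axis=0)  # center the neighborhood
        # eigh returns eigenvalues in ascending order, so column 0 is the
        # direction of least variance, i.e. the normal of the fitted plane
        _, vecs = np.linalg.eigh(local.T @ local)
        normals[i] = vecs[:, 0]
    return normals


rng = np.random.default_rng(0)
# Points scattered on the plane z = 0, whose normal is (0, 0, 1) up to sign
plane = np.column_stack([
    rng.uniform(-1.0, 1.0, 200),
    rng.uniform(-1.0, 1.0, 200),
    np.zeros(200),
])
normals = estimate_normals(plane, k=10)
print(np.allclose(np.abs(normals[:, 2]), 1.0))
```

The sign of each normal is arbitrary here; orienting normals consistently (for example toward the scanner position) is a separate step that this sketch does not attempt.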