Data (序号 is the row-number column):
序号 | x1 | x2 | x3 | x4 |
1 | 40 | 2 | 5 | 20 |
2 | 10 | 1.5 | 5 | 30 |
3 | 120 | 3 | 13 | 50 |
4 | 250 | 4.5 | 18 | 0 |
5 | 120 | 3.5 | 9 | 50 |
6 | 10 | 1.5 | 12 | 50 |
7 | 40 | 1 | 19 | 40 |
8 | 270 | 4 | 13 | 60 |
9 | 280 | 3.5 | 11 | 60 |
10 | 170 | 3 | 9 | 60 |
11 | 180 | 3.5 | 14 | 40 |
12 | 130 | 2 | 30 | 50 |
13 | 220 | 1.5 | 17 | 20 |
14 | 160 | 1.5 | 35 | 60 |
15 | 220 | 2.5 | 14 | 30 |
16 | 140 | 2 | 20 | 20 |
17 | 220 | 2 | 14 | 10 |
18 | 40 | 1 | 10 | 0 |
19 | 20 | 1 | 12 | 60 |
20 | 120 | 2 | 20 | 0 |
Standardized data:
| x1 | x2 | x3 | x4 |
0 | -1.102513 | -0.308130 | -1.347755 | -0.708447 |
1 | -1.440017 | -0.782175 | -1.347755 | -0.251384 |
2 | -0.202502 | 0.639961 | -0.269551 | 0.662740 |
3 | 1.260015 | 2.062098 | 0.404327 | -1.622571 |
4 | -0.202502 | 1.114007 | -0.808653 | 0.662740 |
5 | -1.440017 | -0.782175 | -0.404327 | 0.662740 |
6 | -1.102513 | -1.256220 | 0.539102 | 0.205678 |
7 | 1.485017 | 1.588052 | -0.269551 | 1.119803 |
8 | 1.597518 | 1.114007 | -0.539102 | 1.119803 |
9 | 0.360004 | 0.639961 | -0.808653 | 1.119803 |
10 | 0.472505 | 1.114007 | -0.134776 | 0.205678 |
11 | -0.090001 | -0.308130 | 2.021633 | 0.662740 |
12 | 0.922511 | -0.782175 | 0.269551 | -0.708447 |
13 | 0.247503 | -0.782175 | 2.695510 | 1.119803 |
14 | 0.922511 | 0.165916 | -0.134776 | -0.251384 |
15 | 0.022500 | -0.308130 | 0.673878 | -0.708447 |
16 | 0.922511 | -0.308130 | -0.134776 | -1.165509 |
17 | -1.102513 | -1.256220 | -0.673878 | -1.622571 |
18 | -1.327515 | -1.256220 | -0.404327 | 1.119803 |
19 | -0.202502 | -0.308130 | 0.673878 | -1.622571 |
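Standardization can also be done with the sklearn package:
from sklearn import preprocessing
# Z-score standardization
# Create a StandardScaler object
zscore = preprocessing.StandardScaler()
# Standardize; data is the numeric table to be scaled
data_zs = zscore.fit_transform(data)
Note: sklearn divides by n when it computes the standard deviation, whereas the std used in the code below divides by n-1, as SPSS does.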
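To make the n versus n-1 difference concrete, here is a minimal sketch that standardizes the x1 column both ways with NumPy only. The variable names are illustrative and not from the original code; `ddof=0` corresponds to the denominator n that StandardScaler uses, and `ddof=1` to the denominator n-1 of pandas `.std()`.

```python
import numpy as np

# x1 column copied from the data table at the top of this post
x1 = np.array([40, 10, 120, 250, 120, 10, 40, 270, 280, 170,
               180, 130, 220, 160, 220, 140, 220, 40, 20, 120], dtype=float)
n = x1.size

# Population standard deviation (denominator n)
z_pop = (x1 - x1.mean()) / x1.std(ddof=0)
# Sample standard deviation (denominator n-1)
z_sample = (x1 - x1.mean()) / x1.std(ddof=1)

# The two versions differ only by the constant factor sqrt(n / (n - 1))
ratio = z_pop / z_sample
print("first z-score, n-1 denominator:", z_sample[0])
print("first z-score, n denominator:  ", z_pop[0])
print("ratio:", ratio[0], "expected:", np.sqrt(n / (n - 1)))
```

Because the factor is the same for every value, the choice rescales the standardized columns but does not change the correlation matrix that the analysis below is built on.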
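Dimensionality reduction with sklearn:
import sklearn.decomposition as dp # dp is not defined in the original snippet; this alias is assumed
pca=dp.PCA(n_components=2) # create the PCA model, keeping 2 principal components
reduced_x=pca.fit_transform(x) # reduce the original data x and store the result in reduced_x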
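For intuition about what `fit_transform` computes, the following is a rough NumPy-only sketch, not sklearn's actual implementation: center each column, take an SVD, and project onto the first two right singular vectors. Here `x` is assumed to be the 20 x 4 data table from the top of this post, and the signs of the resulting columns may differ from what sklearn returns.

```python
import numpy as np

# Assumed input: the four data columns from the table at the top of this post
x = np.column_stack([
    [40, 10, 120, 250, 120, 10, 40, 270, 280, 170,
     180, 130, 220, 160, 220, 140, 220, 40, 20, 120],
    [2, 1.5, 3, 4.5, 3.5, 1.5, 1, 4, 3.5, 3,
     3.5, 2, 1.5, 1.5, 2.5, 2, 2, 1, 1, 2],
    [5, 5, 13, 18, 9, 12, 19, 13, 11, 9,
     14, 30, 17, 35, 14, 20, 14, 10, 12, 20],
    [20, 30, 50, 0, 50, 50, 40, 60, 60, 60,
     40, 50, 20, 60, 30, 20, 10, 0, 60, 0],
]).astype(float)

# Center each column; the columns are NOT rescaled in this sketch
x_centered = x - x.mean(axis=0)

# Rows of Vt are the principal axes, ordered by decreasing singular value
U, S, Vt = np.linalg.svd(x_centered, full_matrices=False)

# Project onto the first two principal axes
reduced_x = x_centered @ Vt[:2].T

# Share of the total variance carried by each component
explained = S**2 / np.sum(S**2)
print(reduced_x.shape)
print(explained)
```

Without standardization, x1 has by far the largest variance, so the first component is dominated by it. Standardizing first, as the rest of this post does, puts the four variables on a comparable scale.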
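Standardization code:
import pandas as pd
import numpy as np
csv_data = pd.read_csv('C:/Users/admin/Desktop/2019.10.05/算法/主成分分析/data.csv') # read the data
csv_data=csv_data.drop('序号', axis=1) # drop the 序号 (row number) column
csv_data=csv_data.astype(float) # integer columns must be float before z-scores are written back
describe=csv_data.describe() # per-column statistics: count, mean, std, quantiles, etc.
mean=describe.loc['mean']
std=describe.loc['std'] # sample standard deviation (denominator n-1)
m=csv_data.index.size # number of rows
n=csv_data.columns.size # number of columns
column=csv_data.columns.values # ['x1' 'x2' 'x3' 'x4']
# Apply the standardization to every element of the data frame
for i in range(0,m):
    for j in range(0,n):
        csv_data.iloc[i,j]=(csv_data.iloc[i,j]-mean.iloc[j])/std.iloc[j] # row i, column j
print("Standardized data:\n",csv_data)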
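The element-by-element loop above can also be written as one vectorized expression. This is a sketch under the assumption that the data frame holds only the four numeric columns. The name `df` and the inline data, copied from the table at the top of this post, stand in for reading data.csv.

```python
import pandas as pd

# Inline copy of the data table, used here instead of reading data.csv
df = pd.DataFrame({
    "x1": [40, 10, 120, 250, 120, 10, 40, 270, 280, 170,
           180, 130, 220, 160, 220, 140, 220, 40, 20, 120],
    "x2": [2, 1.5, 3, 4.5, 3.5, 1.5, 1, 4, 3.5, 3,
           3.5, 2, 1.5, 1.5, 2.5, 2, 2, 1, 1, 2],
    "x3": [5, 5, 13, 18, 9, 12, 19, 13, 11, 9,
           14, 30, 17, 35, 14, 20, 14, 10, 12, 20],
    "x4": [20, 30, 50, 0, 50, 50, 40, 60, 60, 60,
           40, 50, 20, 60, 30, 20, 10, 0, 60, 0],
})

# pandas aligns the column means and stds with the columns automatically;
# DataFrame.std() uses the n-1 denominator by default
standardized = (df - df.mean()) / df.std()
print(standardized.head())
```

The result should agree with the standardized table shown earlier, for example -1.102513 for the first x1 value.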
Principal component analysis:
import pandas as pd
import math
import numpy as np
from scipy import linalg
csv_data = pd.read_csv('C:/Users/admin/Desktop/2019.10.05/算法/主成分分析/data.csv') # read the data
csv_data=csv_data.drop('序号', axis=1) # drop the 序号 (row number) column
corr = csv_data.corr() # correlation coefficients between the variables, used to judge whether PCA is appropriate
print("Raw data:\n",csv_data)
print("\nCorrelation matrix:\n",corr)
describe=csv_data.describe() # per-column statistics: count, mean, std, quantiles, etc.
mean=describe.loc['mean']
std=describe.loc['std']
# Standardize each column: subtract the mean and divide by the sample std
a=list(csv_data['x1'])
x11=[]
for i in range(0,len(a)):
    x11.append((a[i]-mean['x1'])/std['x1'])
b=list(csv_data['x2'])
x22=[]
for i in range(0,len(b)):
    x22.append((b[i]-mean['x2'])/std['x2'])
c=list(csv_data['x3'])
x33=[]
for i in range(0,len(c)):
    x33.append((c[i]-mean['x3'])/std['x3'])
d=list(csv_data['x4'])
x44=[]
for i in range(0,len(d)):
    x44.append((d[i]-mean['x4'])/std['x4'])
arr=np.array([x11,x22,x33,x44]) # standardized data, one row per variable
print("\nStandardized data:\n",arr.T)
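M=corr.values # correlation matrix as a NumPy array
eig,vec=np.linalg.eig(M) # eigenvalues and eigenvectors; both are numpy.ndarray, and the columns of vec are the eigenvectors
per=[] # contribution rate of each component
for i in range(0,4):
    per.append(eig[i]/sum(eig))
print("\nEigenvalues of the correlation matrix:\n",eig)
# vec1=vec[[:]][:,[1,3,2,0]]
per=sorted(per,reverse=True) # sort the contribution rates in descending order
print("\nContribution rates (sorted):\n",per)
print("\nCumulative contribution rate:\n",np.array(per).cumsum()) # running total of the contribution rates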
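# Function that orthonormalizes a set of vectors
def gram_schmidt(A):
    """Orthonormalize the columns of A with the Gram-Schmidt process."""
    Q=np.zeros_like(A)
    cnt = 0
    for a in A.T:
        u = np.copy(a)
        for i in range(0, cnt):
            u -= np.dot(np.dot(Q[:, i].T, a), Q[:, i]) # subtract the projection of a onto each vector already found
        e = u / np.linalg.norm(u) # normalize to unit length
        Q[:, cnt] = e
        cnt += 1
    R = np.dot(Q.T, A) # upper-triangular factor of A = QR (not used below)
    print("\nOrthonormalized eigenvectors:")
    print(Q.T)
    return Q # returned to the caller instead of being declared global
Q=gram_schmidt(vec)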
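print("\nOrthonormalized eigenvectors ordered by eigenvalue (largest first):")
order=np.argsort(eig)[::-1] # indices of the eigenvalues in descending order; [1,3,2,0] for the output shown below
print(Q.T[order])
y=np.dot(arr.T,Q.T[order].T) # project the standardized data onto the ordered eigenvectors
Y=pd.DataFrame(y)
Y.rename(columns={0:'Y1',1:'Y2', 2:'Y3',3:'Y4'}, inplace = True)
print("\nPrincipal component values (scores):\n",Y)
print("\nCorrelation matrix of the principal components:")
corr1=Y.corr()
print(corr1)
result = csv_data.join(Y,how='inner')
print("\nRaw data and principal component scores:")
print(result)
corr2=result.corr()
print("\nCorrelation coefficients between the raw data and the principal component scores:")
print(corr2.iloc[0:4, 4:8])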
Output:
Correlation matrix:
| x1 | x2 | x3 | x4 |
x1 | 1.000000 | 0.694984 | 0.219456 | 0.024898 |
x2 | 0.694984 | 1.000000 | -0.147955 | 0.135133 |
x3 | 0.219456 | -0.147955 | 1.000000 | 0.071327 |
x4 | 0.024898 | 0.135133 | 0.071327 | 1.000000 |
Eigenvalues of the correlation matrix:
[0.20686561 1.71825161 0.98134701 1.09353577]
Contribution rates (sorted):
[0.42956290217587323, 0.2733839423331357, 0.2453367536310203, 0.05171640185997065]
Cumulative contribution rate:
[0.4295629 0.70294684 0.9482836 1. ]
Orthonormalized eigenvectors:
[[ 0.66588327 -0.66355498 -0.31889547 0.12083021]
[-0.69996363 -0.6897981 -0.08793923 -0.16277651]
[-0.24004879 0.05846333 -0.27031356 0.93053167]
[ 0.09501037 -0.28364662 0.9041587 0.30498307]]
Orthonormalized eigenvectors ordered by eigenvalue (largest first):
[[-0.69996363 -0.6897981 -0.08793923 -0.16277651]
[ 0.09501037 -0.28364662 0.9041587 0.30498307]
[-0.24004879 0.05846333 -0.27031356 0.93053167]
[ 0.66588327 -0.66355498 -0.31889547 0.12083021]]
Principal component values (scores):
| Y1 | Y2 | Y3 | Y4 |
0 | 1.218105 | -1.451999 | -0.048273 | -0.185493 |
1 | 1.706942 | -1.210208 | 0.430341 | -0.040449 |
2 | -0.383874 | -0.242355 | 0.775589 | -0.393455 |
3 | -2.075835 | -0.594474 | -1.801057 | -0.854286 |
4 | -0.663462 | -0.864250 | 0.949030 | -0.536093 |
5 | 1.475180 | -0.078406 | 1.025942 | -0.230850 |
6 | 1.557370 | 0.801735 | 0.236877 | -0.047638 |
7 | -2.293467 | -0.211550 | 0.851241 | 0.156353 |
8 | -2.021514 | -0.310116 | 0.869385 | 0.631779 |
9 | -0.804599 | -0.536949 | 1.211598 | 0.208253 |
10 | -1.120804 | -0.330221 | 0.179526 | -0.356740 |
11 | -0.010115 | 2.108850 | 0.073817 | -0.420080 |
12 | -0.014567 | 0.337162 | -0.999271 | 0.961740 |
13 | -0.053019 | 3.024066 | 0.208238 | -0.040456 |
14 | -0.707401 | -0.157940 | -0.409237 | 0.516795 |
15 | 0.252856 | 0.482766 | -0.864806 | -0.081055 |
16 | -0.231607 | -0.302271 | -1.287573 | 0.720896 |
17 | 1.961634 | -0.852576 | -1.136482 | 0.118267 |
18 | 1.649029 | 0.206141 | 1.396532 | 0.213845 |
19 | 0.559148 | 0.182595 | -1.661416 | -0.341334 |
Correlation matrix of the principal components:
| Y1 | Y2 | Y3 | Y4 |
Y1 | 1.000000e+00 | -2.120752e-16 | -4.499891e-17 | 7.693762e-16 |
Y2 | -2.120752e-16 | 1.000000e+00 | 1.974226e-16 | -6.972072e-16 |
Y3 | -4.499891e-17 | 1.974226e-16 | 1.000000e+00 | 2.075015e-16 |
Y4 | 7.693762e-16 | -6.972072e-16 | 2.075015e-16 | 1.000000e+00 |
Raw data and principal component scores:
| x1 | x2 | x3 | x4 | Y1 | Y2 | Y3 | Y4 |
0 | 40 | 2.0 | 5 | 20 | 1.218105 | -1.451999 | -0.048273 | -0.185493 |
1 | 10 | 1.5 | 5 | 30 | 1.706942 | -1.210208 | 0.430341 | -0.040449 |
2 | 120 | 3.0 | 13 | 50 | -0.383874 | -0.242355 | 0.775589 | -0.393455 |
3 | 250 | 4.5 | 18 | 0 | -2.075835 | -0.594474 | -1.801057 | -0.854286 |
4 | 120 | 3.5 | 9 | 50 | -0.663462 | -0.864250 | 0.949030 | -0.536093 |
5 | 10 | 1.5 | 12 | 50 | 1.475180 | -0.078406 | 1.025942 | -0.230850 |
6 | 40 | 1.0 | 19 | 40 | 1.557370 | 0.801735 | 0.236877 | -0.047638 |
7 | 270 | 4.0 | 13 | 60 | -2.293467 | -0.211550 | 0.851241 | 0.156353 |
8 | 280 | 3.5 | 11 | 60 | -2.021514 | -0.310116 | 0.869385 | 0.631779 |
9 | 170 | 3.0 | 9 | 60 | -0.804599 | -0.536949 | 1.211598 | 0.208253 |
10 | 180 | 3.5 | 14 | 40 | -1.120804 | -0.330221 | 0.179526 | -0.356740 |
11 | 130 | 2.0 | 30 | 50 | -0.010115 | 2.108850 | 0.073817 | -0.420080 |
12 | 220 | 1.5 | 17 | 20 | -0.014567 | 0.337162 | -0.999271 | 0.961740 |
13 | 160 | 1.5 | 35 | 60 | -0.053019 | 3.024066 | 0.208238 | -0.040456 |
14 | 220 | 2.5 | 14 | 30 | -0.707401 | -0.157940 | -0.409237 | 0.516795 |
15 | 140 | 2.0 | 20 | 20 | 0.252856 | 0.482766 | -0.864806 | -0.081055 |
16 | 220 | 2.0 | 14 | 10 | -0.231607 | -0.302271 | -1.287573 | 0.720896 |
17 | 40 | 1.0 | 10 | 0 | 1.961634 | -0.852576 | -1.136482 | 0.118267 |
18 | 20 | 1.0 | 12 | 60 | 1.649029 | 0.206141 | 1.396532 | 0.213845 |
19 | 120 | 2.0 | 20 | 0 | 0.559148 | 0.182595 | -1.661416 | -0.341334 |
Correlation coefficients between the raw data and the principal component scores:
| Y1 | Y2 | Y3 | Y4 |
x1 | -0.917527 | 0.099354 | -0.237799 | 0.302860 |
x2 | -0.904202 | -0.296616 | 0.057916 | -0.301801 |
x3 | -0.115273 | 0.945499 | -0.267781 | -0.145042 |
x4 | -0.213371 | 0.318928 | 0.921812 | 0.054957 |
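As a cross-check on the output above, the sketch below recomputes the eigenvalues, contribution rates, and variable-to-component correlations directly from the printed correlation matrix. It swaps in `np.linalg.eigh`, which is intended for symmetric matrices and returns orthonormal eigenvectors, so no separate Gram-Schmidt step is needed here. This is an alternative route, not the code used above, and the variable names are illustrative.

```python
import numpy as np

# Correlation matrix copied from the output above (rounded to 6 decimals)
R = np.array([
    [1.000000,  0.694984,  0.219456, 0.024898],
    [0.694984,  1.000000, -0.147955, 0.135133],
    [0.219456, -0.147955,  1.000000, 0.071327],
    [0.024898,  0.135133,  0.071327, 1.000000],
])

# eigh returns eigenvalues in ascending order; the columns of V are the eigenvectors
eigvals, V = np.linalg.eigh(R)

# Reorder so the largest eigenvalue comes first
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
V = V[:, order]

# Contribution rate of each component and the running total
contribution = eigvals / eigvals.sum()
cumulative = np.cumsum(contribution)

# corr(x_i, Y_j) = V[i, j] * sqrt(eigvals[j]); the sign of each column is arbitrary
loadings = V * np.sqrt(eigvals)

print("eigenvalues:", eigvals)
print("cumulative contribution:", cumulative)
print("loadings:\n", loadings)
```

Up to the sign of each column, `loadings` should reproduce the last table: x1 and x2 load heavily on Y1, x3 on Y2, and x4 on Y3. The first three components together account for roughly 95% of the total variance.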