[Problem Background]
Suppose we have the following dataset, built as a Python dictionary:
data = [[3, 5, 3, 6],
[4, 3, 5, 8],
[5, 1, 4, 10],
[6, 3, 2, 13],
[19, 23, 32, 101],
[20, 23, 45, 106],
[23, 6, 7, 69],
[24, 11, 44, 73],
[25, 2, 3, 129],
[26, 3, 2, 133],
[21, 1, 23, 110],
[22, 12, 11, 115],
[23, 2, 43, 120],
[24, 7, 9, 124],
[15, 5, 4, 43],
[16, 6, 7, 46],
[17, 1, 4, 49],
[18, 2, 3, 53],
[27, 4, 4, 138],
[29, 5, 6, 143],
[7, 2, 4, 15],
[8, 14, 8, 17],
[9, 22, 33, 20],
[10, 43, 57, 22],
[11, 1, 32, 24],
[12, 2, 34, 27],
[19, 4, 6, 56],
[20, 3, 5, 59],
[21, 3, 4, 63],
[22, 3, 22, 66]
]
target = [0, 0, 0, 0, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
Data = {'data':data, 'target':target}
The dictionary Data contains data and target: data is a large list holding 30 small lists, each a four-dimensional sample, and target is the list of the 30 corresponding labels, which take one of three values, {0, 1, 2}.
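Before reducing anything, it can be worth confirming this structure; the following minimal check only inspects the Data dictionary defined above:
# Quick sanity check of the hand-built dataset
print(len(Data['data']))       # 30 samples
print(len(Data['data'][0]))    # 4 features per sample
print(set(Data['target']))     # {0, 1, 2}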
How can this dataset be reduced to two dimensions?
[Problem Analysis]
Import PCA from sklearn.decomposition ("decomposition" literally means breaking something down) so that we can perform PCA (principal component analysis) dimensionality reduction.
Import matplotlib.pyplot so that we can plot the points:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
Build the dictionary dataset in Python; the code was already given in [Problem Background] above.
Copy the data part into X and the label part into y:
X = Data['data']
y = Data['target']
Create the reducer pca with PCA(), setting the target number of dimensions to n_components=2, i.e. the data will be reduced to two dimensions:
pca = PCA(n_components=2)
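As a side note (not used in the rest of this article), n_components also accepts a float between 0 and 1, in which case scikit-learn keeps just enough components to explain that fraction of the variance; a minimal sketch using the PCA class imported above:
# Keep enough principal components to explain at least 95% of the variance;
# the number actually chosen is available as pca_var.n_components_ after fitting
pca_var = PCA(n_components=0.95)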
Pass the data X into the reducer pca's fit_transform() method; the reduced data is returned as reduced_X:
reduced_X = pca.fit_transform(X)
Now print(reduced_X) shows that the original 30 four-dimensional samples have been turned into 30 two-dimensional ones.
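Since fit_transform() returns a NumPy array, the shape and how much variance the two components retain can also be checked directly; a small sketch:
print(reduced_X.shape)                 # (30, 2): 30 samples, now 2-dimensional
print(pca.explained_variance_ratio_)   # fraction of the total variance carried by each component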
Next, plot the three classes as a scatter plot.
Create x- and y-coordinate lists for the three classes:
A_x, A_y = [], []
B_x, B_y = [], []
C_x, C_y = [], []
Use the label information to distribute the two-dimensional points into the x and y coordinate lists of the three classes:
for i in range(len(reduced_X)):
    if y[i] == 0:
        A_x.append(reduced_X[i][0])
        A_y.append(reduced_X[i][1])
    elif y[i] == 1:
        B_x.append(reduced_X[i][0])
        B_y.append(reduced_X[i][1])
    elif y[i] == 2:
        C_x.append(reduced_X[i][0])
        C_y.append(reduced_X[i][1])
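Because reduced_X is already a NumPy array, the same split can also be written more compactly with boolean indexing; this is only an alternative sketch, and the loop above is what the rest of the article uses:
import numpy as np

y_arr = np.array(y)
A = reduced_X[y_arr == 0]   # all class-0 points, shape (n0, 2)
B = reduced_X[y_arr == 1]   # all class-1 points
C = reduced_X[y_arr == 2]   # all class-2 points
# A[:, 0] and A[:, 1] then play the roles of A_x and A_y, and similarly for B and C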
Finally, use plt.scatter() to draw the scatter points and plt.show() to display the figure.
The first two arguments of plt.scatter() are the coordinate lists; the keyword argument c sets the color ('r' red, 'b' blue, 'g' green, 'y' yellow, etc.), and the keyword argument marker sets the point shape ('s' square, 'x' cross, '.' small dot, 'D' diamond, etc.):
plt.scatter(A_x, A_y, c='y', marker='s')
plt.scatter(B_x, B_y, c='b', marker='x')
plt.scatter(C_x, C_y, c='g', marker='.')
plt.show()
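For completeness, the three scatter calls can also be written as a single loop with a legend; this is just a variant of the code above, reusing the coordinate lists already built, with the class names chosen here only for illustration:
# Plot each class once, attach a label, and show a legend
for xs, ys, color, mark, name in [(A_x, A_y, 'y', 's', 'class 0'),
                                  (B_x, B_y, 'b', 'x', 'class 1'),
                                  (C_x, C_y, 'g', '.', 'class 2')]:
    plt.scatter(xs, ys, c=color, marker=mark, label=name)
plt.legend()
plt.show()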
Finally, the complete code:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
data = [[3, 5, 3, 6],
[4, 3, 5, 8],
[5, 1, 4, 10],
[6, 3, 2, 13],
[19, 23, 32, 101],
[20, 23, 45, 106],
[23, 6, 7, 69],
[24, 11, 44, 73],
[25, 2, 3, 129],
[26, 3, 2, 133],
[21, 1, 23, 110],
[22, 12, 11, 115],
[23, 2, 43, 120],
[24, 7, 9, 124],
[15, 5, 4, 43],
[16, 6, 7, 46],
[17, 1, 4, 49],
[18, 2, 3, 53],
[27, 4, 4, 138],
[29, 5, 6, 143],
[7, 2, 4, 15],
[8, 14, 8, 17],
[9, 22, 33, 20],
[10, 43, 57, 22],
[11, 1, 32, 24],
[12, 2, 34, 27],
[19, 4, 6, 56],
[20, 3, 5, 59],
[21, 3, 4, 63],
[22, 3, 22, 66]
]
target = [0, 0, 0, 0, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
Data = {'data':data, 'target':target}
X = Data['data']
y = Data['target']
pca = PCA(n_components=2)
reduced_X = pca.fit_transform(X)
A_x, A_y = [], []
B_x, B_y = [], []
C_x, C_y = [], []
for i in range(len(reduced_X)):
    if y[i] == 0:
        A_x.append(reduced_X[i][0])
        A_y.append(reduced_X[i][1])
    elif y[i] == 1:
        B_x.append(reduced_X[i][0])
        B_y.append(reduced_X[i][1])
    elif y[i] == 2:
        C_x.append(reduced_X[i][0])
        C_y.append(reduced_X[i][1])
plt.scatter(A_x, A_y, c='y', marker='s')
plt.scatter(B_x, B_y, c='b', marker='x')
plt.scatter(C_x, C_y, c='g', marker='.')
plt.show()
The code in this article was written and explained with reference to the dimensionality-reduction part of teacher 礼欣's course《Python机器学习应用》(Python Machine Learning Applications) on 中国大学MOOC. Using the iris dataset load_iris from sklearn.datasets directly would not make it easy for readers to see the internal structure of the object being preprocessed, and since the iris dataset is in effect just a dictionary containing data and labels, a dataset was built by hand here to highlight the preprocessing steps.