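This article applies sklearn's KMeans to an unlabeled dataset, the minute_weather weather dataset, and also implements the k-means algorithm by hand. The two results are compared to check whether the custom implementation is usable.
Dataset minute_weather:
Link: https://pan.baidu.com/s/1Ko6YK2xJNiDRsq2befcYsQ
Extraction code: wwww
Notes
The minute_weather dataset used in this experiment is not "clean", so the data was cleaned first. The "junk" data includes null values and columns or rows that are entirely 0. Because the dataset is too large to process efficiently, only 1000 rows were used for the experiment.
The clustering results show that the hand-written k-means algorithm and sklearn's built-in k-means produce similar clusters. In the visualization the clusters are not clearly separated, and there are no obvious gaps between classes. This is partly due to how the data is plotted: choosing two other features as [x, y] might show the separation more clearly.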
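One way to put a number on how close two clusterings are is the adjusted Rand index, which compares two label assignments while ignoring how the cluster ids happen to be numbered. The sketch below is not part of the original experiment: `labels_custom` and `labels_sklearn` are hypothetical names, filled here with toy values in place of the real label arrays.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# Hypothetical label arrays; in the experiment these would come from the
# hand-written k-means and from sklearn's KMeans on the same rows.
labels_custom = np.array([0, 0, 1, 1, 2, 2])
labels_sklearn = np.array([2, 2, 0, 0, 1, 1])  # same grouping, different ids

# 1.0 means identical partitions; values near 0 mean chance-level agreement.
ari = adjusted_rand_score(labels_custom, labels_sklearn)
print("adjusted Rand index:", ari)
```

Because cluster ids are arbitrary, comparing the two label arrays element by element would understate the agreement; a permutation-invariant score avoids that.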
Code implementation
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
# Raw string so the backslashes in the Windows path are not treated as escapes
path = r'D:\myData\code\python\data\minute_weather.csv'
pdData = pd.read_csv(path)
pdData.head()
| | rowID | hpwren_timestamp | air_pressure | air_temp | avg_wind_direction | avg_wind_speed | max_wind_direction | max_wind_speed | min_wind_direction | min_wind_speed | rain_accumulation | rain_duration | relative_humidity |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 2011-09-10 00:00:49 | 912.3 | 64.76 | 97.0 | 1.2 | 106.0 | 1.6 | 85.0 | 1.0 | NaN | NaN | 60.5 |
1 | 1 | 2011-09-10 00:01:49 | 912.3 | 63.86 | 161.0 | 0.8 | 215.0 | 1.5 | 43.0 | 0.2 | 0.0 | 0.0 | 39.9 |
2 | 2 | 2011-09-10 00:02:49 | 912.3 | 64.22 | 77.0 | 0.7 | 143.0 | 1.2 | 324.0 | 0.3 | 0.0 | 0.0 | 43.0 |
3 | 3 | 2011-09-10 00:03:49 | 912.3 | 64.40 | 89.0 | 1.2 | 112.0 | 1.6 | 12.0 | 0.7 | 0.0 | 0.0 | 49.5 |
4 | 4 | 2011-09-10 00:04:49 | 912.3 | 64.40 | 185.0 | 0.4 | 260.0 | 1.0 | 100.0 | 0.1 | 0.0 | 0.0 | 58.8 |
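When the full file is too large to work with comfortably, `pd.read_csv` can load only selected columns and a limited number of rows. The sketch below is an alternative to the workflow in this article, not the method used here, and it reads a tiny in-memory stand-in (`csv_text`, a hypothetical name) instead of the real file. Note that `nrows` cuts the file before any cleaning, so the rows it keeps may still contain NaN.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for minute_weather.csv (values taken from the preview).
csv_text = """rowID,hpwren_timestamp,air_pressure,air_temp,relative_humidity
0,2011-09-10 00:00:49,912.3,64.76,60.5
1,2011-09-10 00:01:49,912.3,63.86,39.9
2,2011-09-10 00:02:49,912.3,64.22,43.0
"""

# usecols keeps only the listed columns; nrows stops after that many data rows.
sample = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["air_pressure", "air_temp", "relative_humidity"],
    nrows=2,
)
print(sample.shape)
```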
Data preprocessing
pdData.iloc[:,2:].head()
| | air_pressure | air_temp | avg_wind_direction | avg_wind_speed | max_wind_direction | max_wind_speed | min_wind_direction | min_wind_speed | rain_accumulation | rain_duration | relative_humidity |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 912.3 | 64.76 | 97.0 | 1.2 | 106.0 | 1.6 | 85.0 | 1.0 | NaN | NaN | 60.5 |
1 | 912.3 | 63.86 | 161.0 | 0.8 | 215.0 | 1.5 | 43.0 | 0.2 | 0.0 | 0.0 | 39.9 |
2 | 912.3 | 64.22 | 77.0 | 0.7 | 143.0 | 1.2 | 324.0 | 0.3 | 0.0 | 0.0 | 43.0 |
3 | 912.3 | 64.40 | 89.0 | 1.2 | 112.0 | 1.6 | 12.0 | 0.7 | 0.0 | 0.0 | 49.5 |
4 | 912.3 | 64.40 | 185.0 | 0.4 | 260.0 | 1.0 | 100.0 | 0.1 | 0.0 | 0.0 | 58.8 |
weatherdata =pdData.iloc[:,2:]
weatherdata.head()
| | air_pressure | air_temp | avg_wind_direction | avg_wind_speed | max_wind_direction | max_wind_speed | min_wind_direction | min_wind_speed | rain_accumulation | rain_duration | relative_humidity |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 912.3 | 64.76 | 97.0 | 1.2 | 106.0 | 1.6 | 85.0 | 1.0 | NaN | NaN | 60.5 |
1 | 912.3 | 63.86 | 161.0 | 0.8 | 215.0 | 1.5 | 43.0 | 0.2 | 0.0 | 0.0 | 39.9 |
2 | 912.3 | 64.22 | 77.0 | 0.7 | 143.0 | 1.2 | 324.0 | 0.3 | 0.0 | 0.0 | 43.0 |
3 | 912.3 | 64.40 | 89.0 | 1.2 | 112.0 | 1.6 | 12.0 | 0.7 | 0.0 | 0.0 | 49.5 |
4 | 912.3 | 64.40 | 185.0 | 0.4 | 260.0 | 1.0 | 100.0 | 0.1 | 0.0 | 0.0 | 58.8 |
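# Clean NaN and other abnormal data
# Remove records with missing values
print("Data shape:", weatherdata.shape)
exp1 = weatherdata.notnull()
#exp2 = weatherdata["rain_duration"].notnull()
#exp = exp1 & exp2
# exp1 is a 2-D boolean DataFrame, not a per-row mask, so this line raises
weatherdata_notnull = weatherdata.loc[exp1,:]
print("Data shape after dropping missing records:", weatherdata_notnull.shape)
weatherdata_notnull.head()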
ValueError: Cannot index with multidimensional key
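The call fails because weatherdata.notnull() returns a two-dimensional boolean DataFrame, whereas .loc expects a one-dimensional row mask. Switch to a different cleaning method: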
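# Clean NaN and other abnormal data
# Remove records with missing values
print("Data shape:", weatherdata.shape)
# Drop every row that contains at least one NaN
weatherdata_notnull = weatherdata.dropna(axis=0, how='any')
print("Data shape after dropping missing records:", weatherdata_notnull.shape)
weatherdata_notnull.head()
Data shape: (1587257, 11)
Data shape after dropping missing records: (1586823, 11)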
| | air_pressure | air_temp | avg_wind_direction | avg_wind_speed | max_wind_direction | max_wind_speed | min_wind_direction | min_wind_speed | rain_accumulation | rain_duration | relative_humidity |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 912.3 | 63.86 | 161.0 | 0.8 | 215.0 | 1.5 | 43.0 | 0.2 | 0.0 | 0.0 | 39.9 |
2 | 912.3 | 64.22 | 77.0 | 0.7 | 143.0 | 1.2 | 324.0 | 0.3 | 0.0 | 0.0 | 43.0 |
3 | 912.3 | 64.40 | 89.0 | 1.2 | 112.0 | 1.6 | 12.0 | 0.7 | 0.0 | 0.0 | 49.5 |
4 | 912.3 | 64.40 | 185.0 | 0.4 | 260.0 | 1.0 | 100.0 | 0.1 | 0.0 | 0.0 | 58.8 |
5 | 912.3 | 63.50 | 76.0 | 2.5 | 92.0 | 3.0 | 61.0 | 2.0 | 0.0 | 0.0 | 62.6 |
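For comparison, the earlier mask-based attempt can be made to work by reducing the two-dimensional mask to one boolean per row. The sketch below uses a toy frame (`toy`, a hypothetical name) rather than the real data, and shows that the row mask selects the same rows as the dropna call used above.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for weatherdata; column names follow the dataset.
toy = pd.DataFrame({
    "rain_accumulation": [np.nan, 0.0, 0.0],
    "rain_duration": [np.nan, 0.0, 0.0],
    "relative_humidity": [60.5, 39.9, 43.0],
})

# notnull() yields a 2-D mask; all(axis=1) collapses it to one flag per row.
row_is_complete = toy.notnull().all(axis=1)
by_mask = toy.loc[row_is_complete, :]

# Same selection expressed with dropna.
by_dropna = toy.dropna(axis=0, how="any")
print(by_mask.equals(by_dropna))
```

The mask form is mainly useful when the row condition needs to be combined with other conditions; for plain NaN removal, dropna is shorter.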
# The full dataset takes too long to process, so only part of it is used
# Take the first 1000 rows
weatherdata_ture = weatherdata_notnull.iloc[:1000,:]
print("Shape of the data used:", weatherdata_ture.shape)
Shape of the data used: (1000, 11)
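Taking the first 1000 rows keeps only consecutive readings from the very start of the record, which may not represent the rest of the data. A possible alternative, shown here as a sketch on a toy frame and not as what this article does, is to draw the rows at random with `DataFrame.sample`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for weatherdata_notnull.
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    rng.normal(size=(5000, 3)),
    columns=["air_pressure", "air_temp", "relative_humidity"],
)

# Draw 1000 rows uniformly at random; random_state makes the draw repeatable.
subset = toy.sample(n=1000, random_state=0)
print("subset shape:", subset.shape)
```

A random subset spreads the rows across the whole time range, at the cost of losing the time ordering of the original rows.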
k-means clustering with sklearn
X = weatherdata_ture.values
model = KMeans(n_clusters=12)  # build the clusterer with 12 clusters
model.fit(X)  # fit the clusterer to the data
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
n_clusters=12, n_init=10, n_jobs=None, precompute_distances='auto',
random_state=None, tol=0.0001, verbose=0)
label_pred = model.labels_
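The columns in the preview have very different ranges (air pressure near 912, wind direction up to several hundred degrees, wind speed around 1), so Euclidean distances on the raw values are dominated by the wide-ranging columns. The sketch below is a variation and not part of the original experiment: it standardizes synthetic stand-in data (`X_demo`, a hypothetical name) before clustering and fixes `random_state` so repeated runs give the same labels.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X: three columns with very different ranges,
# loosely mimicking pressure, wind direction and wind speed.
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.normal(912.0, 1.0, 300),
    rng.uniform(0.0, 360.0, 300),
    rng.uniform(0.0, 3.0, 300),
])

# Rescale each column to zero mean and unit variance before clustering.
X_scaled = StandardScaler().fit_transform(X_demo)

# n_init and random_state are set explicitly so the run is repeatable.
model_demo = KMeans(n_clusters=12, n_init=10, random_state=0).fit(X_scaled)
print(np.bincount(model_demo.labels_, minlength=12))
```

Scaling changes which points count as close, so the resulting clusters can differ from those obtained on the raw values; whether that is desirable depends on how much weight each feature should carry.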
Visualization
# Plot the k-means results
x0 = X[label_pred <