K-means in Practice: Classifying Weather Data with a Custom Algorithm

逻辑howe

Article link: https://blog.youkuaiyun.com/weixin_42049458/article/details/107162179

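This article applies sklearn's KMeans to the unlabeled weather dataset minute_weather, and also tries a hand-written K-means implementation on the same data, in order to compare the two and check whether the custom algorithm is usable.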

The minute_weather dataset:
Link: https://pan.baidu.com/s/1Ko6YK2xJNiDRsq2befcYsQ
Extraction code: wwww

Notes

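The minute_weather dataset used in this experiment is not "clean", so it was cleaned first. The "junk" data includes null values and columns or rows that consist entirely of 0. Because the full dataset is too large to process in reasonable time, only 1000 rows are used in the experiment.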
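The excerpt below only shows the NaN-removal step, so the following is a rough sketch of what the full cleaning described above might look like. The helper name `clean_weather` and the tiny demo frame are hypothetical, and the zero-filtering rules are an assumption based on the description, not the author's actual code.

```python
import numpy as np
import pandas as pd

def clean_weather(df: pd.DataFrame, n_rows: int = 1000) -> pd.DataFrame:
    """Hypothetical helper: drop NaN rows, all-zero columns and rows, keep the first n_rows."""
    # Drop every row that contains at least one NaN
    df = df.dropna(axis=0, how="any")
    # Keep only columns that have at least one non-zero value
    df = df.loc[:, (df != 0).any(axis=0)]
    # Keep only rows that have at least one non-zero value
    df = df.loc[(df != 0).any(axis=1)]
    # Take the first n_rows rows to keep the experiment small
    return df.iloc[:n_rows]

# Tiny made-up frame, only to exercise the helper
demo = pd.DataFrame({
    "air_temp": [64.76, 63.86, 0.0, 64.40],
    "rain_duration": [np.nan, 0.0, 0.0, 0.0],
    "relative_humidity": [60.5, 39.9, 0.0, 49.5],
})
cleaned = clean_weather(demo)
print(cleaned)
```

In this demo the first row is dropped for its NaN, the `rain_duration` column is dropped because the remaining values are all 0, and the all-zero row is dropped last.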

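The clustering results show that the hand-written K-means algorithm and sklearn's built-in K-means produce very similar clusters. In the visualization the clusters are not clearly separated, with no obvious gaps between them. This also depends on how the data is plotted: using two other features as [x, y] might show the clusters more clearly.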
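The author's own K-means code is not visible in this excerpt. As a reference point, here is a minimal sketch of the standard Lloyd iteration in NumPy. The function name `kmeans`, its parameters, the random initialization, and the synthetic demo data are all assumptions chosen for illustration.

```python
import numpy as np

def assign(X, centers):
    """Return the index of the nearest center for each row of X."""
    # Squared Euclidean distances, shape (n_samples, k)
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def kmeans(X, k, max_iter=300, tol=1e-4, seed=0):
    """Minimal Lloyd's algorithm sketch; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    # Initialize centers with k distinct rows picked at random
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        labels = assign(X, centers)
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old center
                new_centers[j] = members.mean(axis=0)
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:  # stop once the centers barely move
            break
    # Final assignment against the final centers
    return assign(X, centers), centers

# Two well-separated synthetic blobs, only to exercise the function
rng = np.random.default_rng(42)
X_demo = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(50, 2)),
    rng.normal(loc=(10, 10), scale=0.3, size=(50, 2)),
])
labels, centers = kmeans(X_demo, k=2)
print(np.bincount(labels))
```

Unlike sklearn's default `k-means++` initialization with several restarts, this sketch uses a single random start, so on harder data it is more likely to stop in a poor local optimum.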

Implementation

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

path = r'D:\myData\code\python\data\minute_weather.csv'  # raw string, so backslashes are not read as escapes
pdData = pd.read_csv(path)
pdData.head()

	rowID	hpwren_timestamp	air_pressure	air_temp	avg_wind_direction	avg_wind_speed	max_wind_direction	max_wind_speed	min_wind_direction	min_wind_speed	rain_accumulation	rain_duration	relative_humidity
0	0	2011-09-10 00:00:49	912.3	64.76	97.0	1.2	106.0	1.6	85.0	1.0	NaN	NaN	60.5
1	1	2011-09-10 00:01:49	912.3	63.86	161.0	0.8	215.0	1.5	43.0	0.2	0.0	0.0	39.9
2	2	2011-09-10 00:02:49	912.3	64.22	77.0	0.7	143.0	1.2	324.0	0.3	0.0	0.0	43.0
3	3	2011-09-10 00:03:49	912.3	64.40	89.0	1.2	112.0	1.6	12.0	0.7	0.0	0.0	49.5
4	4	2011-09-10 00:04:49	912.3	64.40	185.0	0.4	260.0	1.0	100.0	0.1	0.0	0.0	58.8

Data preprocessing

pdData.iloc[:,2:].head()

	air_pressure	air_temp	avg_wind_direction	avg_wind_speed	max_wind_direction	max_wind_speed	min_wind_direction	min_wind_speed	rain_accumulation	rain_duration	relative_humidity
0	912.3	64.76	97.0	1.2	106.0	1.6	85.0	1.0	NaN	NaN	60.5
1	912.3	63.86	161.0	0.8	215.0	1.5	43.0	0.2	0.0	0.0	39.9
2	912.3	64.22	77.0	0.7	143.0	1.2	324.0	0.3	0.0	0.0	43.0
3	912.3	64.40	89.0	1.2	112.0	1.6	12.0	0.7	0.0	0.0	49.5
4	912.3	64.40	185.0	0.4	260.0	1.0	100.0	0.1	0.0	0.0	58.8

weatherdata = pdData.iloc[:, 2:]
weatherdata.head()

	air_pressure	air_temp	avg_wind_direction	avg_wind_speed	max_wind_direction	max_wind_speed	min_wind_direction	min_wind_speed	rain_accumulation	rain_duration	relative_humidity
0	912.3	64.76	97.0	1.2	106.0	1.6	85.0	1.0	NaN	NaN	60.5
1	912.3	63.86	161.0	0.8	215.0	1.5	43.0	0.2	0.0	0.0	39.9
2	912.3	64.22	77.0	0.7	143.0	1.2	324.0	0.3	0.0	0.0	43.0
3	912.3	64.40	89.0	1.2	112.0	1.6	12.0	0.7	0.0	0.0	49.5
4	912.3	64.40	185.0	0.4	260.0	1.0	100.0	0.1	0.0	0.0	58.8

# Clean abnormal data such as NaN
# Remove records with missing values
print("Data shape:", weatherdata.shape)
exp1 = weatherdata.notnull()
#exp2 = weatherdata["rain_duration"].notnull()
#exp = exp1 & exp2
weatherdata_notnull = weatherdata.loc[exp1,:]
print("Data shape after dropping records with missing values:", weatherdata_notnull.shape)
weatherdata_notnull.head()

ValueError: Cannot index with multidimensional key
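The error comes from passing a two-dimensional boolean DataFrame as the row indexer of `.loc`, which expects a one-dimensional mask there. A small sketch of how the mask-based attempt could be made to work by reducing the mask to one boolean per row (the demo frame is made up for illustration):

```python
import numpy as np
import pandas as pd

# Tiny made-up frame with one NaN
demo = pd.DataFrame({
    "air_temp": [64.76, 63.86, 64.22],
    "rain_duration": [np.nan, 0.0, 0.0],
})

mask_2d = demo.notnull()        # boolean DataFrame, same shape as demo
row_mask = mask_2d.all(axis=1)  # boolean Series: True where the row has no NaN

kept = demo.loc[row_mask, :]    # a 1-D mask is a valid row indexer
print(kept.shape)
```

On this frame the result matches `demo.dropna(axis=0, how="any")`.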

Switching to a different cleaning method:

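# Clean abnormal data such as NaN
# Remove records with missing values
print("Data shape:", weatherdata.shape)
# Drop every row that contains any NaN
weatherdata_notnull = weatherdata.dropna(axis=0, how='any')
print("Data shape after dropping records with missing values:", weatherdata_notnull.shape)
weatherdata_notnull.head()

Data shape: (1587257, 11)
Data shape after dropping records with missing values: (1586823, 11)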

	air_pressure	air_temp	avg_wind_direction	avg_wind_speed	max_wind_direction	max_wind_speed	min_wind_direction	min_wind_speed	relative_humidity
1	912.3	63.86	161.0	0.8	215.0	1.5	43.0	0.2	39.9
2	912.3	64.22	77.0	0.7	143.0	1.2	324.0	0.3	43.0
3	912.3	64.40	89.0	1.2	112.0	1.6	12.0	0.7	49.5
4	912.3	64.40	185.0	0.4	260.0	1.0	100.0	0.1	58.8
5	912.3	63.50	76.0	2.5	92.0	3.0	61.0	2.0	62.6

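# The dataset is too large and takes too long to process, so we use only part of it

# Take the first 1000 rows
weatherdata_ture = weatherdata_notnull.iloc[:1000,:]
print("Shape of the data used:", weatherdata_ture.shape)

Shape of the data used: (1000, 11)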

K-means clustering with sklearn

X = weatherdata_ture.values
model = KMeans(n_clusters=12)  # build the clusterer with 12 clusters
model.fit(X)  # run the clustering

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=12, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)

label_pred = model.labels_
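`labels_` holds one cluster index per input row. A quick way to sanity-check a fitted model is to look at the cluster sizes, the centers, and the inertia. The sketch below uses synthetic data and a fixed `random_state` so that it is reproducible; the variable names prefixed with `demo_` are illustrative and not part of the original code.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two synthetic blobs of different sizes, only for illustration
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(40, 2)),
    rng.normal(loc=(8, 8), scale=0.5, size=(60, 2)),
])

# n_init is set explicitly because its default differs between sklearn versions
demo_model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_demo)
demo_labels = demo_model.labels_

sizes = np.bincount(demo_labels, minlength=2)  # number of rows per cluster
print(sizes)
print(demo_model.cluster_centers_)             # one row per cluster center
print(demo_model.inertia_)                     # sum of squared distances to the nearest center
```

The same three checks apply to `model` and `label_pred` above, with `minlength=12`.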

Visualization

# Plot the k-means result
x0 = X[label_pred <
