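This post clusters data that mixes continuous and categorical variables, using two methods: k-prototypes, and Gower distance combined with k-means. Both methods are written directly in Python rather than taken from a ready-made package.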
import pandas as pd
import numpy as np
import warnings
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn import metrics
import seaborn as sns
plt.rcParams['font.sans-serif'] = ['SimHei'] # allow Chinese characters in plots
plt.rcParams['axes.unicode_minus']=False # render the minus sign correctly in plots
warnings.filterwarnings("ignore") # suppress warnings
heart = pd.read_csv('C:\\Users\\91333\\Documents\\semester6\\data science\\6.聚类分析\\heart.dat', header = None, sep = ' ')
Visualization
sns.stripplot(x=heart.iloc[:,13],y=heart.iloc[:,3],hue=heart.iloc[:,1])
sns.boxplot(x=heart.iloc[:,2],y=heart.iloc[:,4],hue=heart.iloc[:,13])
sns.violinplot(x=heart.iloc[:,5],y=heart.iloc[:,7],hue=heart.iloc[:,13])
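# Standardize the continuous variables to reduce the effect of differing scales on the clustering.
# The eleventh variable is ordinal (its values have a natural order), so I treat it as continuous here.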
Numerical = [0,3,4,7,9,10,11]
Type = [1,2,5,6,8,12,13]
heart_norm0 = heart.iloc[:, Numerical].apply(lambda x: (x - np.mean(x)) / (np.std(x)))
heart_norm = heart.copy() # copy, so that the raw data in heart is not overwritten
heart_norm.iloc[:,Numerical] = heart_norm0
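As a quick sanity check on the standardization step, here is a minimal sketch on a toy DataFrame. The column names and values are made up for illustration and are not taken from heart.dat. It applies the same z-score formula using the population standard deviation (ddof=0, the default of np.std), then confirms that the numeric columns end up with mean 0 and standard deviation 1 while the categorical column is left alone.

```python
import numpy as np
import pandas as pd

# Toy data: hypothetical values, not taken from heart.dat
toy = pd.DataFrame({
    "age": [40.0, 50.0, 60.0, 70.0],
    "sex": [0, 1, 1, 0],
    "chol": [200.0, 240.0, 180.0, 300.0],
})
numeric_cols = ["age", "chol"]

# Work on a copy so the raw values stay available
toy_norm = toy.copy()
toy_norm[numeric_cols] = toy[numeric_cols].apply(
    lambda col: (col - col.mean()) / col.std(ddof=0)
)

# After standardization each numeric column has mean 0 and std 1
col_means = toy_norm[numeric_cols].mean().to_numpy()
col_stds = toy_norm[numeric_cols].std(ddof=0).to_numpy()
print(col_means, col_stds)
```

Working on a copy keeps the original values available for plotting or for interpreting the clusters afterwards.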
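Method 1: clustering with k-prototypes, a variant of k-means
The k-prototypes code below comes from someone else's work that I found online.
Original link: https://blog.youkuaiyun.com/littlely_ll/article/details/80042928?utm_medium=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-1.nonecase&depth_1-utm_source=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-1.nonecase
I think I modified it slightly.
By the way, "prototype" is a vocabulary word from the postgraduate entrance exam; it means 原型 (an original model).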
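In the standard formulation of k-prototypes, a cluster's prototype combines the mean of its numeric columns with the most frequent value (mode) of each categorical column. The sketch below illustrates that update step only. The helper name update_prototype and the toy rows are hypothetical, and this is not necessarily how the code from the linked post does it.

```python
from collections import Counter

import numpy as np


def update_prototype(num_rows, cat_rows):
    """Hypothetical helper: mean of numeric columns, mode of categorical columns."""
    num_rows = np.asarray(num_rows, dtype=float)
    cat_rows = np.asarray(cat_rows)
    # Numeric part: column-wise mean, exactly as in k-means
    num_proto = num_rows.mean(axis=0)
    # Categorical part: most frequent value in each column
    cat_proto = [
        Counter(cat_rows[:, j]).most_common(1)[0][0]
        for j in range(cat_rows.shape[1])
    ]
    return num_proto, cat_proto


# Three toy members of one cluster (made-up values)
num_proto, cat_proto = update_prototype(
    [[1.0, 10.0], [3.0, 20.0], [5.0, 30.0]],
    [["m", "yes"], ["f", "yes"], ["m", "no"]],
)
print(num_proto, cat_proto)
```

Using the mode for categorical columns avoids averaging category codes, which would have no meaning.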
import numpy as np
import random
from collections import Counter
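def dist(x, y):
    # Euclidean distance between two numeric vectors
    return np.sqrt(sum((x-y)**2))
def sigma(x, y):
    # Number of positions where two categorical vectors differ
    return len(x) - sum(x == y)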
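To show how a numeric distance and a categorical mismatch count are usually combined in k-prototypes, here is a minimal sketch. It assumes the common form "numeric distance + gamma * number of mismatches". The helper name mixed_distance and the sample values are hypothetical. Huang's original formulation uses the squared Euclidean distance, whereas this sketch keeps the square root to match dist above.

```python
import numpy as np


def mixed_distance(x_num, x_cat, proto_num, proto_cat, gamma=1.0):
    """Hypothetical helper: Euclidean part plus gamma-weighted mismatch count."""
    x_num = np.asarray(x_num, dtype=float)
    proto_num = np.asarray(proto_num, dtype=float)
    # Numeric part: Euclidean distance to the prototype
    numeric_part = np.sqrt(np.sum((x_num - proto_num) ** 2))
    # Categorical part: count of attributes that differ from the prototype
    mismatch_part = np.sum(np.asarray(x_cat) != np.asarray(proto_cat))
    return numeric_part + gamma * mismatch_part


# Numeric part is sqrt(4**2 + 3**2) = 5; two of three categorical values differ
d_weighted = mixed_distance([0.0, 3.0], ["a", "x", "1"], [4.0, 0.0], ["a", "y", "0"], gamma=0.5)
d_numeric_only = mixed_distance([0.0, 3.0], ["a", "x", "1"], [4.0, 0.0], ["a", "y", "0"], gamma=0.0)
print(d_weighted, d_numeric_only)  # 6.0 5.0
```

With gamma set to 0 the categorical term drops out entirely, so under this formulation the weight has to be positive for the categorical variables to influence the assignment.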
def KPrototypes(data, O, C, k, max_iters=10, gamma=0):
    data = np.array(data)
    m, n = data.shape
    num = random.sample(range(m), k)
    O_data = data[:, O]
    C_data = data[:, C]
    O_protos = O_data[num, :]
    C_protos = C_data[num, :]
    C_data = C_data.a