Market Segmentation with Cluster Analysis in Python

This article works through K-means and hierarchical clustering on customer satisfaction and loyalty data, compares the strengths and weaknesses of the two methods, visualizes the clustering results clearly, and shows how to use the Elbow method to choose the number of clusters.


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.cluster import KMeans

data = pd.read_csv('3.12. Example.csv')
print(data)


# Plot the data
plt.scatter(data['Satisfaction'], data['Loyalty'])
plt.xlabel('Satisfaction')
plt.ylabel('Loyalty')
plt.show()


# Select the features
x = data.copy()

# Clustering
kmeans = KMeans(2)
kmeans.fit(x)


# Clustering results
clusters = x.copy()
clusters['cluster_pred'] = kmeans.fit_predict(x)

plt.scatter(clusters['Satisfaction'], clusters['Loyalty'], c=clusters['cluster_pred'], cmap='rainbow')
plt.xlabel('Satisfaction')
plt.ylabel('Loyalty')
plt.show()


# Standardize the variables
from sklearn import preprocessing

x_scaled = preprocessing.scale(x)
print(x_scaled) # both features are now standardized; 'Loyalty' was already roughly on a standardized scale, so its values barely change


# Take advantage of the Elbow method
wcss = []
for i in range(1, 10):
    kmeans = KMeans(i)
    kmeans.fit(x_scaled)
    wcss.append(kmeans.inertia_)

print(wcss)


plt.plot(range(1, 10), wcss)
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
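Reading the elbow off the plot is a judgment call; the bend can also be estimated numerically as the point where the WCSS curve flattens fastest. A minimal sketch of this idea, using synthetic `make_blobs` data as a stand-in (the CSV above is not reproduced here), with the second difference of WCSS as a rough curvature heuristic:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data: three well-separated blobs.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

wcss = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X)
    wcss.append(km.inertia_)

# The elbow is roughly where the curve bends most sharply, i.e. where the
# discrete second difference of WCSS is largest. Index i of the second
# difference corresponds to k = i + 2.
second_diff = np.diff(wcss, 2)
best_k = int(np.argmax(second_diff)) + 2
print('suggested k:', best_k)
```

This is only a heuristic suggestion; the plot should still be inspected, since flat or noisy WCSS curves can mislead the second-difference rule.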


# Explore clustering solutions and select the number of clusters
kmeans_new = KMeans(2)
clusters_new = x.copy()
clusters_new['cluster_pred'] = kmeans_new.fit_predict(x_scaled)  # fit and predict in one step
print(clusters_new)


plt.scatter(clusters_new['Satisfaction'], clusters_new['Loyalty'], c=clusters_new['cluster_pred'], cmap='rainbow')
plt.xlabel('Satisfaction')
plt.ylabel('Loyalty')
plt.show()
"""
We often choose to plot using the original values for clearer interpretability. Note: the discrepancy we observe here depends on the range of the axes, too. 
"""


# Explore clustering solutions and select the number of clusters
kmeans_new = KMeans(4)
clusters_new = x.copy()
clusters_new['cluster_pred'] = kmeans_new.fit_predict(x_scaled)  # fit and predict in one step
print(clusters_new)

plt.scatter(clusters_new['Satisfaction'], clusters_new['Loyalty'], c=clusters_new['cluster_pred'], cmap='rainbow')
plt.xlabel('Satisfaction')
plt.ylabel('Loyalty')
plt.show()
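Once a four-cluster solution is chosen, the segments are easier to interpret by inspecting per-cluster averages of the original features (e.g. high satisfaction/high loyalty vs. low/low). A minimal sketch, using a small hypothetical DataFrame in place of '3.12. Example.csv', which is not reproduced here:

```python
import pandas as pd
from sklearn import preprocessing
from sklearn.cluster import KMeans

# Hypothetical stand-in for '3.12. Example.csv' (not the original data).
data = pd.DataFrame({
    'Satisfaction': [3, 4, 9, 10, 2, 3, 9, 10],
    'Loyalty':      [-1.5, -1.2, 1.1, 1.4, 1.3, 1.0, -1.0, -1.3],
})

# Cluster on the standardized features, as in the example above.
x_scaled = preprocessing.scale(data)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
labels = kmeans.fit_predict(x_scaled)

# Average Satisfaction/Loyalty per cluster: a quick profile that makes the
# segments nameable when reading the scatter plot.
profile = data.assign(cluster=labels).groupby('cluster').mean()
print(profile)
```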


  1. Types of analysis:
    [1] Exploratory
    — Get acquainted with the data
    — Search for patterns
    — Plan
    [2] Confirmatory
    [3] Explanatory

  2. There are two types of clustering: Flat and Hierarchical

  3. K-means is a flat method in the sense that there is no hierarchy: we choose the number of clusters and the algorithm does the rest. The other type is hierarchical.

  4. There are two types of hierarchical clustering: agglomerative (bottom-up) and divisive (top-down).

  5. With k-means we can simulate the divisive (top-down) technique, and that is what we did with the Elbow method.

  6. Agglomerative and divisive clustering should reach similar results, but agglomerative clustering is much easier to solve mathematically.

  7. Dendrogram: this solution was produced on the same dataset, based on Longitude and Latitude, with standardized variables. Incidentally, standardization did not make a difference in this case.

  8. The bigger the distance between two links, the bigger the difference in terms of the features.

  9. The pros of the Dendrogram:
    [1] Hierarchical clustering shows all the possible linkages between clusters.
    [2] We understand the data much better.
    [3] There is no need to preset the number of clusters (as with k-means).
    [4] There are many methods to perform hierarchical clustering (e.g. the Ward method).

  10. The cons of the Dendrogram:
    [1] Scalability is one of the main reasons hierarchical clustering is far from perfect.
    [2] It is extremely computationally expensive.
    [3] The more observations there are, the slower it gets.
    [4] K-means hardly has these issues.
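The notes above can be illustrated with scipy: agglomerative (bottom-up) clustering with the Ward method builds the full merge hierarchy first, and a flat clustering is cut from the tree afterwards, so the number of clusters is not preset. A minimal sketch on synthetic `make_blobs` data (the country dataset below is not reproduced here):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

# Synthetic stand-in data: three well-separated blobs of 2-D points.
X, _ = make_blobs(n_samples=30, centers=3, cluster_std=0.5, random_state=0)

# Agglomerative clustering with the Ward method: every point starts as its
# own cluster, and at each step the merge that increases within-cluster
# variance the least is performed. Z encodes the full merge tree.
Z = linkage(X, method='ward')

# Cut the hierarchy into 3 flat clusters; note that k is chosen *after* the
# whole tree is built, unlike with k-means.
labels = fcluster(Z, t=3, criterion='maxclust')
print(np.unique(labels))

# scipy.cluster.hierarchy.dendrogram(Z) would draw the merge tree; the
# height of each link reflects how different the merged clusters are.
```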

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

data = pd.read_csv('Country clusters standardized.csv', index_col='Country')

x_scaled = data.copy()
x_scaled = x_scaled.drop(['Language'], axis = 1)
print(x_scaled)


sns.clustermap(x_scaled, cmap='mako')

