本案例由开发者:天津师范大学协同育人项目--翟羽佳提供
1 概述
1.1 案例介绍
随着快消品行业智能化升级,中国生鲜零售市场规模在2023年达到5.2万亿元(数据来源:易观分析),消费者需求呈现场景化、健康化趋势。传统商超面临双重困境:
价值挖掘不足:行业客户终身价值(LTV)开发率仅为42%(麦肯锡2024研究),商品组合缺乏精准匹配;
运营成本高企:某知名生鲜平台单品促销成本占销售额18%,损耗率超行业均值2.3倍。
在此背景下,基于消费行为的客户分群成为战略重点。通过无监督学习构建多维客户画像,已成为零售企业优化库存周转、提升交叉销售的核心杠杆。
通过聚类优化与业务场景融合,项目创造三重价值:
技术创新:应用 t-SNE 可视化高维特征分布,结合DBSCAN算法识别离群噪声点;
业务突破:
• 发现"有机食品忠诚客群"(占比22%,月度复购率78%);
• 识别"价格敏感型家庭"(高客单低频次),设计大包装组合促销策略;
商业回报:预计半年内实现滞销品周转率提升 40%,精准营销 ROI 突破1:6。
1.2 适用对象
- 零售商客户
- 个人开发者
- 高校学生
1.3 案例时间
本案例总时长预计60分钟。
1.4 案例流程
{{{width="50%" height="auto"}}}
说明:
- 登录开发者空间,启动Notebook;
- 在Notebook中编写代码运行调试。
1.5 资源总览
本案例预计花费0元。
| 资源名称 | 规格 | 单价(元) | 时长(分钟) |
|---|---|---|---|
| 开发者空间-Notebook | NPU basic · 1 * NPU 910B · 8v CPU · 24GB euler2.9-py310-torch2.1.0-cann8.0-openmind0.9.1-notebook | 免费 | 60 |
2 资源与环境准备
2.1 启动Notebook
参考“DeepSeek模型API调用及参数调试(开发者空间Notebook版)”案例的第2.2章节启动Notebook。
2.2 安装依赖库
在Notebook的新执行框中输入如下代码并运行,安装所有依赖库。
!pip install numpy
!pip install pandas
!pip install matplotlib
!pip install scikit-learn
!pip install seaborn
!pip install yellowbrick

注意:如果上述方式安装失败,可以使用国内镜像和最新库名进行安装,命令行如下:
!pip install -i https://pypi.tuna.tsinghua.edu.cn/simple numpy pandas matplotlib scikit-learn seaborn yellowbrick
当安装完成后,系统会返回所有已成功安装的库,如下图安装第三方库安装成功,通过pip list进行所有第三方库的检查
pip list

3 代码运行及结果展示
3.1 导入必要的库
numpy (np):用于 n 维数组处理和数值计算的三方库。
pandas(pd):用于数据分析、数据处理的三方库。
matplotlib:Python中的2D绘图库。
scikit-learn:是一个基于Python的开源机器学习库,提供分类、回归、聚类、降维等算法, 并集成数据预处理、模型评估等功能,广泛应用于数据分析和人工智能领域。
seaborn:是基于Python的Matplotlib的数据可视化库。
yellowbrick:是一个用于可视化机器学习模型和评估性能的Python库。
3.2 数据加载与预处理
数据准备:将如下链接数据集下载,并通过notebook上传。
https://case-aac4.obs.cn-north-4.myhuaweicloud.com/1_marketing_campaign.csv
下载到本地的数据:

将文件拖拽到Notebook左侧代码同级目录下,数据上传成功如下图所示:

3.3 运行代码
在Notebook的新执行框中输入如下代码并运行:
#Importing the Libraries
import numpy as np
import pandas as pd
import datetime
import matplotlib
import matplotlib.pyplot as plt
from matplotlib import colors
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from yellowbrick.cluster import KElbowVisualizer
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt, numpy as np
from mpl_toolkits.mplot3d import Axes3D
from sklearn.cluster import AgglomerativeClustering
from matplotlib.colors import ListedColormap
from sklearn import metrics
import warnings
import sys
if not sys.warnoptions:
warnings.simplefilter("ignore")
np.random.seed(42)
#Loading the dataset
data = pd.read_csv("1_marketing_campaign.csv", sep="\t")
print("Number of datapoints:", len(data))
data.head()
#Information on features
data.info()
data = data.dropna()
print("The total number of data-points after removing the rows with missing values are:", len(data))
data["Dt_Customer"] = pd.to_datetime(data["Dt_Customer"], dayfirst=True)
dates = []
for i in data["Dt_Customer"]:
i = i.date()
dates.append(i)
# Dates of the newest and oldest recorded customer
print("The newest customer's enrolment date in the records:", max(dates))
print("The oldest customer's enrolment date in the records:", min(dates))
#Created a feature "Customer_For"
days = []
d1 = max(dates) #taking it to be the newest customer
for i in dates:
delta = d1 - i
days.append(delta)
data["Customer_For"] = days
data["Customer_For"] = pd.to_numeric(data["Customer_For"], errors="coerce")
print("Total categories in the feature Marital_Status:\n", data["Marital_Status"].value_counts(), "\n")
print("Total categories in the feature Education:\n", data["Education"].value_counts())
#Feature Engineering
#Age of customer today
data["Age"] = 2021-data["Year_Birth"]
#Total spendings on various items
data["Spent"] = data["MntWines"]+ data["MntFruits"]+ data["MntMeatProducts"]+ data["MntFishProducts"]+ data["MntSweetProducts"]+ data["MntGoldProds"]
#Deriving living situation by marital status"Alone"
data["Living_With"]=data["Marital_Status"].replace({"Married":"Partner", "Together":"Partner", "Absurd":"Alone", "Widow":"Alone", "YOLO":"Alone", "Divorced":"Alone", "Single":"Alone",})
#Feature indicating total children living in the household
data["Children"]=data["Kidhome"]+data["Teenhome"]
#Feature for total members in the householde
data["Family_Size"] = data["Living_With"].replace({"Alone": 1, "Partner":2})+ data["Children"]
#Feature pertaining parenthood
data["Is_Parent"] = np.where(data.Children> 0, 1, 0)
#Segmenting education levels in three groups
data["Education"]=data["Education"].replace({"Basic":"Undergraduate","2n Cycle":"Undergraduate", "Graduation":"Graduate", "Master":"Postgraduate", "PhD":"Postgraduate"})
#For clarity
data=data.rename(columns={"MntWines": "Wines","MntFruits":"Fruits","MntMeatProducts":"Meat","MntFishProducts":"Fish","MntSweetProducts":"Sweets","MntGoldProds":"Gold"})
#Dropping some of the redundant features
to_drop = ["Marital_Status", "Dt_Customer", "Z_CostContact", "Z_Revenue", "Year_Birth", "ID"]
data = data.drop(to_drop, axis=1)
data.describe()
#To plot some selected features
#Setting up colors prefrences
sns.set(rc={"axes.facecolor":"#FFF9ED","figure.facecolor":"#FFF9ED"})
pallet = ["#682F2F", "#9E726F", "#D6B2B1", "#B9C0C9", "#9F8A78", "#F3AB60"]
cmap = colors.ListedColormap(["#682F2F", "#9E726F", "#D6B2B1", "#B9C0C9", "#9F8A78", "#F3AB60"])
#Plotting following features
To_Plot = [ "Income", "Recency", "Customer_For", "Age", "Spent", "Is_Parent"]
print("Reletive Plot Of Some Selected Features: A Data Subset")
plt.figure()
sns.pairplot(data[To_Plot], hue= "Is_Parent",palette= (["#682F2F","#F3AB60"]))
#Taking hue
plt.show()
#Dropping the outliers by setting a cap on Age and income.
data = data[(data["Age"]<90)]
data = data[(data["Income"]<600000)]
print("The total number of data-points after removing the outliers are:", len(data))
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# 选择数值列
numeric_data = data.select_dtypes(include=[float, int])
# 计算相关性矩阵
corrmat = numeric_data.corr()
# 绘制热力图
plt.figure(figsize=(20, 20))
sns.heatmap(corrmat, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Matrix Heatmap')
plt.show()
#Get list of categorical variables
s = (data.dtypes == 'object')
object_cols = list(s[s].index)
print("Categorical variables in the dataset:", object_cols)
#Label Encoding the object dtypes.
LE=LabelEncoder()
for i in object_cols:
data[i]=data[[i]].apply(LE.fit_transform)
print("All features are now numerical")
#Creating a copy of data
ds = data.copy()
# creating a subset of dataframe by dropping the features on deals accepted and promotions
cols_del = ['AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1','AcceptedCmp2', 'Complain', 'Response']
ds = ds.drop(cols_del, axis=1)
#Scaling
scaler = StandardScaler()
scaler.fit(ds)
scaled_ds = pd.DataFrame(scaler.transform(ds),columns= ds.columns )
print("All features are now scaled")
#Scaled data to be used for reducing the dimensionality
print("Dataframe to be used for further modelling:")
scaled_ds.head()
#Initiating PCA to reduce dimentions aka features to 3
pca = PCA(n_components=3)
pca.fit(scaled_ds)
PCA_ds = pd.DataFrame(pca.transform(scaled_ds), columns=(["col1","col2", "col3"]))
PCA_ds.describe().T
#A 3D Projection Of Data In The Reduced Dimension
x =PCA_ds["col1"]
y =PCA_ds["col2"]
z =PCA_ds["col3"]
#To plot
fig = plt.figure(figsize=(10,8))
ax = fig.add_subplot(111, projection="3d")
ax.scatter(x,y,z, c="maroon", marker="o" )
ax.set_title("A 3D Projection Of Data In The Reduced Dimension")
plt.show()
# Quick examination of elbow method to find numbers of clusters to make.
print('Elbow Method to determine the number of clusters to be formed:')
Elbow_M = KElbowVisualizer(KMeans(), k=10)
Elbow_M.fit(PCA_ds)
Elbow_M.show()
#Initiating the Agglomerative Clustering model
AC = AgglomerativeClustering(n_clusters=4)
# fit model and predict clusters
yhat_AC = AC.fit_predict(PCA_ds)
PCA_ds["Clusters"] = yhat_AC
#Adding the Clusters feature to the orignal dataframe.
data["Clusters"]= yhat_AC
#Plotting the clusters
fig = plt.figure(figsize=(10,8))
ax = plt.subplot(111, projection='3d', label="bla")
ax.scatter(x, y, z, s=40, c=PCA_ds["Clusters"], marker='o', cmap = cmap )
ax.set_title("The Plot Of The Clusters")
plt.show()
#Plotting countplot of clusters
pal = ["#682F2F","#B9C0C9", "#9F8A78","#F3AB60"]
pl = sns.countplot(x=data["Clusters"], palette= pal)
pl.set_title("Distribution Of The Clusters")
plt.show()
pl = sns.scatterplot(data = data,x=data["Spent"], y=data["Income"],hue=data["Clusters"], palette= pal)
pl.set_title("Cluster's Profile Based On Income And Spending")
plt.legend()
plt.show()
plt.figure()
pl=sns.swarmplot(x=data["Clusters"], y=data["Spent"], color= "#CBEDDD", alpha=0.5 )
pl=sns.boxenplot(x=data["Clusters"], y=data["Spent"], palette=pal)
plt.show()
#Creating a feature to get a sum of accepted promotions
data["Total_Promos"] = data["AcceptedCmp1"]+ data["AcceptedCmp2"]+ data["AcceptedCmp3"]+ data["AcceptedCmp4"]+ data["AcceptedCmp5"]
#Plotting count of total campaign accepted.
plt.figure()
pl = sns.countplot(x=data["Total_Promos"],hue=data["Clusters"], palette= pal)
pl.set_title("Count Of Promotion Accepted")
pl.set_xlabel("Number Of Total Accepted Promotions")
plt.show()
#Plotting the number of deals purchased
plt.figure()
pl=sns.boxenplot(y=data["NumDealsPurchases"],x=data["Clusters"], palette= pal)
pl.set_title("Number of Deals Purchased")
plt.show()
Personal = [ "Kidhome","Teenhome","Customer_For", "Age", "Children", "Family_Size", "Is_Parent", "Education","Living_With"]
for i in Personal:
plt.figure()
sns.jointplot(x=data[i], y=data["Spent"], hue =data["Clusters"], kind="kde", palette=pal)
plt.show()
注意:代码中严格遵守缩进。

3.4 运行结果
数据结果:


图示结果:










6万+

被折叠的 条评论
为什么被折叠?



