[Co-created Case] Optimizing Diamond E-commerce Pricing Strategy with Machine Learning: Data-Driven Price Prediction

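This case was contributed by developer Zhai Yujia (翟羽佳) of the Tianjin Normal University Collaborative Education Project.

1 Overview

1.1 Case Introduction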

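[...] with a compound annual growth rate of 12.4%. Diamonds, however, are a typical high-ticket, non-standardized category whose prices depend on many factors (such as the 4C standard and market supply and demand). Traditional pricing relies on manual experience, which makes it subjective, slow to respond, and poor at controlling profit margins. For e-commerce platforms, user behavior data (such as browsing, price comparison, and purchase decision cycles) is key to optimizing operations. McKinsey research indicates that data-driven dynamic pricing can raise company profits by 5-15%, but most small and mid-sized e-commerce businesses lack the technical capability and face pain points such as low user conversion rates and wasted promotional resources.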

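This case uses data science to tie machine learning techniques closely to business goals, specifically in the following areas:

Clustering and user tagging: a clustering algorithm segments users and builds a user tag system, helping the business pinpoint high-value user groups.

Regression models and price prediction: regression models such as XGBoost and random forest predict diamond prices accurately, helping the business optimize pricing strategy and inventory management.

Data-driven marketing optimization: personalized marketing strategies built on the user segments and price predictions improve ad targeting and user conversion, ultimately raising ROI.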

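1.2 Intended Audience

Data science and machine learning practitioners
E-commerce professionals
Business analysts and strategists

1.3 Case Duration

This case is expected to take about 60 minutes in total.

1.4 Case Workflow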


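Notes:

Log in to the Developer Space and start the Notebook;
Write, run, and debug the code in the Notebook.

1.5 Resource Overview

This case is expected to cost 0 yuan.

Resource	Specification	Unit price (yuan)	Duration (minutes)
Developer Space - Notebook	NPU basic · 1 * NPU 910B · 8v CPU · 24GB euler2.9-py310-torch2.1.0-cann8.0-openmind0.9.1-notebook	Free	60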

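2 Resources and Environment Setup

2.1 Start the Notebook

To start the Notebook, follow section 2.2 of the case "DeepSeek模型API调用及参数调试（开发者空间Notebook版）" (DeepSeek Model API Calls and Parameter Tuning, Developer Space Notebook Edition).

2.2 Install Dependencies

Open a terminal:

Install the third-party libraries used in the code: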

pip install numpy 
pip install pandas 
pip install matplotlib 
pip install scikit-learn 
pip install seaborn 
pip install xgboost

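Run the commands; the result is as follows:

Note: if installation fails, you can install from a mirror in mainland China (the Tsinghua PyPI mirror), using the current package names: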

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple numpy 
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pandas 
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple matplotlib 
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple scikit-learn
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple seaborn
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple xgboost

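When installation finishes, pip reports the libraries that were installed successfully, as shown in the figure below. Use pip list to view all installed third-party libraries: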

pip list
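
As a complement to `pip list`, the installed versions can also be checked from Python. The following is a minimal sketch using the standard-library `importlib.metadata`; the helper name `installed_versions` is hypothetical, and the package list simply mirrors the install commands above.

```python
from importlib import metadata

# Distribution names as passed to pip in the install step.
REQUIRED = ["numpy", "pandas", "matplotlib", "scikit-learn", "seaborn", "xgboost"]


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(REQUIRED).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any package reported as NOT INSTALLED can be installed again with one of the pip commands above.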

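3 Running the Code and Results

3.1 Import the Required Libraries

numpy: a third-party library for n-dimensional array processing and numerical computation.

pandas: a third-party library for data analysis and data processing.

matplotlib: a 2D plotting library for Python.

scikit-learn: an open-source machine learning library for Python that provides classification, regression, clustering, and dimensionality-reduction algorithms together with data preprocessing and model evaluation tools. It is widely used in data analysis and artificial intelligence.

seaborn: a Python data visualization library built on Matplotlib.

xgboost: a library implementing gradient-boosted decision trees (Gradient Boosting), widely used for classification and regression on structured data.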

3.2 Data Loading and Preprocessing

Data preparation: download the dataset from the link below and upload it through the Notebook.

https://case-aac4.obs.cn-north-4.myhuaweicloud.com/diamonds.csv

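The data downloaded to the local machine:

Drag the file into the left-hand pane of the Notebook, into the same directory as the code. A successful upload looks like the figure below:

Sample of the dataset: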
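
Before running the full script, it can help to confirm that the uploaded file has the columns the script expects. The following is a minimal sketch, not part of the original case: `load_diamonds` is a hypothetical helper, the required columns are those the main script refers to by name, and the inline sample row is made up so the sketch runs without the real file.

```python
import io

import pandas as pd

# Columns that the main script refers to by name.
REQUIRED_COLUMNS = {"cut", "color", "clarity", "depth", "table", "price", "x", "y", "z"}


def load_diamonds(source):
    """Read the CSV, drop the leftover index column if present, and check the columns."""
    df = pd.read_csv(source)
    # pandas names a first column with an empty header "Unnamed: 0".
    df = df.drop(columns=["Unnamed: 0"], errors="ignore")
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return df


# Illustrative in-memory sample (made-up row), used only to exercise the helper.
sample_csv = io.StringIO(
    ",carat,cut,color,clarity,depth,table,price,x,y,z\n"
    "1,0.3,Ideal,E,SI2,61.5,55.0,400,4.0,4.0,2.5\n"
)
sample = load_diamonds(sample_csv)
print(sample.shape)
print(sample.dtypes)
```

In the Notebook, `load_diamonds("diamonds.csv")` would read the uploaded file in the same way.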

3.3 Run the Code and View the Results

Enter the following code in a new Notebook cell and run it:

# Import the required Python libraries
# NumPy: efficient numerical computing with multi-dimensional arrays
import numpy as np
# pandas: data processing and analysis with DataFrame and Series structures
import pandas as pd
# Seaborn: high-level statistical plotting library built on Matplotlib
import seaborn as sns
# Matplotlib: Python's base plotting library for static, animated, and interactive figures
import matplotlib as mpl
# matplotlib.pyplot: MATLAB-like plotting interface for creating charts quickly
import matplotlib.pyplot as plt
# matplotlib.pylab: combined interface to NumPy and Matplotlib
import matplotlib.pylab as pylab
# OneHotEncoder: converts categorical variables to one-hot encoding
from sklearn.preprocessing import OneHotEncoder
# LabelEncoder: converts categorical labels to integer labels
from sklearn.preprocessing import LabelEncoder
# train_test_split: splits a dataset into training and test sets
from sklearn.model_selection import train_test_split
# StandardScaler: standardizes features (mean 0, standard deviation 1)
from sklearn.preprocessing import StandardScaler
# PCA: principal component analysis for dimensionality reduction and feature extraction
from sklearn.decomposition import PCA
# Pipeline: chains preprocessing and model steps into one estimator
from sklearn.pipeline import Pipeline
# DecisionTreeRegressor: decision tree for regression
from sklearn.tree import DecisionTreeRegressor
# RandomForestRegressor: random forest for regression
from sklearn.ensemble import RandomForestRegressor
# LinearRegression: linear regression model
from sklearn.linear_model import LinearRegression
# XGBRegressor: XGBoost regressor, an efficient gradient-boosting model
from xgboost import XGBRegressor
# KNeighborsRegressor: k-nearest-neighbors regressor
from sklearn.neighbors import KNeighborsRegressor
# cross_val_score: evaluates a model with cross-validation
from sklearn.model_selection import cross_val_score
# mean_squared_error: mean squared error for evaluating regression models
from sklearn.metrics import mean_squared_error
# metrics: collection of evaluation metrics
from sklearn import metrics

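if __name__ == '__main__':
    # Read the CSV file
    data = pd.read_csv("diamonds.csv")
    # Keep only the first 1,000 rows
    data = data.head(1000)
    # The first column is just an index, so drop it
    data = data.drop(["Unnamed: 0"], axis=1)
    # Descriptive statistics (wrap in print() to display them inside this block)
    data.describe()
    # Drop diamond records with a zero dimension
    data = data.drop(data[data["x"]==0].index)
    data = data.drop(data[data["y"]==0].index)
    data = data.drop(data[data["z"]==0].index)
    # Check the shape again to confirm the rows were dropped
    data.shape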
    # Color palette
    shade = ["#835656", "#baa0a0", "#ffc7c8", "#a9a799", "#65634a"]
    # Pairwise relationship plot, colored by "cut"
    ax = sns.pairplot(data, hue= "cut",palette=shade)
    # Regression line for price vs. the 'y' feature
    # (each regplot gets its own figure so the plots do not overlap)
    plt.figure(figsize=(12,8))
    ax = sns.regplot(x="price", y="y", data=data, fit_reg=True, scatter_kws={"color": "#a9a799"}, line_kws={"color": "#835656"})
    ax.set_title("Regression Line on Price vs 'y'", color="#4e4c39")
    # Regression line for price vs. the 'z' feature
    plt.figure(figsize=(12,8))
    ax= sns.regplot(x="price", y="z", data=data, fit_reg=True, scatter_kws={"color": "#a9a799"}, line_kws={"color": "#835656"})
    ax.set_title("Regression Line on Price vs 'z'", color="#4e4c39")
    # Regression line for price vs. the depth feature
    plt.figure(figsize=(12,8))
    ax= sns.regplot(x="price", y="depth", data=data, fit_reg=True, scatter_kws={"color": "#a9a799"}, line_kws={"color": "#835656"})
    ax.set_title("Regression Line on Price vs Depth", color="#4e4c39")
    # Regression line for price vs. the table feature
    plt.figure(figsize=(12,8))
    ax=sns.regplot(x="price", y="table", data=data, fit_reg=True, scatter_kws={"color": "#a9a799"}, line_kws={"color": "#835656"})
    ax.set_title("Regression Line on Price vs Table", color="#4e4c39")
    # Remove outliers
    data = data[(data["depth"]<75)&(data["depth"]>45)]
    data = data[(data["table"]<80)&(data["table"]>40)]
    data = data[(data["x"]<30)]
    data = data[(data["y"]<30)]
    data = data[(data["z"]<30)&(data["z"]>2)]
    # Check the shape again to confirm the rows were dropped
    data.shape
    # Redraw the pair plot with Seaborn's pairplot on the cleaned data
    ax=sns.pairplot(data, hue= "cut",palette=shade)
    # List the categorical (non-numeric) columns
    object_cols = list(data.select_dtypes(exclude="number").columns)
    print("Categorical variables:")
    print(object_cols)
    # Violin plot of cut vs. price
    plt.figure(figsize=(12,8))
    ax = sns.violinplot(x="cut",y="price", hue='cut', data=data, palette=shade,legend=False)
    ax.set_title("Violinplot For Cut vs Price", color="#4e4c39")
    ax.set_ylabel("Price", color="#4e4c39")
    ax.set_xlabel("Cut", color="#4e4c39")
    # Violin plot of color vs. price
    # Figure size
    plt.figure(figsize=(12,8))
    # Palette shade_1 for the color vs. price violin plot
    shade_1 = ["#835656","#b38182", "#baa0a0","#ffc7c8","#d0cd85", "#a9a799", "#65634a"]
    # Draw the violin plot with palette shade_1
    ax = sns.violinplot(x="color",y="price", hue='color',data=data, palette=shade_1,legend=False)
    # Title and axis labels, with their text color
    ax.set_title("Violinplot For Color vs Price", color="#4e4c39")
    ax.set_ylabel("Price", color="#4e4c39")
    ax.set_xlabel("Color", color="#4e4c39")
    # Violin plot of clarity vs. price
    # Figure size
    plt.figure(figsize=(12,8))
    # Palette shade_2 for the clarity vs. price violin plot
    shade_2 = ["#835656","#b38182", "#baa0a0","#ffc7c8","#f1f1f1","#d0cd85", "#a9a799", "#65634a"]
    # Draw the violin plot with palette shade_2
    ax = sns.violinplot(x="clarity",y="price",hue='clarity', data=data, palette=shade_2,legend=False)
    # Title and axis labels, with their text color
    ax.set_title("Violinplot For Clarity vs Price", color="#4e4c39")
    ax.set_ylabel("Price", color="#4e4c39")
    ax.set_xlabel("Clarity", color="#4e4c39")
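    # Work on a copy so the original data is not modified
    label_data = data.copy()
    # Create a LabelEncoder instance
    label_encoder = LabelEncoder()
    # Label-encode each categorical column (codes follow the alphabetical order of the labels)
    for col in object_cols:
        label_data[col] = label_encoder.fit_transform(label_data[col])
    # First five rows of the encoded data (wrap in print() to display them)
    label_data.head()
    # Descriptive statistics of the data
    data.describe()
    # Diverging colormap cmap for the correlation heatmap
    cmap = sns.diverging_palette(70,20,s=50, l=40, n=6,as_cmap=True)
    # Correlation matrix of the encoded data
    corrmat= label_data.corr()
    # Figure size
    f, ax = plt.subplots(figsize=(12,12))
    # Correlation heatmap using cmap, with numeric annotations
    sns.heatmap(corrmat,cmap=cmap,annot=True, )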
    # Features X are all columns except price; the target y is the price column
    X= label_data.drop(["price"],axis =1)
    y= label_data["price"]
    # Split into training and test sets: 25% test, random seed 7
    X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.25, random_state=7)
    # Build one pipeline per regression model: standardization followed by the model
    pipeline_lr=Pipeline([("scaler1",StandardScaler()), ("lr_regressor",LinearRegression())])
    pipeline_dt=Pipeline([("scaler2",StandardScaler()), ("dt_regressor",DecisionTreeRegressor())])
    pipeline_rf=Pipeline([("scaler3",StandardScaler()), ("rf_regressor",RandomForestRegressor())])
    pipeline_kn=Pipeline([("scaler4",StandardScaler()), ("kn_regressor",KNeighborsRegressor())])
    pipeline_xgb=Pipeline([("scaler5",StandardScaler()), ("xgb_regressor",XGBRegressor())])
    # Collect the pipelines in a list
    pipelines = [pipeline_lr, pipeline_dt, pipeline_rf, pipeline_kn, pipeline_xgb]
    # Map each pipeline index to a readable model name
    pipe_dict = {0: "LinearRegression", 1: "DecisionTree", 2: "RandomForest",3: "KNeighbors", 4: "XGBRegressor"}
    # Fit every pipeline on the training data
    for pipe in pipelines:
        pipe.fit(X_train, y_train)
    # List for the cross-validation results of each model
    cv_results_rms = []
    # Loop over the pipelines together with their index
    for i, model in enumerate(pipelines):
        # 10-fold cross-validation scored by negative root mean squared error
        cv_score = cross_val_score(model, X_train,y_train,scoring="neg_root_mean_squared_error", cv=10)
        # Store the cross-validation scores
        cv_results_rms.append(cv_score)
        # Print the mean cross-validation score of each model
        print("%s: %f " % (pipe_dict[i], cv_score.mean()))
    # Predict on the test data with the XGBRegressor pipeline
    pred = pipeline_xgb.predict(X_test)
    # Print the evaluation metrics
    print("R^2:",metrics.r2_score(y_test, pred))
    print("Adjusted R^2:",1 - (1-metrics.r2_score(y_test, pred))*(len(y_test)-1)/(len(y_test)-X_test.shape[1]-1))
    print("MAE:",metrics.mean_absolute_error(y_test, pred))
    print("MSE:",metrics.mean_squared_error(y_test, pred))
    print("RMSE:",np.sqrt(metrics.mean_squared_error(y_test, pred)))
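
The script encodes `cut`, `color`, and `clarity` with `LabelEncoder`, which assigns integers in alphabetical order of the labels. As an alternative (not what the original code does), the sketch below uses scikit-learn's `OrdinalEncoder` with an explicit worst-to-best order for each grade, so that larger codes mean higher quality. The category orders are assumed to match the labels in diamonds.csv, and the small DataFrame is made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Worst-to-best order of each grade (assumed to match the labels in diamonds.csv).
CUT_ORDER = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
COLOR_ORDER = ["J", "I", "H", "G", "F", "E", "D"]
CLARITY_ORDER = ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"]

# Made-up rows, only to demonstrate the encoding.
demo = pd.DataFrame(
    {
        "cut": ["Ideal", "Fair", "Premium"],
        "color": ["E", "J", "D"],
        "clarity": ["SI2", "IF", "VS1"],
    }
)

# One category list per column, in the same order as the columns passed in.
encoder = OrdinalEncoder(categories=[CUT_ORDER, COLOR_ORDER, CLARITY_ORDER])
encoded = pd.DataFrame(
    encoder.fit_transform(demo[["cut", "color", "clarity"]]),
    columns=["cut", "color", "clarity"],
    index=demo.index,
)
print(encoded)
```

Whether this changes the scores would need to be checked on the data. An order-preserving encoding is likely to matter more for linear regression and k-nearest neighbors than for the tree-based models.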

Click the Run button in the upper-left corner to run the code:

Output:

Categorical variables:
['cut', 'color', 'clarity']
LinearRegression: -209.588493 
DecisionTree: -57.556786 
RandomForest: -48.680239 
KNeighbors: -58.089888 
XGBRegressor: -66.447094 
R^2: 0.9958575367927551
Adjusted R^2: 0.9957021944224834
MAE: 40.27821731567383
MSE: 2548.892578125
RMSE: 50.4865583905756
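
Two details of this output are easy to misread. The cross-validation scores use scikit-learn's `neg_root_mean_squared_error` scoring, so they are negated RMSE values and the score closest to zero is best. The adjusted R² follows 1 - (1 - R²)(n - 1)/(n - p - 1). The sketch below re-derives these figures from the printed numbers. The test-set size `n = 250` and the feature count `p = 9` are assumptions inferred from the script (1,000 rows, a 25% test split, nine feature columns); the script does not print them.

```python
import math

# Mean 10-fold CV scores copied from the output above (negated RMSE).
cv_scores = {
    "LinearRegression": -209.588493,
    "DecisionTree": -57.556786,
    "RandomForest": -48.680239,
    "KNeighbors": -58.089888,
    "XGBRegressor": -66.447094,
}

# Flip the sign to get RMSE in price units; smaller is better.
cv_rmse = {name: -score for name, score in cv_scores.items()}
best_model = min(cv_rmse, key=cv_rmse.get)
print("Lowest CV RMSE:", best_model, cv_rmse[best_model])


def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Test-set metrics copied from the output above.
r2 = 0.9958575367927551
mse = 2548.892578125

# Assumed sizes: 250 test rows (25% of 1,000) and 9 feature columns.
print("Adjusted R^2:", adjusted_r2(r2, n_samples=250, n_features=9))
print("RMSE:", math.sqrt(mse))
```

On these numbers RandomForest has the lowest cross-validated RMSE, while the script reports test metrics for the XGBRegressor pipeline. Evaluating the best cross-validated pipeline on the test set as well would be a natural follow-up.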

Plots: