How I Created a Data Science Project Following a CRISP-DM Lifecycle

Published 2025-12-20

CC 4.0 BY-SA license

License: CC BY-NC-SA 4.0 / Proudly powered by Google Translate


Original article: towardsdatascience.com/how-i-created-a-data-science-project-following-a-crisp-dm-lifecycle-8c0f5f89bba1

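Introduction

CRISP-DM stands for Cross-Industry Standard Process for Data Mining, a data mining framework open to anyone who wants to use it.

Its first version was created by SPSS, Daimler-Benz and NCR. A group of companies then developed and evolved it into CRISP-DM, which today is one of the best-known and most widely adopted frameworks in data science.

The process consists of 6 phases, and it is flexible. It behaves more like a living organism: you can move back and forth between the phases, iterating and improving the results.

The phases are:

Business Understanding

Data Understanding

Data Preparation

Modeling

Evaluation

Deployment

The small arrows show the natural path from Business Understanding to Deployment, where the phases interact directly, while the circle denotes the cyclical relationship between the phases. This means the project does not end with Deployment; it can be restarted because of new business questions raised by the project or adjustments that may be needed.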

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/42bf6e17e50188a1773b363dec9588cd.png

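CRISP-DM. Image credit: Wikipedia

In this post, we will follow a project throughout its lifecycle using the CRISP-DM steps. Our main objective is to show how using this framework benefits both the data scientist and the company.

Let's dive in.

The Project

Let's go over a project that follows the CRISP-DM framework.

In summary, our project is to create a classification model that estimates the probability of a customer subscribing to a term deposit at our client's institution, a bank.

Here is the GitHub repository, in case you want to code along or follow it while reading the article.

GitHub – gurezende/CRISP-DM-Classification: End-to-end classification project using CRISP-DM…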

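Business Understanding

Understanding the business is crucial for any project, not just data science projects. We must know things like:

What is the business?
What is its product?
What are we selling or offering?
What is expected from this project?
What is the definition of success?
Metrics

In this project, we are working with a bank, so we are talking about the finance industry. Our client offers financial solutions that make it easy for people to receive, save and invest their money in a secure environment.

The client reached out to us to discuss some phone-based direct marketing campaigns aimed at converting customers to a financial product (a term deposit). However, they feel they are wasting their managers' time and effort to get the expected results, so the client wants to increase and optimize conversions by focusing on the customers with a higher probability of converting.

Of course, business is a complex subject. Several factors can affect the result of a campaign, but for the sake of simplicity we will go straight to this solution:

Create a predictive model that gives the managers the probability that a customer will convert.

With that information, the managers would have a tool for selecting the calls with a higher chance of success, rather than customers who would require more work along the way.

Therefore, the success definition for this project is estimating the probability of conversion, and the metric for the model will be the F1 score. For the business, the metric could be the conversion rate, which would be compared in a before-and-after study.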
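Since the F1 score is the model metric, it may help to see how it is computed. The snippet below is a minimal sketch: the helper `f1_from_counts` is hypothetical and the confusion-matrix counts are made up for illustration, they are not results from this project.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 is the harmonic mean of precision and recall."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)  # of the customers we flagged, how many converted
    recall = tp / (tp + fn)     # of the customers who converted, how many we flagged
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only (not taken from the bank dataset)
tp, fp, fn = 60, 40, 90
score = f1_from_counts(tp, fp, fn)
print(f"precision={tp / (tp + fp):.2f} recall={tp / (tp + fn):.2f} F1={score:.3f}")
```

Because it is a harmonic mean, F1 stays low whenever either precision or recall is low, which is why it suits a problem where the positive class is rare.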

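Next, we need to start touching the data.

Data Understanding

The dataset we will use is the Bank Marketing dataset from the UCI Machine Learning Repository. It is open source under the Creative Commons 4.0 license.

The modules installed and imported in this project can be found on the project's GitHub page.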

!pip install ucimlrepo --quiet
import pandas as pd
from ucimlrepo import fetch_ucirepo

# fetch dataset
bank_marketing = fetch_ucirepo(id=222)

# data (as pandas dataframes)
df = pd.concat([bank_marketing.data.features, bank_marketing.data.targets], 
               axis=1)
df = df.rename(columns={'day_of_week':'day'})

# View
df.sample(3)

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/aa202fda060736f239fa1b806a05a2a5.png

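First look at the imported dataset. Image by the author.

Before we start working with the data, we will split it into training and test sets to make sure we avoid data leakage.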

# Split in train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.drop('y', axis=1),
                                                    df['y'],
                                                    test_size=0.2,
                                                    stratify=df['y'],
                                                    random_state=42)

# train
df_train = pd.concat([X_train, y_train], axis=1)

# test
df_test = pd.concat([X_test, y_test], axis=1)

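Great. Now we are ready to move on and understand the data. This is also known as Exploratory Data Analysis (EDA).

Exploratory Data Analysis

The first step in an EDA is to describe the data statistically. This brings insights that help us start understanding the data, such as spotting variables with potential errors or outliers, getting a sense of the distributions and averages, and learning which categories are the most frequent for the categorical variables.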

# Statistical description
df_train.describe(include='all').T

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/516110b3f444239c192c272cf9cc763b.png

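Statistical description of the data. Image by the author.

This simple command gives us the following insights:

The customers are 40 years old on average. The distribution is right-skewed.
More than 20% of the customers are blue-collar workers.
Most customers are married, have a secondary education level and have a housing loan.
Only about 2% of the customers have a payment default.
The conversion rate is about 11.7%.
The data is highly imbalanced toward the negative class.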

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/b82bf373ded7389d8711c8cf8830e681.png

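Target variable. The negative class dominates. Image by the author.

Once we know the distribution of the target variable, we need to understand how the predictor variables interact with it, trying to figure out which ones could be better at modeling its behavior.

Age versus Conversion | Customers who converted in the campaign are slightly younger than those who did not. However, although the KS test shows that the two distributions are statistically different, they look similar visually.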

#Sample 1 - Age of the converted customers
converted = df_train.query('y == "yes"')['age']

#Sample 2 - Age of the not converted customers
not_converted = df_train.query('y == "no"')['age']

# Kolmogorov-Smirnov Test
# The null hypothesis is that the two distributions are identical
from scipy.stats import ks_2samp
statistic, p = ks_2samp(converted, not_converted)

if p > 0.05:
    print("The distributions are identical.")
else:
    print("The distributions are not identical: p-value ==", round(p,10))

----------
[OUT]:
The distributions are not identical: p-value == 0.0

# Age versus Conversion
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure( figsize=(10,5))
ax = sns.boxenplot(data=df_train, x='age', y='y', hue='y', alpha=0.8)
plt.suptitle('Age versus Conversion')
plt.ylabel('Converted')
plt.title('Conversions are concentrated between 30 and 50 years old, which is not that different from the not converted', size=9)

# Annotation
# Medians and Averages
median_converted = df_train.query('y == "yes"')['age'].median()
median_not_converted = df_train.query('y == "no"')['age'].median()
avg_converted = df_train.query('y == "yes"')['age'].mean()
avg_not_converted = df_train.query('y == "no"')['age'].mean()
# Annotation - Insert text with Average and Median for each category
plt.text(95, 0, f"Avg: {round(avg_not_converted,1)} \nMedian: {median_not_converted}",
         ha="center", va="center", rotation=0,
         size=9, bbox=dict(boxstyle="roundtooth, pad=0.5", fc="lightblue",
         ec="r", lw=0))
plt.text(95, 1, f"Avg: {round(avg_converted,1)} \nMedian: {median_converted}",
         ha="center", va="center", rotation=0,
         size=9, bbox=dict(boxstyle="roundtooth, pad=0.5", fc="orange", 
         ec="r", lw=0));

The previous code produces this visualization.

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/6ae308758e58ba25e03d16402779bc77.png

Age versus Conversion. Image by the author.
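One likely reason the KS test calls two visually similar distributions "different" is sample size: with tens of thousands of rows, even a tiny shift gives a very small p-value. The sketch below uses synthetic, normally distributed ages (an assumption for illustration, not the bank data) to show the effect. The KS statistic itself is the more useful number for judging how large the difference is.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

# Two synthetic "age" samples whose means differ by only one year
sample_a = rng.normal(loc=40, scale=10, size=20_000)
sample_b = rng.normal(loc=41, scale=10, size=20_000)

statistic, p_value = ks_2samp(sample_a, sample_b)

# The statistic is the largest gap between the two empirical CDFs (0 to 1)
print(f"KS statistic: {statistic:.3f}, p-value: {p_value:.2e}")
```

A small statistic together with a tiny p-value reads as "reliably different, but not by much", which matches what the boxen plot shows.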

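Job versus Conversion | Customers in management positions convert the most, followed by technicians, blue-collar workers, administrative staff and retirees.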

# job versus Conversions == "YES"
converted = df_train.query('y == "yes"')
plt.figure( figsize=(10,5))
# order of the bars from highest to lowest
order = df_train.query('y == "yes"')['job'].value_counts().index
# Plot and title
ax = sns.countplot(data=converted,
                   x='job',
                   order=order,
                   palette= 5*["#4978d0"] + 6*["#7886a0"])
plt.suptitle('Job versus Converted Customers')
plt.title('Most of the customers who converted are in management jobs. \n75% of the conversions are concentrated in 5 job-categories', size=9);
# X label rotation
plt.xticks(rotation=80);
#add % on top of each bar
for pct in ax.patches:
    ax.annotate(f'{round(pct.get_height()/converted.shape[0]*100,1)}%',
                (pct.get_x() + pct.get_width() / 2, pct.get_height()),
                ha='center', va='bottom')

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/b2893f28418b06adc266e8b44b8a2d35.png

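Job versus Conversion. Image by the author.

Well, it does not make much sense to keep repeating code here just to show the visualizations, so from now on I will show only the graphics and the analysis. Again, everything can be found in this GitHub repository.

Marital status versus Conversion | Married customers convert more to the term deposit.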

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/5c2e7b8834fca7f8d7af64dc4fcc7906.png

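Marital status versus Conversion. Image by the author.

Education versus Conversion | More educated people convert more to the financial product. However, the distribution of the converted customers follows the distribution of the dataset, so this variable will probably not differentiate conversions from non-conversions.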

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/3290a454805c0cc1a8ff4825220ace76.png

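Education versus Conversion. Image by the author.

Balance versus Conversion | Customers with a higher account balance convert more. We tested the samples for statistical significance, and there is indeed a difference.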

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/8dde455d8b66425a6f93a9ecf4aa9955.png

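Balance versus Conversion. Image by the author.

In the previous plot, we arbitrarily removed the data points above the 98th percentile so the visualization would look better. We can see that the customers who converted generally have higher balances, but the plot alone cannot tell us whether there is a statistical difference between the two groups. Let's test it. Given that the distributions are heavily right-skewed, we will use a non-parametric test, the Kolmogorov-Smirnov test.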

#Sample 1 - Balance of the converted customers
converted = df_train.query('y == "yes"')['balance']

#Sample 2 - Balance of the not converted customers
not_converted = df_train.query('y == "no"')['balance']

# Kolmogorov-Smirnov Test
# The null hypothesis is that the two distributions are identical
from scipy.stats import ks_2samp
statistic, p = ks_2samp(converted, not_converted)

if p > 0.05:
    print("The distributions are identical.")
else:
    print("The distributions are not identical: p-value ==", round(p,4))

---------
[OUT]: 
The distributions are not identical: p-value == 0.0

Are there people with a negative balance converting to a term deposit? Common sense says that, in order to make a deposit, you must have money available. So a customer with a negative balance should not be able to convert to a deposit. However, we will see that it does happen.

neg_converted = df_train.query('y == "yes" & balance < 0').y.count()
pct = round(neg_converted/df_train.query('y == "yes"').y.count()*100,1)
print(f'There are {neg_converted} conversions from people with negative acct balance. \nThis represents {pct}% of the total count of customers converted.')

---------
[OUT]:
There are 161 conversions from people with negative acct balance. 
This represents 3.8% of the total count of customers converted.

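Duration versus Conversion | In this plot, we can visually notice the impact of the duration of the phone call on conversions. Customers who converted stayed on the call twice as long as the others, or more.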

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/a9078d3bc7aff0a25e5ef225f22b53e2.png

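Duration versus Conversion. Image by the author.

Campaign contacts versus Conversion | In general, customers who converted received 2 to 4 contacts. After the 5th contact, the points for converted customers start to become sparse. For those not converted, the points stay more consistent up to around 13 contacts.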

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/b8b4dcdd9a27da0e1ef57a88ab591f07.png

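Campaign contacts versus Conversion. Image by the author.

Previous contacts versus Conversion | It looks like more previous contacts can influence a customer to convert. We notice in the graphic that the converted customers received a few more calls than those not converted.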

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/7890f47d599660156f8753b8e48188a3.png

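Previous contacts versus Conversion. Image by the author.

Previous campaign outcome versus Conversion | Customers who converted in the past are more likely to convert again. Likewise, customers with past failures tend to repeat the failure.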

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/ee29cbecb3bc360b1ead09ced469e2e6.png

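Previous outcome versus Conversion. Image by the author.

Contact method versus Conversion | Although more conversions come from customers contacted by cell phone, that only reflects that there are fewer landlines. The proportion of conversions is similar for both contact types.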

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/2284dd543aeb816321349d109fe46305.png

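Contact method versus Conversion. Image by the author.

Month versus Conversion | There are more conversions in the mid-year months; however, roughly 76% of the calls were made in those months. Possibly the campaign ran more heavily during that period.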

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/2338b8023955b14e6f9e2b97f58db003.png

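Month versus Conversion. Image by the author.

Day versus Conversion | Conversions mostly happen around the most probable payment days: the 5th, 15th and 30th. We can notice higher peaks around these dates.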

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/52d74f4e5d770c54c80620b5553b200a.png

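Day versus Conversion. Image by the author.

Days since last contact versus Conversion | Most conversions came from customers who had been contacted within 100 days of the previous campaign.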

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/3c061ee49fe6ab15163dd1c9d0674fbd.png

pDays versus Conversion. Image by the author.

Most conversions (64%) are made in the first contact.

# The impact of the recency of the contact over conversions
total = df_train.query('y == "yes"').y.count()
print('First contact:', round( df_train.query('y == "yes" & pdays == -1').y.count()/total*100, 0 ), '%')
print('Up to 180 days:', round( df_train.query('y == "yes" & pdays > 0 & pdays <= 180').y.count()/total*100, 0 ), '%')
print('More than 180 days:', round( df_train.query('y == "yes" & pdays > 180').y.count()/total*100, 0 ), '%')

-------
[OUT]:
First contact: 64.0 %
Up to 180 days: 18.0 %
More than 180 days: 18.0 %

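However, this is not different from the majority of the data. Among the customers who did not convert, the share on their first contact is even higher (84%).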

# The impact of the recency of the contact over Not converted
total = df_train.query('y == "no"').y.count()
print('First contact:', round( df_train.query('y == "no" & pdays == -1').y.count()/total*100, 0 ), '%')
print('Up to 180 days:', round( df_train.query('y == "no" & pdays > 0 & pdays <= 180').y.count()/total*100, 0 ), '%')
print('More than 180 days:', round( df_train.query('y == "no" & pdays > 180').y.count()/total*100, 0 ), '%')

-------
[OUT]:
First contact: 84.0 %
Up to 180 days: 6.0 %
More than 180 days: 10.0 %
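The two outputs above show how each class is spread across the recency buckets. A complementary view is the conversion rate inside each bucket, which answers "of the customers in this bucket, how many converted?". The sketch below uses a tiny made-up table and a hypothetical helper `recency_bucket`; only the column names `pdays` and `y` come from the dataset.

```python
import pandas as pd

# Tiny made-up sample: pdays == -1 means the customer was never contacted before
toy = pd.DataFrame({
    "pdays": [-1, -1, -1, -1, 30, 90, 150, 200, 300, 400],
    "y": ["no", "no", "no", "yes", "yes", "no", "yes", "no", "yes", "no"],
})


def recency_bucket(pdays: int) -> str:
    """Map days since the last contact to the three buckets used above."""
    if pdays == -1:
        return "first contact"
    return "up to 180 days" if pdays <= 180 else "more than 180 days"


toy["recency"] = toy["pdays"].map(recency_bucket)

# Share of converted customers inside each bucket
rate = (toy["y"] == "yes").groupby(toy["recency"]).mean()
print(rate)
```

On the real training set the same `groupby` would run on `df_train`, assuming the columns are named as in this article.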

Housing versus Conversion | There are more conversions from people without a housing loan: 1.7 times more.

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/78ce2733c90543095d8b640222edc175.png

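Housing loan versus Conversion. Image by the author.

Personal loan versus Conversion | There are more conversions from people without a personal loan. Although this follows the overall distribution, people without a loan convert proportionally more.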

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/86d547b14eb2e4a9265210e0186e1960.png

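Personal loan versus Conversion. Image by the author.

Default versus Conversion | Conversions come almost entirely from people without payment defaults, which makes sense, since those in default probably have no money.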

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/cce7c9612c1b187cfb13bfa64850b1c3.png

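Default versus Conversion. Image by the author.

People without a default converted at twice the rate (12%) of those with a default on record.

Next, we are ready to write a summary of the findings.

Exploration Summary

After a thorough exploration of the data, we can summarize it as follows:

The converter profile is a person aged 38 to 41, working in a management role, married, with a good education of at least secondary level, keeping a positive balance in their account, with no housing or personal loan and therefore less debt.
Most conversions happened in the first contact (64%).
Customers who did not convert on the first contact received 2 to 4 contacts before converting.
The more contacts in the current campaign, the lower the probability that the customer converts.
Customers who were never contacted before need, on average, more contacts than existing customers.
People contacted more than 10 times in previous campaigns have a higher probability of converting.
Contacts from previous campaigns can influence conversion in the current campaign, possibly indicating that the relationship over time matters.
Customers who converted in previous campaigns are more likely to convert again, while those who failed to convert also show a tendency not to convert again.
The longer the contact, the higher the chance of conversion. Converted customers stayed on the call about four times longer than those not converted. However, we cannot use call duration as a predictor.

The graphics from the exploration show that the variables duration, job, marital, balance, previous, campaign, default, housing and loan are interesting for the model, since they have a more direct impact on the target variable. However, duration cannot be used, because the duration of a call is not known until the call ends. The variable poutcome also looks promising, but it has too many NAs, so it needs further treatment before it can be considered.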

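Data Preparation

Understanding the data is very important for better modeling. After the initial insights, we have an idea of what could drive more separation between the classes.

The next step is to prepare this dataset for modeling, transforming variables into categories or numbers, since many data science algorithms require numbers as input.

Let's get to work.

Dealing with Missing Data

Missing data can break our model, so we must treat it by dropping those observations or imputing values for them.

Here are our missing data points.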

# Checking for missing data
df_train.isna().sum()[df_train.isna().sum() > 0]

-------
[OUT]:

job 234
education 1482
contact 10386
poutcome 29589

Starting with job: out of those 234 NAs, we see that 28 converted customers (0.6%) would be lost if we drop them.

# NAs in job
 (df_train #data
 .query('job != job') # only NAs
 .groupby('y') #group by target var
 ['y']
 .count() #count values
 )

-------
[OUT]:

y 
no 206
yes 28

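In this case, there are three options:

Drop the NAs: only 0.6%, which probably will not make a difference.
Use a Random Forest to predict the job type.
Fill in the most frequent job category, which is blue-collar.

We will move on with dropping, since we consider the number too small to be worth predicting the job.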

# Check the impact of NAs for the job variable in the conversions
df_train.query('job != job').groupby('y')['y'].value_counts()

# Drop NAs.
df_train_clean = df_train.dropna(subset='job')

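Next, let's look at the missing values in education. There are 1482 missing entries, and 196 of those are Yes, which represents 4.6% of the converted customers. That is a considerable number of converted observations to drop.

In this case, we will use the CategoricalImputer from feature_engine to fill these NAs with the most frequent education category.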

# Check the impact of NAs for the education variable in the conversions
df_train.query('education != education').groupby('y')['y'].value_counts()

# Simple Imputer
from feature_engine.imputation import CategoricalImputer
imputer = CategoricalImputer(
    variables=['education'],
    imputation_method="frequent"
)

# Fit and Transform
imputer.fit(df_train_clean)
df_train_clean = imputer.transform(df_train_clean)
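For readers without feature_engine installed, the same idea can be sketched with plain pandas. This is a swapped-in technique, not the article's code: the two small frames are made up, and the point is only that the most frequent category is learned from the training data and then reused on any other data.

```python
import pandas as pd

# Made-up frames standing in for the training and test sets
train = pd.DataFrame({"education": ["secondary", "tertiary", None, "secondary", "primary"]})
test = pd.DataFrame({"education": [None, "tertiary"]})

# Learn the most frequent category from the training data only
most_frequent = train["education"].mode().iloc[0]

# Apply that same value to both sets, so nothing is learned from the test data
train_filled = train.fillna({"education": most_frequent})
test_filled = test.fillna({"education": most_frequent})

print(most_frequent, test_filled["education"].tolist())
```

Fitting on the training set and only transforming the test set is what keeps the imputation step free of data leakage.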

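For poutcome, we must come up with a new category. This variable shows the result of the previous marketing campaign. According to our insights from the exploration phase, customers who converted in the past are more likely to convert again, so this variable becomes interesting for the model. However, there are a lot of missing values, and they need to go into a separate category so that we do not bias our model by imputing a large share of the data. We will input "unknown" for the NAs, as stated in the data documentation.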

# Input "unknown" for NAs.
df_train_clean['poutcome'] = df_train_clean['poutcome'].fillna('unknown')

For contact, we will fill the NAs with "unknown", just as the data documentation says.

# Fill NAs with "unknown"
df_train_clean['contact'] = df_train_clean['contact'].fillna('unknown')

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/3d7fd4103f4cc49ffe861e3c2d006bc0.png

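Data cleaned of missing values. Image by the author.

Next, there are other transformations needed in this dataset.

Categorical Transformation

Many models do not deal well with categorical data. Therefore, we need to transform the data into numbers using some type of encoding. Here is the strategy for this project:

education, contact, balance, marital, job and poutcome: for these variables, One Hot Encoding is probably ideal.
default, housing, loan and y are binary variables and will be mapped to no: 0 and yes: 1.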

# Binarizing default, housing, loan, and y
df_train_clean = df_train_clean.replace({'no': 0, 'yes': 1})

Before One Hot Encoding, balance needs to be binned first.

import numpy as np

# Balance in 3 categories: <0 = 'negative', 0 to median = 'avg', >median = 'over avg'
df_train_clean = (
    df_train_clean
    .assign(balance = lambda x: np.where(x.balance < 0,
                                          'negative',
                                          np.where(x.balance < x.balance.median(),
                                                   'avg',
                                                   'over avg')
                                          )
    )
)

# One Hot Encoding for 'marital', 'poutcome', 'education', 'contact', 'job', 'balance'
from feature_engine.encoding import OneHotEncoder

# Instance
ohe = OneHotEncoder(variables=['marital', 'poutcome', 'education', 'contact', 'job', 'balance'], drop_last=True)

# Fit
ohe.fit(df_train_clean)

# Transform
df_train_clean = ohe.transform(df_train_clean)

# Move y to the first column
df_train_clean.insert(0, 'y', df_train_clean.pop('y'))

Next, the month is transformed into a numeric variable.

# Month to numbers
df_train_clean['month'] = df_train_clean['month'].map({ 'jan':1, 'feb':2, 'mar':3, 'apr':4, 'may':5, 'jun':6, 'jul':7, 'aug':8, 'sep':9, 'oct':10, 'nov':11, 'dec':12})

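The other numeric variables will be categorized (binned) to reduce the number of distinct values, which can help classification models find patterns.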

# Function to replace the variable data with the new categorized bins
def variable_to_category(data, variable, k):
  return pd.cut(data[variable], bins=k).astype(str)

# Transforming variable Age into bins
# Using Sturges rule, where number of bins k = 1 + 3.3*log10(n)
k = int( 1 + 3.3*np.log10(len(df_train_clean)) )

# Categorize age, pdays, previous
for var in str.split('age,pdays,previous', sep=','):
  df_train_clean[var] = variable_to_category(df_train_clean, var, k=k)

# CatBoost Encoding the dataset
import category_encoders as ce
df_train_clean = ce.CatBoostEncoder().fit_transform(df_train_clean, df_train_clean['y'])

# View of the final dataset for modeling
df_train_clean.sample(5)

Next, you can see a partial view of the final data used for modeling.

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/4eafb02f949b04990b3bf969a5c47445.png

Cleaned and transformed data for modeling. Image by the author.
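The binning code above picks the number of bins with Sturges' rule, k = 1 + 3.3 * log10(n). As a quick worked example, the sketch below evaluates the rule for a few illustrative sample sizes (the sizes are assumptions, and `sturges_bins` is a hypothetical helper that mirrors the expression used in the article).

```python
import math


def sturges_bins(n: int) -> int:
    """Number of bins k = 1 + 3.3 * log10(n), truncated to an integer."""
    return int(1 + 3.3 * math.log10(n))


# Illustrative sample sizes only
for n in (100, 10_000, 36_000):
    print(n, sturges_bins(n))
```

The rule grows with the logarithm of n, so even a training set with tens of thousands of rows ends up with only about 16 bins.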

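Modeling comes next in the sequence.

Modeling

Once the data is prepared and transformed, we can start modeling. For this step, we will test many algorithms to see which one performs best. Given that the data is hugely imbalanced, with 88% of the observations classified as no, we will use class weights.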
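To give a feel for what class weights look like with an 88/12 split, here is a minimal sketch of the common "balanced" heuristic, weight = n_samples / (n_classes * class_count), which is the same formula scikit-learn uses for `class_weight='balanced'`. The helper `balanced_class_weights` and the label list are made up for illustration; the weights actually used in this project are chosen later during tuning.

```python
from collections import Counter


def balanced_class_weights(labels):
    """weight_c = n_samples / (n_classes * count_c), the 'balanced' heuristic."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


# Made-up labels with the same 88/12 split mentioned above
labels = [0] * 88 + [1] * 12
weights = balanced_class_weights(labels)
print(weights)
```

With this heuristic, each error on a converted customer weighs roughly seven times as much as an error on a non-converted one.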

For this initial test, let's take a random sample of 10k observations so it runs faster.

# X and y sample for testing models
df_sample = df_train_clean.sample(10_000)
X = df_sample.drop(['y'], axis=1)
y = df_sample['y']

The code for the test is quite extensive, but it can be seen in the GitHub repository.

# Example of using the function with your dataset
results = test_classifiers(X, y)
print(results)

-------
[OUT]:
               Classifier  F1 Score  Cross-Validated F1 Score
0                Catboost  0.863289                  0.863447
1             Extra Trees  0.870542                  0.862850
2       Gradient Boosting  0.868414                  0.861208
3                 XGBoost  0.858113                  0.858268
4           Random Forest  0.857215                  0.855420
5                AdaBoost  0.858410                  0.851967
6     K-Nearest Neighbors  0.852051                  0.849515
7           Decision Tree  0.831266                  0.833809
8  Support Vector Machine  0.753743                  0.768772
9     Logistic Regression  0.747108                  0.762013

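The models that performed best for this problem were the boosting ones. CatBoost was the top estimator, so we will work with it from now on.

Let's move on with a new split and test, now on the whole cleaned training set.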

# Split X and y
X = df_train_clean.drop(['y', 'duration'], axis=1)
y = df_train_clean['y']

# Split Train and Validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

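Let's start with a base model that uses all the columns and try to tune it from that starting point.

from catboost import CatBoostClassifier
from sklearn.metrics import (confusion_matrix, classification_report,
                             accuracy_score, f1_score)

model = CatBoostClassifier(verbose=False)
# train the model
model.fit(X_train, y_train)

prediction = model.predict(X_val)

# confusion matrix
cm = pd.DataFrame(confusion_matrix(y_val, prediction)  )
print ("Confusion Matrix : \n")
display(cm)

# Evaluate the base model
print('Base Model:')
print(classification_report(y_val, prediction))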

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/f1623f4ed23f14941c478770f5c4cbb3.png

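Results of the base model. Image by the author.

As expected, the model does very well on the negative class, since the imbalance is heavily tilted toward it. Even if the model classified everything as "no", it would still be correct 88% of the time. That is why accuracy is not the best metric for classification. The precision for the positive class is not bad, but the recall is terrible. Let's tune this model.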
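The "always no" argument is easy to check. The sketch below builds made-up labels with the same 88/12 split (not the real validation set) and scores a constant prediction with scikit-learn's metrics.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Made-up validation labels: 88% negative, 12% positive
y_true = np.array([0] * 88 + [1] * 12)

# A "model" that always answers "no"
y_always_no = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_always_no)
rec = recall_score(y_true, y_always_no, zero_division=0)
f1 = f1_score(y_true, y_always_no, zero_division=0)

print("Accuracy:", acc)
print("Recall  :", rec)
print("F1      :", f1)
```

Accuracy looks fine while recall and F1 for the positive class collapse to zero, which is the reason this project tracks F1.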

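To do that, I ran a GridSearchCV and tested several values of learning_rate, depth, class_weights, border_count and l2_leaf_reg. The hyperparameters:

border_count: controls the number of binning thresholds for numeric features. Lower values (for example, 32 or 64) can reduce overfitting, which may help the model generalize better on imbalanced data.
l2_leaf_reg: adds L2 regularization to the model. Higher values (for example, 5 or 7) penalize the model, reducing its complexity and potentially preventing it from leaning too much toward the majority class.
depth: controls how deep the decision trees should go.
learning_rate: the size of the learning step each time the algorithm adjusts its weights in an iteration.
class_weights: good for imbalanced data, since we can give a higher weight to the minority class.

The grid search returned the following:

Best parameters: {'border_count': 64, 'class_weights': [1, 3], 'depth': 4, 'l2_leaf_reg': 5, 'learning_rate': 0.1}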
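The grid search code itself lives in the repository. As a rough idea of its shape, here is a minimal sketch under stated assumptions: it uses scikit-learn's GradientBoostingClassifier as a stand-in for CatBoost, synthetic data from `make_classification` instead of the bank dataset, and a deliberately tiny grid so it runs in seconds. The parameter names therefore differ from the CatBoost ones listed above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data standing in for the bank dataset
X_demo, y_demo = make_classification(n_samples=600, n_features=8,
                                     weights=[0.88, 0.12], random_state=42)

# Deliberately tiny grid; a real search would cover more values
param_grid = {"learning_rate": [0.05, 0.1], "max_depth": [2, 4]}

search = GridSearchCV(
    estimator=GradientBoostingClassifier(n_estimators=30, random_state=42),
    param_grid=param_grid,
    scoring="f1",  # same metric the project optimizes
    cv=3,
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

Setting `scoring="f1"` matters here: with the default accuracy scoring, the search would tend to reward models that simply favor the majority class.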

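Here, I considered that a false positive (predicting 1 when the truth is 0) is worse than a false negative (a true 1 classified as 0). That is because, as a manager, if I see a customer with a higher probability of converting, I would not want to spend effort on a call that turns out to be a false positive. On the other hand, if I call someone with a lower probability and that person converts, I have made my sale.

So I made a few manual adjustments based on that and arrived at this code snippet.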

# Tuning the estimator
model2 = CatBoostClassifier(iterations=300,
                            depth=5,
                            learning_rate=0.1,
                            loss_function='Logloss',
                            eval_metric='F1',
                            class_weights={0: 1, 1: 3},
                            border_count= 64,
                            l2_leaf_reg= 13,
                            early_stopping_rounds=50,
                            verbose=1000)

# train the model
model2.fit(X_train, y_train)

prediction2 = model2.predict(X_val)

# confusion matrix
cm = pd.DataFrame(confusion_matrix(y_val, prediction2)  )
print ("Confusion Matrix : \n")
display(cm)

# Evaluate the tuned model
print('Tuned Catboost:')
print(classification_report(y_val, prediction2))
print('F1:', f1_score(y_val, prediction2))
print('Accuracy:', accuracy_score(y_val, prediction2))

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/6e150476fee18d8d22c761a1e058a2c6.png

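Results of the tuned model. Image by the author.

Now, we can still run a Recursive Feature Elimination to select fewer variables and try to make this model simpler.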

df_train_selected = df_train_clean[['age',  'job_admin.', 'job_services', 'job_management', 'job_blue-collar', 'job_unemployed', 'job_student', 'job_technician',
                                    'contact_cellular', 'contact_telephone', 'job_retired', 'poutcome_failure', 'poutcome_other', 'marital_single', 'marital_divorced',
                                    'previous', 'pdays', 'campaign', 'month', 'day', 'loan', 'housing', 'default', 'poutcome_unknown', 'y']]
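The column list above is the outcome of that selection; the elimination code itself is in the repository. A minimal sketch of how Recursive Feature Elimination works with scikit-learn's `RFE` is shown below. The estimator (a random forest), the synthetic data and the number of features to keep are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic data: 10 features, only 3 of them informative
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     n_informative=3, random_state=42)

# RFE fits the estimator, drops the least important feature (step=1),
# and repeats until only n_features_to_select remain
selector = RFE(
    estimator=RandomForestClassifier(n_estimators=50, random_state=42),
    n_features_to_select=4,
    step=1,
)
selector.fit(X_demo, y_demo)

kept = [i for i, keep in enumerate(selector.support_) if keep]
print("Kept feature indices:", kept)
```

With a DataFrame as input, `selector.support_` can be used to index the column names and build a list like the one above.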

Here are the results.

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/04c9c64750b22fb3abc48b76ac336031.png

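Results of the model with selected variables. Image by the author.

Although the variable duration is a good separator of the classes, it cannot be used, since the duration of a call is not known until the call ends. But if we could know it, these would be the results.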

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/f319ef62646493595993a79e208ba7dc.png

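Results with the variable duration. Image by the author.

Look how considerably the F1 score improves!

I also tried some ensemble models, such as a VotingClassifier and a StackingClassifier. The results are shown next.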

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/e92169c6120223e852c9b21337e4cd72.png

Voting and Stacking classifiers. Image by the author.
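For reference, this is roughly how the two ensembles are assembled in scikit-learn. It is a sketch under assumptions: the base estimators, the synthetic data and the split are illustrative choices, not the configuration used in the repository.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (RandomForestClassifier, StackingClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=600, n_features=8,
                                     weights=[0.8, 0.2], random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.25,
                                          stratify=y_demo, random_state=42)

# Illustrative base estimators (both ensembles clone them before fitting)
base = [("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
        ("tree", DecisionTreeClassifier(max_depth=4, random_state=42)),
        ("logit", LogisticRegression(max_iter=1000))]

# Soft voting averages the predicted probabilities of the base models
voting = VotingClassifier(estimators=base, voting="soft").fit(X_tr, y_tr)

# Stacking trains a final estimator on the base models' predictions
stacking = StackingClassifier(estimators=base,
                              final_estimator=LogisticRegression(max_iter=1000),
                              cv=3).fit(X_tr, y_tr)

for name, clf in [("Voting", voting), ("Stacking", stacking)]:
    print(name, "F1:", round(f1_score(y_va, clf.predict(X_va)), 3))
```

Voting only combines the base models' outputs, while stacking learns how much to trust each of them, at the cost of extra cross-validated fits.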

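Having trained enough models, it is time to evaluate the results and possibly iterate to adjust the best model.

Evaluation

I like to create a table to show the results of the models. That makes it easier to compare them all together.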

pd.DataFrame({
    'Model':['Catboost Base', 'Catboost Tuned', 'Catboost Selected Variables', 'Voting Classifier', 'Voting Classifier + SMOTE', 'Catboost + duration', 'Stacking Classifier'],
    'F1 Score': [f1_score(y_val, prediction), f1_score(y_val, prediction2), f1_score(ys_val, prediction3), f1_score(y_val, y_pred), f1_score(y_val, y_pred2), f1_score(y_vald, prediction4), f1_score(y_val, y_pred3)],
    'Accuracy': [accuracy_score(y_val, prediction), accuracy_score(y_val, prediction2), accuracy_score(ys_val, prediction3), accuracy_score(y_val, y_pred), accuracy_score(y_val, y_pred2), accuracy_score(y_vald, prediction4), accuracy_score(y_val, y_pred3)]
}).sort_values('F1 Score', ascending=False)

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/cd628fc17b8493f12d6a2a3fb8fba1c1.png

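Comparison of the models. Image by the author.

The Catboost model with the variable duration is by far the best one. However, we cannot use that extra variable, since it will not be available to the managers until the call ends, so it makes no sense to keep it for prediction.

So the next best models are the tuned Catboost and the model with selected variables. Let's analyze the errors made by the tuned model. One approach I like is to create some histograms or density plots, so we can see where the errors are concentrated for each variable.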

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/b6910d951c13028165d87bfb5b126a99.png

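Errors distributed by variable. Image by the author.

Wrapping up this study, it is clear that the variables presented cannot provide a solid separation of the classes.

The imbalance is heavy, but the techniques used to correct it, such as class weights and SMOTE, were not enough to improve the class separation. This makes it hard for the model to find the right patterns to correctly classify the minority class 1 (converted customers) and perform better.

Because there are too many observations of customers who did not convert, the variability of the combinations labeled 0 is too large, overlaying and hiding class 1 within it. So the observations that fall in this "common" region have similar probabilities for both sides, and that is where the model fails. Because of the imbalance, these observations are misclassified, since the negative class has more strength and creates more bias.

Predictions

To predict results, the input data must be the same as the input provided during training. So I created a function to take care of that. Once again, this function can be found on GitHub.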

# Preparing data for predictions
X_test, y_test = prepare_data(df_test)

# Predict
test_prediction = model3.predict(X_test)

# confusion matrix
cm = pd.DataFrame(confusion_matrix(y_test, test_prediction)  )
print ("Confusion Matrix : \n")
display(cm)

# Evaluate the model
print('----------- Test Set Results -----------:')

print(classification_report(y_test, test_prediction))
print('-------------------------------')
print('F1:', f1_score(y_test, test_prediction))
print('-------------------------------')
print('Accuracy:', accuracy_score(y_test, test_prediction))

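The results are as expected, that is, in line with what we saw in training. False positives are slightly fewer than false negatives, which is better for our case. This prevents managers from mistakenly chasing customers who will not convert.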

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/a24cf6511bb206d79df78d79c55454f9.png

Results for the test set. Image by the author.
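The requirement that the prediction input match the training input mostly comes down to having the same columns in the same order. The article's `prepare_data` function is in the repository and may work differently; the sketch below only illustrates one common way to enforce the layout with pandas, using made-up data.

```python
import pandas as pd

# Made-up raw data standing in for training data and new data
train_raw = pd.DataFrame({"job": ["management", "technician", "retired"],
                          "campaign": [1, 2, 3]})
new_raw = pd.DataFrame({"job": ["student", "management"],
                        "campaign": [4, 1]})

# Columns the model saw during training
train_encoded = pd.get_dummies(train_raw, columns=["job"], dtype=int)
train_columns = train_encoded.columns

# Encode the new data, then force it into the training layout:
# unseen categories are dropped, missing columns are filled with 0
new_encoded = (pd.get_dummies(new_raw, columns=["job"], dtype=int)
               .reindex(columns=train_columns, fill_value=0))

print(new_encoded)
```

Storing `train_columns` next to the trained model makes the same alignment available at deployment time.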

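Finally, I also created a function to predict one observation at a time, already thinking about the deployed application. The code below predicts a single observation.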

obs = {'age': 37,
       'job': 'management',
       'marital': 'single',
       'education': 'tertiary',
       'default': 'no', #
       'balance': 100,
       'housing': 'yes', #
       'loan': 'no', #
       'contact': 'cellular', #
       'day': 2, #
       'month': 'aug', #
       'duration': np.nan,
       'campaign': 2, #
       'pdays': 272, #
       'previous': 10,
       'poutcome': 'success',
       'y':99}

# Prediction
predict_single_entry(obs)

----------
[OUT]:
array([[0.59514531, 0.40485469]])

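So there is a 59% probability that this customer will not convert. This exercise was interesting because, by manually changing one variable at a time, I could see which variables had a larger influence on the model. It turned out that the variables default, housing, loan, day, contact_cellular, contact_telephone, month, campaign and pdays caused the largest changes in the probability when modified.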
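The manual "change one variable at a time" exercise can also be automated. The sketch below is an illustration under assumptions: it uses a logistic regression on synthetic data rather than the CatBoost model and `predict_single_entry` from this project, and it shifts each feature by one standard deviation, which is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(n_samples=500, n_features=5,
                                     n_informative=3, n_redundant=0,
                                     random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Reference observation and its predicted probability of class 1
observation = X_demo[0].copy()
base_proba = clf.predict_proba(observation.reshape(1, -1))[0, 1]

# Shift one feature at a time by one standard deviation and record
# how much the predicted probability of class 1 moves
impact = {}
for j in range(X_demo.shape[1]):
    changed = observation.copy()
    changed[j] += X_demo[:, j].std()
    new_proba = clf.predict_proba(changed.reshape(1, -1))[0, 1]
    impact[f"feature_{j}"] = abs(new_proba - base_proba)

for name, delta in sorted(impact.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {delta:.3f}")
```

This is a local check around a single observation, so the ranking can change from one customer to another.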

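So I decided to create a simpler model using those variables. This is precisely the real value of the CRISP-DM framework: I was almost done with modeling when I noticed something new and went back to the start for another iteration.

This is the result.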

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/774b2068f15ec25434713bb54feef1e3.png

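Final model results for the test set. Image by the author.

This model is not only simpler, it also performs better. The gain is small, but when the results are similar the simpler model is better, because it requires less data, computing power and training time. Overall, it is a cheaper model.

Well, that is a wrap. Now let's move on to the final considerations.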