[Machine Learning] Decision Trees: Exploring Titanic Passenger Survival

This article walks through predicting the survival of Titanic passengers with a decision tree: the raw data is preprocessed (one-hot encoding and filling missing values), a decision tree model is built with sklearn, and its parameters are tuned to improve prediction accuracy on the test set.

As I am new to blogging and my knowledge and experience are limited, I would be grateful if readers point out any mistakes. I also hope to use this platform to keep study notes, so I can review old material and learn from it anew.

Lab: Exploring Titanic Passenger Survival with a Decision Tree

Getting Started

In the introductory project, you studied the Titanic survival data and were able to make predictions about passenger survival. In that project, you built a decision tree by hand that, at each stage, picked the feature most relevant to survival. Conveniently, this is exactly how decision trees work! In this lab, we will speed that process up considerably by implementing a decision tree in sklearn.
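As a quick refresher on what "most relevant feature" means here, below is a minimal sketch of the impurity calculation a decision tree uses to rank candidate splits. The function names and the toy arrays are illustrative, not part of the lab; the labels happen to match the five head() rows shown further down.

import numpy as np

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class probabilities
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_gain(labels, mask):
    # Impurity decrease from splitting the labels by a boolean mask
    n = len(labels)
    left, right = labels[mask], labels[~mask]
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted

# Toy example: survival outcomes of the first five passengers, split by sex.
# The split is perfect on this tiny sample, so the gain equals the full
# parent impurity: 1 - (0.4**2 + 0.6**2) = 0.48.
y = np.array([0, 1, 1, 1, 0])
is_female = np.array([False, True, True, True, False])
print(split_gain(y, is_female))  # 0.48

At every node, the tree computes this gain for each candidate feature and threshold and splits on the best one, which is the automated version of what you did by hand.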

We start by loading the dataset and displaying a few rows.

# Import libraries necessary for this project
import numpy as np
import pandas as pd
from IPython.display import display # Allows the use of display() for DataFrames

# Pretty display for notebooks
%matplotlib inline

# Set a random seed
import random
random.seed(42)

# Load the dataset
in_file = 'titanic_data.csv'
full_data = pd.read_csv(in_file)

# Print the first few entries of the RMS Titanic data
display(full_data.head())
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S

These are the features each passenger has:

  • Survived: survival outcome (0 = did not survive; 1 = survived)
  • Pclass: socio-economic class (1 = upper; 2 = middle; 3 = lower)
  • Name: the passenger's name
  • Sex: the passenger's sex
  • Age: the passenger's age (some entries are NaN)
  • SibSp: number of siblings and spouses aboard with the passenger
  • Parch: number of parents and children aboard with the passenger
  • Ticket: the passenger's ticket number
  • Fare: the fare the passenger paid
  • Cabin: the passenger's cabin number (some entries are NaN)
  • Embarked: the passenger's port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

Since we are interested in each passenger's survival, we can remove the Survived feature from this dataset and store it in a separate variable, outcomes. We will use these outcomes as our prediction targets.
Run the code cell below to remove Survived from the dataset and store it in outcomes.

# Store the 'Survived' feature in a new variable and remove it from the dataset
outcomes = full_data['Survived']
features_raw = full_data.drop('Survived', axis = 1)

# Show the new dataset with 'Survived' removed
display(features_raw.head())
   PassengerId  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4            5       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S

The same Titanic sample data now shows the DataFrame with the Survived feature removed. Note that features_raw (the passenger data) and outcomes (the survival outcomes) are now paired: for any passenger features_raw.loc[i], the corresponding survival outcome is outcomes[i].
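A quick sanity check of that pairing (my own addition, not part of the lab):

# The lengths must match, and row 0 should pair Mr. Braund with outcome 0
assert len(features_raw) == len(outcomes)
print(features_raw.loc[0, 'Name'], '->', outcomes[0])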

Preprocessing the Data

Now we preprocess the data. First, we one-hot encode the categorical features.

features = pd.get_dummies(features_raw)
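To see what get_dummies actually does, here is a tiny toy example (the frame below is made up for illustration, not taken from the lab): each categorical value becomes its own indicator column.

toy = pd.DataFrame({'Sex': ['male', 'female'], 'Embarked': ['S', 'C']})
print(pd.get_dummies(toy))
# Produces indicator columns Sex_female, Sex_male, Embarked_C, Embarked_S
# (0/1 integers in older pandas, booleans in recent versions)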

Next, we fill any blanks (NaN values) with 0.

features = features.fillna(0.0)
display(features.head())
   PassengerId  Pclass   Age  SibSp  Parch     Fare  Name_Abbing, Mr. Anthony  ...  Cabin_G6  Cabin_T  Embarked_C  Embarked_Q  Embarked_S
0            1       3  22.0      1      0   7.2500                         0  ...         0        0           0           0           1
1            2       1  38.0      1      0  71.2833                         0  ...         0        0           1           0           0
2            3       3  26.0      0      0   7.9250                         0  ...         0        0           0           0           1
3            4       1  35.0      1      0  53.1000                         0  ...         0        0           0           0           1
4            5       3  35.0      0      0   8.0500                         0  ...         0        0           0           0           1

5 rows × 1730 columns
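Filling Age with 0 is a blunt choice, since a missing age then looks like a newborn. As an alternative sketch (not what this lab does), one could impute the column median before encoding:

# Hypothetical alternative: median-impute Age, then one-hot encode as before
features_alt = features_raw.copy()
features_alt['Age'] = features_alt['Age'].fillna(features_alt['Age'].median())
features_alt = pd.get_dummies(features_alt).fillna(0.0)

For this lab we stick with the simple fillna(0.0) approach above.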

(TODO) Training the Model

Now we are ready to train a model in sklearn. First, we split the data into training and testing sets. Then we train the model on the training set.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, outcomes, test_size=0.2, random_state=42)
# Import the classifier from sklearn
from sklearn.tree import DecisionTreeClassifier

# TODO: Define the classifier, and fit it to the data
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

Testing the Model

Now let's see how the model performs. We compute the accuracy on both the training set and the testing set.

# Making predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Calculate the accuracy
from sklearn.metrics import accuracy_score
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)
print('The training accuracy is', train_accuracy)
print('The test accuracy is', test_accuracy)
The training accuracy is 1.0
The test accuracy is 0.810055865922

Exercise: Improving the Model

The training accuracy is very high, but the testing accuracy is lower, so we are probably overfitting a bit. (By default, the tree keeps splitting until every leaf is pure, which is why the training accuracy is a perfect 1.0.)

Now it's your turn! Train a new model, and try specifying some parameters to improve the testing accuracy, such as:

  • max_depth
  • min_samples_leaf
  • min_samples_split

You can use your intuition, trial and error, or even grid search (a sketch follows the challenge below)!

Challenge: Try to get 85% accuracy on the test set. If you need hints, take a look at the solutions notebook that follows.
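If you want to try grid search, here is a minimal sketch using sklearn's GridSearchCV; the parameter grid below is an arbitrary starting point I chose for illustration, not a recommended one.

from sklearn.model_selection import GridSearchCV

# Cross-validated search over a small grid of tree-shape parameters
param_grid = {
    'max_depth': [4, 6, 8, 10],
    'min_samples_leaf': [2, 6, 10],
    'min_samples_split': [2, 10, 20],
}
grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)

The solution below instead fixes one reasonable combination by hand: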

# Training the model
model = DecisionTreeClassifier(max_depth=6, min_samples_leaf=6, min_samples_split=10)
model.fit(X_train, y_train)

# Making predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Calculating accuracies
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

print('The training accuracy is', train_accuracy)
print('The test accuracy is', test_accuracy)
The training accuracy is 0.870786516854
The test accuracy is 0.854748603352