Kaggle Titanic: Decision Tree Top 3%, Beginner-Level Code Walkthrough

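This post uses preprocessing and feature engineering with a decision tree model to predict the survival of Titanic passengers, reaching a Top 3% score in the Kaggle competition.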

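Titanic Disaster is a classic beginner problem on Kaggle. I was studying decision trees, so I picked a worked example to learn from. I am a complete beginner, so I annotated almost every line.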

Original notebook: Titanic: Simple Decision Tree model score (Top 3%) | Kaggle

Contents

1. Preprocessing and EDA (exploratory data analysis)

1.1. Missing Values

1.3. Fare column

1.4. Embarked column

1.5. Cabin column

2. Feature Extraction (feature engineering)

2.1. SibSp and Parch column (siblings/spouses and parents/children)

2.2. Ticket column

2.3. Name Column

2.4. Woman or Child column

2.5. Family Survived Rate column

3. Modeling

4. Conclusions

5. References


Titanic Disaster

Improve your score to 82.78% (Top 3%)

In this work I use some basic techniques to process the Titanic dataset in an easy way.

1. Preprocessing and EDA

Here I review the variables, impute missing values, look for patterns, and examine the relationships between columns in preparation for feature engineering.

1.1. Missing Values

Read the datasets and merge Train and Test, so that missing-value imputation and feature engineering are applied to both sets in the same way.

# Libraries used

# Fast numerical library, mainly used for array computation
import numpy as np

# Toolkit for analysing structured data, built on NumPy
import pandas as pd

# Visualization library, a higher-level wrapper around matplotlib
import seaborn as sns

# Visualization library
import matplotlib.pyplot as plt

# Machine learning library (scikit-learn)
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler

from numpy.random import seed

# Fix the random seed so the random numbers are the same on every run,
# which keeps the results reproducible
seed(11111)

# Read the training and test sets
train = pd.read_csv("../input/titanic/train.csv")
test = pd.read_csv("../input/titanic/test.csv")

# Set the PassengerId column as the index of each dataset before merging them
train = train.set_index("PassengerId")
test = test.set_index("PassengerId")

# Stack the two DataFrames vertically: axis=0 concatenates rows,
# sort=False keeps the original column order
df = pd.concat([train, test], axis=0, sort=False)

# Display df
df
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
1 | 0.0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S
2 | 1.0 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C
3 | 1.0 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S
4 | 1.0 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S
5 | 0.0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
1305 | NaN | 3 | Spector, Mr. Woolf | male | NaN | 0 | 0 | A.5. 3236 | 8.0500 | NaN | S
1306 | NaN | 1 | Oliva y Ocana, Dona. Fermina | female | 39.0 | 0 | 0 | PC 17758 | 108.9000 | C105 | C
1307 | NaN | 3 | Saether, Mr. Simon Sivertsen | male | 38.5 | 0 | 0 | SOTON/O.Q. 3101262 | 7.2500 | NaN | S
1308 | NaN | 3 | Ware, Mr. Frederick | male | NaN | 0 | 0 | 359309 | 8.0500 | NaN | S
1309 | NaN | 3 | Peter, Master. Michael J | male | NaN | 1 | 1 | 2668 | 22.3583 | NaN | C

1309 rows × 11 columns

#1309 rows (891 training rows with a Survived label plus 418 unlabelled test rows), 11 columns
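The test rows show NaN in Survived because test.csv has no such column. The sketch below illustrates this on two tiny made-up frames (the values are hypothetical, not taken from the real CSV files) and shows one possible way to split the merged frame back into its two parts later. It is only an illustration of how pd.concat behaves, not necessarily how the original notebook performs the split.

```python
import pandas as pd

# Toy stand-ins for train.csv and test.csv (hypothetical values, not the real data)
train = pd.DataFrame(
    {"PassengerId": [1, 2, 3], "Survived": [0, 1, 1], "Pclass": [3, 1, 3]}
).set_index("PassengerId")
test = pd.DataFrame(
    {"PassengerId": [4, 5], "Pclass": [1, 3]}
).set_index("PassengerId")

# Stack the rows; Survived is absent from `test`, so those rows get NaN
df = pd.concat([train, test], axis=0, sort=False)

# Survived is now float64, because NaN cannot be stored in an int64 column
print(df["Survived"].dtype)

# One way to recover the two parts: labelled rows are train, unlabelled rows are test
train_part = df[df["Survived"].notna()]
test_part = df[df["Survived"].isna()].drop(columns="Survived")
print(len(train_part), len(test_part))
```

This also explains why Survived is displayed as 0.0 and 1.0 rather than 0 and 1 in the merged table.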

As you can see, the Name, Sex, Ticket, Cabin, and Embarked columns are of object dtype. Before processing each column, we should look at the overall summary of the data to see whether there are NAs or missing values.

# .info() prints a concise summary of the DataFrame
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 1 to 1309
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    float64
 1   Pclass    1309 non-null   int64
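The Non-Null Count column of this summary is what reveals missing values: any column with fewer non-null entries than the total row count has gaps. As a complementary check, the missing entries can be counted directly. The following is a minimal sketch on a small made-up frame (hypothetical values that only mimic a few Titanic columns); the names missing and cols_with_missing are illustrative and do not come from the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values, mimicking a few Titanic columns
df = pd.DataFrame(
    {
        "Survived": [0.0, 1.0, np.nan, np.nan],
        "Age": [22.0, np.nan, 39.0, np.nan],
        "Cabin": [None, "C85", None, None],
        "Sex": ["male", "female", "female", "male"],
    }
)

# Number of missing entries in each column
missing = df.isnull().sum()
print(missing)

# Names of the columns that contain at least one missing value
cols_with_missing = missing[missing > 0].index.tolist()
print(cols_with_missing)
```

On the real merged frame, Survived would always appear in this list, since the test rows carry no label.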