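Titanic Disaster is a classic beginner problem on Kaggle. I was studying decision trees, so I picked a worked example to learn from. I am a complete beginner, so I annotated almost every line.
Original link: Titanic: Simple Decision Tree model score (Top 3%) | Kaggle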
Contents
1. Preprocessing and EDA #preprocessing and exploratory data analysis
2.1. SibSp and Parch column #siblings/spouses and parents/children
2.4. Woman or Child column #women and children
2.4 Family Survived Rate column #family survival rate
Titanic Disaster
Improve your score to 82.78% (Top 3%)
In this work I used some basic techniques to process the Titanic dataset in an easy way.
1. Preprocessing and EDA #preprocessing and exploratory data analysis
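Here I reviewed the variables, imputed missing values, found patterns, and examined the relationships between columns.
#This first part inspects the variables, fills in missing values, and uses the observed relationships between columns for feature engineering.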
1.1. Missing Values
Read the datasets and merge the train and test sets to get better results.
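# Libraries used
import numpy as np
#fast numerical library, mainly used for array computation
import pandas as pd
#toolkit for analyzing structured data, built on NumPy
import seaborn as sns
#visualization library, a higher-level wrapper around matplotlib
import matplotlib.pyplot as plt
#visualization library
#machine learning library
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from numpy.random import seed
seed(11111)
#random seed: makes NumPy's random numbers the same on every run, so the program is reproducible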
# Reading
#read the training set and the test set
train = pd.read_csv("../input/titanic/train.csv")
test = pd.read_csv("../input/titanic/test.csv")
# Put an index on each dataset before merging and later splitting them
#set the PassengerId column as the index
train = train.set_index("PassengerId")
test = test.set_index("PassengerId")
# dataframe
#concatenate the two DataFrame objects vertically: axis=0 stacks rows, sort=False keeps the column order as it is instead of re-sorting it
df = pd.concat([train, test], axis=0, sort=False)
#display df
df
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 2 | 1.0 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1.0 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 4 | 1.0 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 5 | 0.0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1305 | NaN | 3 | Spector, Mr. Woolf | male | NaN | 0 | 0 | A.5. 3236 | 8.0500 | NaN | S |
| 1306 | NaN | 1 | Oliva y Ocana, Dona. Fermina | female | 39.0 | 0 | 0 | PC 17758 | 108.9000 | C105 | C |
| 1307 | NaN | 3 | Saether, Mr. Simon Sivertsen | male | 38.5 | 0 | 0 | SOTON/O.Q. 3101262 | 7.2500 | NaN | S |
| 1308 | NaN | 3 | Ware, Mr. Frederick | male | NaN | 0 | 0 | 359309 | 8.0500 | NaN | S |
| 1309 | NaN | 3 | Peter, Master. Michael J | male | NaN | 1 | 1 | 2668 | 22.3583 | NaN | C |
1309 rows × 11 columns
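The merged table above shows `Survived` as `0.0`/`1.0` for training rows and `NaN` for test rows. The following is a minimal sketch of why that happens and how the merged frame could be split back later. The two miniature frames and their values are made up for illustration, and the names `train_toy`, `test_toy`, `merged`, `train_back`, and `test_back` are hypothetical; the split-by-`NaN` step is an assumption, not necessarily what the original notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-ins for train.csv and test.csv (made-up values).
train_toy = pd.DataFrame(
    {"PassengerId": [1, 2, 3], "Survived": [0, 1, 1], "Pclass": [3, 1, 3]}
).set_index("PassengerId")
test_toy = pd.DataFrame(
    {"PassengerId": [4, 5], "Pclass": [2, 3]}
).set_index("PassengerId")

# Stack the rows; the test rows have no Survived column, so they get NaN there.
merged = pd.concat([train_toy, test_toy], axis=0, sort=False)

# NaN cannot be stored in an integer column, so Survived becomes float64.
print(merged.dtypes)

# One possible way to recover the two parts: test rows are the unlabeled ones.
train_back = merged[merged["Survived"].notna()]
test_back = merged[merged["Survived"].isna()].drop(columns="Survived")
print(train_back.index.tolist(), test_back.index.tolist())
```

Because `PassengerId` is the index, the identifiers survive the round trip, which is convenient when a submission file has to be keyed by passenger.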
As you can see, the Name, Sex, Ticket, Cabin, and Embarked columns are objects. Before processing each column, we should check whether it contains NAs or missing values.
#look at the overall summary of the data first to judge whether any values are missing
df.info()
#.info() prints a concise summary of the DataFrame
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 1 to 1309
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Survived  891 non-null    float64
 1   Pclass    1309 non-null   int64
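The non-null counts from `info()` can also be turned into an explicit per-column count of missing values. The sketch below does this with `isna()` on a small made-up frame; the frame `toy`, its values, and the names `missing_count`, `missing_share`, and `summary` are hypothetical and only illustrate the idea, they are not taken from the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps (made-up values, not the real dataset).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.2833, 7.925, 8.05],
})

# isna() marks missing cells; sum() counts them and mean() gives the fraction.
missing_count = toy.isna().sum()
missing_share = toy.isna().mean()

# Keep only the columns that actually have gaps, worst first.
summary = pd.DataFrame({"missing": missing_count, "share": missing_share})
summary = summary[summary["missing"] > 0].sort_values("missing", ascending=False)
print(summary)
```

Applied to the merged frame, the same two lines would be `df.isna().sum()` and `df.isna().mean()`; the `Survived` column would then report the unlabeled test rows as missing, which is expected rather than a data problem.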

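This article uses preprocessing and feature engineering with a decision tree model to predict the survival of Titanic passengers, reaching a Top 3% score in the Kaggle competition.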



