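The project and dataset come from Kaggle.
Work in progress; updates are ongoing.
1. Define the problem
Build a model that predicts whether a passenger survived.
2. Understand the data
Meaning of the data fields: survival (the Survived column in train.csv) is the target variable; all other fields are features.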
Variable | Definition | Key |
---|---|---|
survival | Survival | 0 = No, 1 = Yes |
pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
sex | Sex | |
Age | Age in years | |
sibsp | # of siblings / spouses aboard the Titanic | |
parch | # of parents / children aboard the Titanic | |
ticket | Ticket number | |
fare | Passenger fare | |
cabin | Cabin number | |
embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
# load libraries for analysis and visualization
import numpy as np
import pandas as pd
import re # Regular Expression operations
import matplotlib.pyplot as plt
%matplotlib inline
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
# inspect the data
train.head() # first 5 rows
train.sample(5) # 5 random rows
train.describe() # summary statistics of the numeric columns
train.dtypes # data type of each column
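Beyond head() and describe(), a per-column count of missing values is often a useful next look at the data. The sketch below is an assumption about how one might do this, not a step from the original notebook. It uses a small invented DataFrame named toy so that it runs without train.csv; the same two calls would apply to train.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train.csv; every value here is invented for illustration.
toy = pd.DataFrame({
    'Survived': [0, 1, 1, 0],
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Embarked': ['S', 'C', 'S', None],
})

# Number of missing values in each column, largest first.
missing_counts = toy.isnull().sum().sort_values(ascending=False)

# Share of missing values in each column (between 0 and 1).
missing_share = toy.isnull().mean()

print(missing_counts)
print(missing_share)
```

Columns with a large share of missing values may need imputation or removal during the cleaning step.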
# load seaborn for plotting
import seaborn as sns
f,ax = plt.subplots(3,4,figsize=(20,16))
# pass the column as x=...; recent seaborn versions no longer accept it positionally
sns.countplot(x='Pclass',data=train,ax=ax[0,0])
sns.countplot(x='Sex',data=train,ax=ax[0,1])
sns.boxplot(x='Pclass',y='Age',data=train,ax=ax[0,2])
# note: distplot is deprecated since seaborn 0.11; histplot is its replacement
sns.distplot(train['Fare'].dropna(),ax=ax[2,0],kde=False,color='b')
sns.countplot(x='Embarked',data=train,ax=ax[2,2])
sns.countplot(x='SibSp',hue='Survived',data=train,ax=ax[0,3],palette='husl')
sns.countplot(x='Parch',hue='Survived',data=train,ax=ax[1,3],palette='husl')
sns.countplot(x='Embarked',hue='Survived',data=train,ax=ax[2,3],palette='husl')
sns.countplot(x='Pclass',hue='Survived',data=train,ax=ax[1,0],palette='husl')
sns.countplot(x='Sex',hue='Survived',data=train,ax=ax[1,1],palette='husl')
# overlay the age histograms of non-survivors (red) and survivors (green)
sns.distplot(train[train['Survived']==0]['Age'].dropna(),ax=ax[1,2],kde=False,color='r',bins=5)
sns.distplot(train[train['Survived']==1]['Age'].dropna(),ax=ax[1,2],kde=False,color='g',bins=5)
sns.swarmplot(x='Pclass',y='Fare',hue='Survived',data=train,ax=ax[2,1],palette='husl')
ax[0,0].set_title('Total Passengers by Class')
ax[0,1].set_title('Total Passengers by Sex')
ax[0,2].set_title('Age boxplot by Class')
ax[0,3].set_title('Survival Counts by SibSp')
ax[1,0].set_title('Survival Counts by Pclass')
ax[1,1].set_title('Survival Counts by Sex')
ax[1,2].set_title('Age Distribution by Survival')
ax[1,3].set_title('Survival Counts by Parch')
ax[2,0].set_title('Fare Distribution')
ax[2,1].set_title('Fare by Pclass and Survival')
ax[2,2].set_title('Total Passengers by Embarked')
ax[2,3].set_title('Survival Counts by Embarked')
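The count plots above show raw passenger counts split by Survived, not proportions. If an actual survival rate per group is wanted, one possible approach (an assumption, not taken from the original notebook) is to average the 0/1 Survived column within each group. The DataFrame toy below is invented so that the sketch runs on its own; with the real data, train would take its place.

```python
import pandas as pd

# Toy stand-in for train; the rows are invented for illustration.
toy = pd.DataFrame({
    'Pclass': [1, 1, 2, 3, 3, 3],
    'Sex': ['female', 'male', 'female', 'male', 'male', 'female'],
    'Survived': [1, 0, 1, 0, 0, 1],
})

# Survived is coded 0/1, so its mean within a group is the share of survivors.
rate_by_class = toy.groupby('Pclass')['Survived'].mean()
rate_by_sex = toy.groupby('Sex')['Survived'].mean()

print(rate_by_class)
print(rate_by_sex)
```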
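3. Data cleaning
a. Find anomalies and outliers
# Detect anomalies. This dataset has no obvious anomalous values, so we detect outliers instead (points more than 1.5 IQR (interquartile range) below the first quartile or above the third quartile)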
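As a small worked example of the 1.5 IQR rule described above, the sketch below applies it to a single invented array. The array values and variable names are assumptions made for illustration only. It also shows one caveat: np.percentile returns NaN when the input contains NaN, whereas np.nanpercentile ignores missing values, which may matter for columns with gaps.

```python
import numpy as np

# Invented sample with one obviously extreme value.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100])

q1 = np.percentile(values, 25)   # first quartile
q3 = np.percentile(values, 75)   # third quartile
iqr = q3 - q1                    # interquartile range
lower = q1 - 1.5 * iqr           # lower fence
upper = q3 + 1.5 * iqr           # upper fence

# Points outside the fences are treated as outliers.
outliers = values[(values < lower) | (values > upper)]
print(q1, q3, lower, upper, outliers)

# Caveat: a NaN in the input propagates through np.percentile.
with_nan = np.array([1.0, 2.0, np.nan, 4.0])
nan_q1 = np.percentile(with_nan, 25)      # NaN
safe_q1 = np.nanpercentile(with_nan, 25)  # computed on the non-missing values
print(nan_q1, safe_q1)
```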
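'''
Outlier detection function
Input: the dataset df, the maximum number of outlier features n, the list of feature names
Output: indices of the samples that are outliers in more than n features
'''
# import Counter from collections, used to count the items of a list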
from collections import Counter
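To illustrate how Counter can turn a flat list of flagged row indices into "rows flagged for more than n features", here is a minimal sketch. The list flagged, the threshold value, and the name multiple are hypothetical and only stand in for whatever the function collects.

```python
from collections import Counter

# Hypothetical row indices, one entry each time a row was flagged for a feature.
flagged = [3, 7, 7, 12, 3, 7]

# Counter maps each index to the number of times it appears in the list.
counts = Counter(flagged)

# Keep the indices flagged for more than n features (n is an assumed threshold).
n = 1
multiple = [idx for idx, c in counts.items() if c > n]

print(counts)
print(multiple)
```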
def detect_outliers(df,n,feature):
    outlier_indices = []
    for f in feature:
        # 1st quartile (25%)
        Q1=np.percentile(df[f],25)
        # 3rd quartile (75%)
        Q3=np.percentile(df[f],75)
        # interquartile range (IQR)
        IQR=Q3-Q1
        outlier_step=1.5*IQR
        # indices of the samples that are outliers for this feature
        outlier_list_col = df[(df[f]<(Q1-outlier_step))|(df[f]>(Q3+outlier_step))].index
        # add them to the running list of outlier sample indices
        outlier_indices.extend(outlier_list_col)
    # count the list: builds a dict whose keys are the counted indices and whose values are their counts
outlier_