Kaggle Competition: A Simple Case Study of Titanic Disaster Data Analysis

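This article walks through the Titanic survival prediction task from a Kaggle competition, covering the whole process from data acquisition through cleaning and restructuring to analysis and machine-learning modeling. The data is processed in Python, and predictions are made with the KNN algorithm.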


https://www.kaggle.com/c/titanic

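  • Goal: predict whether unseen passengers survived, based on the available data
  • Data preparation
    • Data acquisition: load the training and test CSV files
    • Data cleaning: fill in or drop missing values, and convert data types (strings to numbers)
    • Data restructuring: reshape the data as needed (regroup columns, build new features)
  • Data analysis
    • Descriptive analysis: plots and visual inspection
    • Exploratory analysis: machine-learning models
  • Output: upload a CSV file to get an accuracy score and a ranking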

Loading libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Data acquisition

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
train.head() # show the first few rows
   PassengerId  Survived  Pclass  Name                                             Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                          male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th…  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                           female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)     female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                         male    35.0  0      0      373450            8.0500   NaN    S
test.head() # show the first few rows
   PassengerId  Pclass  Name                                          Sex     Age   SibSp  Parch  Ticket   Fare     Cabin  Embarked
0  892          3       Kelly, Mr. James                              male    34.5  0      0      330911   7.8292   NaN    Q
1  893          3       Wilkes, Mrs. James (Ellen Needs)              female  47.0  1      0      363272   7.0000   NaN    S
2  894          2       Myles, Mr. Thomas Francis                     male    62.0  0      0      240276   9.6875   NaN    Q
3  895          3       Wirz, Mr. Albert                              male    27.0  0      0      315154   8.6625   NaN    S
4  896          3       Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female  22.0  1      1      3101298  12.2875  NaN    S

Data overview

train.shape, test.shape # number of rows and columns
((891, 12), (418, 11))
train.info() # per-column details
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB

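train.csv field descriptions

  • PassengerId: passenger ID
  • Survived: whether the passenger survived (0 = died, 1 = survived)
  • Pclass: passenger class (1 = Upper, 2 = Middle, 3 = Lower)
  • Name: passenger name, object
  • Sex: sex, object
  • Age: age, 177 values missing
  • SibSp: number of siblings and spouses
  • Parch: number of parents and children
  • Ticket: ticket number, object
  • Fare: ticket fare
  • Cabin: cabin number, object, 687 values missing
  • Embarked: port of embarkation, object, 2 values missing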
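
The missing-value counts in the list above can also be computed directly instead of being read off `info()`. A minimal sketch, assuming a small made-up frame (`demo`, not part of the original data) standing in for `train`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train.csv; the values are made up for illustration.
demo = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Embarked': ['S', 'C', None, 'S'],
})

# isnull() marks missing cells; summing the booleans counts them per column.
missing = demo.isnull().sum()
print(missing)
```

On the real data, `train.isnull().sum()` should agree with 891 minus the non-null counts printed by `train.info()`.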
train.head() # head() shows the first few rows; evaluating train on its own displays the whole table



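Data cleaning

Drop columns that are irrelevant or have too many missing values

# .loc selects by label; in .loc[rows, columns] the part before the comma selects rows and the part after selects columns
# .copy() makes the subsets independent frames, so later assignments cannot trigger SettingWithCopyWarning
train2 = train.loc[:,['PassengerId','Survived','Pclass','Sex','Age','SibSp','Parch','Fare']].copy()
test2 = test.loc[:, ['PassengerId','Pclass','Sex','Age','SibSp','Parch','Fare']].copy()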
train2.head()
   PassengerId  Survived  Pclass     Sex   Age  SibSp  Parch     Fare
0            1         0       3    male  22.0      1      0   7.2500
1            2         1       1  female  38.0      1      0  71.2833
2            3         1       3  female  26.0      0      0   7.9250
3            4         1       1  female  35.0      1      0  53.1000
4            5         0       3    male  35.0      0      0   8.0500
test2.head()
   PassengerId  Pclass     Sex   Age  SibSp  Parch     Fare
0          892       3    male  34.5      0      0   7.8292
1          893       3  female  47.0      1      0   7.0000
2          894       2    male  62.0      0      0   9.6875
3          895       3    male  27.0      0      0   8.6625
4          896       3  female  22.0      1      1  12.2875
train2.info(), test2.info()
test2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 7 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Fare           417 non-null float64
dtypes: float64(2), int64(4), object(1)
memory usage: 22.9+ KB


Fill missing ages

age = train2['Age'].median() # median age
age
28.0
train2['Age'].isnull() # boolean mask, True where Age is missing
0      False
1      False
2      False
3      False
4      False
5       True
6      False
7      False
8      False
9      False
10     False
11     False
12     False
13     False
14     False
15     False
16     False
17      True
18     False
19      True
20     False
21     False
22     False
23     False
24     False
25     False
26      True
27     False
28      True
29      True
       ...  
861    False
862    False
863     True
864    False
865    False
866    False
867    False
868     True
869    False
870    False
871    False
872    False
873    False
874    False
875    False
876    False
877    False
878     True
879    False
880    False
881    False
882    False
883    False
884    False
885    False
886    False
887    False
888     True
889    False
890    False
Name: Age, Length: 891, dtype: bool
train2.loc[train2['Age'].isnull(), 'Age'] = age # fill missing ages in train2 with the median
train2.info()
test2.loc[test2['Age'].isnull(), 'Age'] = age # fill missing ages in test2 with the same training-set median
test2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 7 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Sex            418 non-null object
Age            418 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Fare           417 non-null float64
dtypes: float64(2), int64(4), object(1)
memory usage: 22.9+ KB
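
The same median fill can be written with `fillna`. A minimal sketch on made-up data (`train_demo` and `test_demo` are hypothetical stand-ins for `train2` and `test2`), keeping the choice made above of reusing the training-set median for the test set:

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for train2 and test2.
train_demo = pd.DataFrame({'Age': [20.0, np.nan, 30.0, 40.0]})
test_demo = pd.DataFrame({'Age': [np.nan, 50.0]})

# The median is taken from the training set only (NaN is skipped by default),
# then reused to fill the gaps in both frames.
train_median = train_demo['Age'].median()
train_demo['Age'] = train_demo['Age'].fillna(train_median)
test_demo['Age'] = test_demo['Age'].fillna(train_median)

print(train_demo['Age'].tolist(), test_demo['Age'].tolist())
```

Taking the statistic from the training set alone keeps the test set from influencing preprocessing.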


Fill the missing fare

# Fill the missing Fare in test2 with the mode (the most frequent value)

Fare = test2['Fare'].mode()
Fare

test2.loc[test2['Fare'].isnull(),'Fare'] = Fare[0]

train2.info(),test2.info()
train2.head()
   PassengerId  Survived  Pclass     Sex   Age  SibSp  Parch     Fare
0            1         0       3    male  22.0      1      0   7.2500
1            2         1       1  female  38.0      1      0  71.2833
2            3         1       3  female  26.0      0      0   7.9250
3            4         1       1  female  35.0      1      0  53.1000
4            5         0       3    male  35.0      0      0   8.0500
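
The `[0]` in the fare fill is needed because `mode()` returns a Series rather than a single number (there can be ties). A minimal sketch on made-up fares:

```python
import numpy as np
import pandas as pd

fares = pd.Series([7.75, 8.05, 7.75, np.nan, 26.0])  # made-up fares

modes = fares.mode()          # a Series of the most frequent values, sorted; NaN is ignored
most_common = modes.iloc[0]   # first entry, i.e. the smallest value if several tie
filled = fares.fillna(most_common)

print(filled.tolist())
```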


Data type conversion

train2.dtypes,test2.dtypes # column data types
(PassengerId      int64
 Survived         int64
 Pclass           int64
 Sex             object
 Age            float64
 SibSp            int64
 Parch            int64
 Fare           float64
 dtype: object, PassengerId      int64
 Pclass           int64
 Sex             object
 Age            float64
 SibSp            int64
 Parch            int64
 Fare           float64
 dtype: object)

Convert sex to integers

train2['Sex'] = train2['Sex'].map({'female':0, 'male':1}).astype(int)
test2['Sex'] = test2['Sex'].map({'female': 0, 'male': 1}).astype(int)
train2.head()
   PassengerId  Survived  Pclass  Sex   Age  SibSp  Parch     Fare
0            1         0       3    1  22.0      1      0   7.2500
1            2         1       1    0  38.0      1      0  71.2833
2            3         1       3    0  26.0      0      0   7.9250
3            4         1       1    0  35.0      1      0  53.1000
4            5         0       3    1  35.0      0      0   8.0500
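
One thing to watch with `map`: any label missing from the dictionary becomes NaN, and the `astype(int)` that follows then raises an error. A minimal sketch of a check that could be run first, using a made-up Series with a deliberate typo:

```python
import pandas as pd

sex = pd.Series(['male', 'female', 'Female'])  # 'Female' is a deliberate typo
codes = sex.map({'female': 0, 'male': 1})

# Labels missing from the dict turn into NaN; list them before casting to int.
unmapped = sex[codes.isnull()].unique().tolist()
print(unmapped)
```

If `unmapped` is empty, the cast to `int` is safe.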


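Data restructuring

Build two new features from SibSp and Parch

  • Total family size: familysize
  • Whether the passenger is travelling alone: isalone
train2.loc[:,'SibSp'] # number of siblings and spouses
train2.loc[:,'Parch'] # number of parents and children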

train2['familysize'] = train2.loc[:,'SibSp'] + train2.loc[:,'Parch'] + 1
test2['familysize'] = test2.loc[:,'SibSp'] + test2.loc[:,'Parch'] + 1
train2.head()
   PassengerId  Survived  Pclass  Sex   Age  SibSp  Parch     Fare  familysize
0            1         0       3    1  22.0      1      0   7.2500           2
1            2         1       1    0  38.0      1      0  71.2833           2
2            3         1       3    0  26.0      0      0   7.9250           1
3            4         1       1    0  35.0      1      0  53.1000           2
4            5         0       3    1  35.0      0      0   8.0500           1
train2['isalone'] = 0
train2.loc[train2['familysize'] == 1,'isalone'] = 1
# Build the same flag on the test set; otherwise test3 below has no usable isalone column
test2['isalone'] = 0
test2.loc[test2['familysize'] == 1,'isalone'] = 1
train2.head()
   PassengerId  Survived  Pclass  Sex   Age  SibSp  Parch     Fare  familysize  isalone
0            1         0       3    1  22.0      1      0   7.2500           2        0
1            2         1       1    0  38.0      1      0  71.2833           2        0
2            3         1       3    0  26.0      0      0   7.9250           1        1
3            4         1       1    0  35.0      1      0  53.1000           2        0
4            5         0       3    1  35.0      0      0   8.0500           1        1


Final data after restructuring

train3 = train2.loc[:,['PassengerId','Survived','Pclass','Sex','Age','Fare','familysize','isalone']]
train3.head()
test3 = test2.loc[:,['PassengerId','Pclass','Sex','Age','Fare','familysize','isalone']]
test3.head()
   PassengerId  Pclass  Sex   Age     Fare  familysize  isalone
0          892       3    1  34.5   7.8292           1        1
1          893       3    0  47.0   7.0000           2        0
2          894       2    1  62.0   9.6875           1        1
3          895       3    1  27.0   8.6625           1        1
4          896       3    0  22.0  12.2875           3        0
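
Because both features are needed on the training set and the test set alike, one option is to wrap them in a single helper and call it on each frame, so the two never drift apart. A minimal sketch (`add_family_features` is a hypothetical name, not something from the original post):

```python
import pandas as pd

def add_family_features(df):
    """Return a copy of df with familysize and isalone columns added."""
    out = df.copy()
    out['familysize'] = out['SibSp'] + out['Parch'] + 1    # +1 counts the passenger
    out['isalone'] = (out['familysize'] == 1).astype(int)  # 1 if travelling alone
    return out

# Made-up rows for illustration.
demo = add_family_features(pd.DataFrame({'SibSp': [1, 0, 1], 'Parch': [0, 0, 1]}))
print(demo)
```

With the real data this would be called once as `add_family_features(train2)` and once as `add_family_features(test2)`.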




Data analysis

Descriptive analysis

# Survival rate by isalone
d = train3[['isalone', 'Survived']].groupby(['isalone']).mean()
d
# d.loc[0,'Survived']
         Survived
isalone
0        0.505650
1        0.303538
# Death rate: travelling alone vs. not alone

plt.bar(
    [0,1],
    [1-d.loc[0,'Survived'],1-d.loc[1,'Survived']],
    0.5,
    color='r',
    alpha=0.5,
)

plt.xticks([0,1],['notalone','alone'])

plt.show()

[Figure: bar chart of death rate, notalone vs. alone]

# Survival rate by sex (0 = female, 1 = male)
n = train3[['Sex', 'Survived']].groupby(['Sex']).mean()
n
     Survived
Sex
0    0.742038
1    0.188908
# Bar chart of death rate by sex

plt.bar(
    [0,1],
    [1-n.loc[0,'Survived'],1-n.loc[1,'Survived']],
    0.5,
    color='g',
    alpha=0.7
)

plt.xticks([0,1],['female','male'])

plt.show()

[Figure: bar chart of death rate, female vs. male]

# Survival rate by passenger class
c = train3[['Pclass', 'Survived']].groupby(['Pclass']).mean()
c
        Survived
Pclass
1       0.629630
2       0.472826
3       0.242363
# Bar chart of death rate for the three passenger classes

plt.bar(
    [0,1,2],
    [1-c.loc[1,'Survived'],1-c.loc[2,'Survived'],1-c.loc[3,'Survived']],
    0.5,
    color='b',
    alpha=0.7
)

plt.xticks([0,1,2],[1,2,3])

plt.show()

[Figure: bar chart of death rate by passenger class 1, 2, 3]
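
The three charts above repeat one pattern: group by a column, take the mean of Survived, and subtract it from 1. A minimal sketch of that pattern as a reusable helper (`death_rate_by` is a hypothetical name; the rows are made up):

```python
import pandas as pd

def death_rate_by(df, column):
    """Death rate (1 - mean of Survived) for each value of `column`."""
    return 1 - df.groupby(column)['Survived'].mean()

# Made-up rows with the same column names as train3.
demo = pd.DataFrame({
    'Sex': [0, 0, 1, 1, 1, 1],
    'Survived': [1, 1, 0, 0, 0, 1],
})
rates = death_rate_by(demo, 'Sex')
print(rates)
```

The returned Series is indexed by the group values, so its index and values can be passed to `plt.bar` in the same way as above.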

# Survival rate by age
age = train3[['Age', 'Survived']].groupby(['Age']).mean()
age
       Survived
Age
0.42   1.000000
0.67   1.000000
0.75   1.000000
0.83   1.000000
0.92   1.000000
1.00   0.714286
2.00   0.300000
3.00   0.833333
4.00   0.700000
5.00   1.000000
6.00   0.666667
7.00   0.333333
8.00   0.500000
9.00   0.250000
10.00  0.000000
11.00  0.250000
12.00  1.000000
13.00  1.000000
14.00  0.500000
14.50  0.000000
15.00  0.800000
16.00  0.352941
17.00  0.461538
18.00  0.346154
19.00  0.360000
20.00  0.200000
20.50  0.000000
21.00  0.208333
22.00  0.407407
23.00  0.333333
...    ...
44.00  0.333333
45.00  0.416667
45.50  0.000000
46.00  0.000000
47.00  0.111111
48.00  0.666667
49.00  0.666667
50.00  0.500000
51.00  0.285714
52.00  0.500000
53.00  1.000000
54.00  0.375000
55.00  0.500000
55.50  0.000000
56.00  0.500000
57.00  0.000000
58.00  0.600000
59.00  0.000000
60.00  0.500000
61.00  0.000000
62.00  0.500000
63.00  1.000000
64.00  0.000000
65.00  0.000000
66.00  0.000000
70.00  0.000000
70.50  0.000000
71.00  0.000000
74.00  0.000000
80.00  1.000000

88 rows × 1 columns

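# Bar chart of survival rate by age

plt.figure(2, figsize=(20,5))
plt.bar(
    age.index,
    age['Survived'], # 1-D bar heights; age.values would be a 2-D (88, 1) array
    0.5,
    color='r',
    alpha=0.7
)
# plt.axis([0,80,0,20])
plt.xticks(age.index,rotation=90)

plt.show()

[Figure: bar chart of survival rate by age]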
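
Grouping by exact age leaves many groups with only a handful of passengers, which is why so many bars sit at exactly 0 or 1. One possible refinement is to bin ages before grouping. A minimal sketch with `pd.cut`; the bin edges, labels, and rows are assumptions made up for illustration, not values from the original post:

```python
import pandas as pd

# Made-up rows with the same column names as train3.
demo = pd.DataFrame({
    'Age': [0.42, 5.0, 16.0, 28.0, 35.0, 62.0, 80.0],
    'Survived': [1, 1, 0, 0, 1, 0, 1],
})

# Assumed bin edges; intervals are right-closed: (0, 12], (12, 18], (18, 60], (60, 80].
bins = [0, 12, 18, 60, 80]
labels = ['child', 'teen', 'adult', 'senior']
demo['age_group'] = pd.cut(demo['Age'], bins=bins, labels=labels)

# Mean of Survived per age band, in the order of the labels.
by_group = demo.groupby('age_group', observed=False)['Survived'].mean()
print(by_group)
```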

# Survival rate by fare
fare = train3[['Fare', 'Survived']].groupby(['Fare']).mean()
fare
          Survived
Fare
0.0000    0.066667
4.0125    0.000000
5.0000    0.000000
6.2375    0.000000
6.4375    0.000000
6.4500    0.000000
6.4958    0.000000
6.7500    0.000000
6.8583    0.000000
6.9500    0.000000
6.9750    0.500000
7.0458    0.000000
7.0500    0.000000
7.0542    0.000000
7.1250    0.000000
7.1417    1.000000
7.2250    0.250000
7.2292    0.266667
7.2500    0.076923
7.3125    0.000000
7.4958    0.333333
7.5208    0.000000
7.5500    0.250000
7.6292    0.000000
7.6500    0.250000
7.7250    0.000000
7.7292    0.000000
7.7333    0.500000
7.7375    0.500000
7.7417    0.000000
...       ...
80.0000   1.000000
81.8583   1.000000
82.1708   0.500000
83.1583   1.000000
83.4750   0.500000
86.5000   1.000000
89.1042   1.000000
90.0000   0.750000
91.0792   1.000000
93.5000   1.000000
106.4250  0.500000
108.9000  0.500000
110.8833  0.750000
113.2750  0.666667
120.0000  1.000000
133.6500  1.000000
134.5000  1.000000
135.6333  0.666667
146.5208  1.000000
151.5500  0.500000
153.4625  0.666667
164.8667  1.000000
211.3375  1.000000
211.5000  0.000000
221.7792  0.000000
227.5250  0.750000
247.5208  0.500000
262.3750  1.000000
263.0000  0.500000
512.3292  1.000000

248 rows × 1 columns

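plt.figure(2, figsize=(20,5))
plt.bar(
    fare.index,
    fare['Survived'], # 1-D bar heights; fare.values would be a 2-D (248, 1) array
    0.5,
    color='r',
    alpha=0.7
)
# plt.axis([0,80,0,20])
plt.xticks(fare.index,rotation=90)

plt.show()

[Figure: bar chart of survival rate by fare]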



Drawing conclusions

# About 70% of passengers travelling alone died
jieguo = pd.DataFrame(np.arange(0,418),index=test3.loc[:,'PassengerId'])
jieguo.loc[:,0] = 1 # default: everyone survives
jieguo.head()
             0
PassengerId
892          1
893          1
894          1
895          1
896          1
# Passengers travelling alone (familysize == 1) are predicted to die
alone_ids = test3.loc[test3['familysize'] == 1, 'PassengerId'].values
jieguo.loc[alone_ids] = 0
jieguo.head()
             0
PassengerId
892          0
893          1
894          0
895          0
896          1



Exporting the results

jieguo.to_csv('isalone.csv')
# Rule: all men die, all women survive, everyone in third class dies
new3 = pd.DataFrame(np.arange(0,418),index=test3.loc[:,'PassengerId'].values)
new3[0] = 0 # default: everyone dies
new3.head()
     0
892  0
893  0
894  0
895  0
896  0
new3.loc[test3[test3.loc[:,'Sex'] == 0].loc[:,'PassengerId'].values] = 1 # women survive
new3.head()
     0
892  0
893  1
894  0
895  0
896  1
new3.loc[test2[test2.loc[:,'Pclass'] == 3].loc[:,'PassengerId'].values] = 0 # third class dies
new3.head()
     0
892  0
893  0
894  0
895  0
896  0
# Write to csv for upload
new3.to_csv('cangwei-xingbie.csv') # rule: all men die, all women survive, everyone in third class dies
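
The three successive assignments amount to a single condition: a passenger is predicted to survive only if female and not in third class. A minimal sketch of the same rule as one boolean expression, on made-up rows that reuse the column names of `test3` (`demo` and `rule` are hypothetical names):

```python
import pandas as pd

# Made-up rows with the same column names as test3.
demo = pd.DataFrame({
    'PassengerId': [892, 893, 894, 895],
    'Pclass': [3, 3, 2, 1],
    'Sex': [1, 0, 0, 0],   # 0 = female, 1 = male
})

# Survive only if female AND not in third class; everyone else is predicted dead.
survived = ((demo['Sex'] == 0) & (demo['Pclass'] != 3)).astype(int)
rule = pd.DataFrame({'PassengerId': demo['PassengerId'], 'Survived': survived})
print(rule)
```

Writing the rule as one expression makes the order of the overrides explicit: third class wins over sex.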


Machine-learning modeling

train3.head()
   PassengerId  Survived  Pclass  Sex   Age     Fare  familysize  isalone
0            1         0       3    1  22.0   7.2500           2        0
1            2         1       1    0  38.0  71.2833           2        0
2            3         1       3    0  26.0   7.9250           1        1
3            4         1       1    0  35.0  53.1000           2        0
4            5         0       3    1  35.0   8.0500           1        1
from sklearn import neighbors
x = train3.loc[:,['Pclass','Sex','familysize']]
y = train3.loc[:,'Survived'] # target: 1 = survived, 0 = died

clf = neighbors.KNeighborsClassifier(n_neighbors = 20)
clf.fit(x,y) # fit the KNN model
clf
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=20, p=2,
           weights='uniform')
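
The value n_neighbors = 20 is fixed by hand here. One common way to compare candidate values before uploading anything is cross-validation on the training set. A minimal sketch, assuming synthetic data in place of `x` and `y` (the features, labels, and candidate k values below are all made up):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
# Synthetic stand-in for the Pclass / Sex / familysize columns.
X = np.column_stack([
    rng.randint(1, 4, size=200),   # Pclass in {1, 2, 3}
    rng.randint(0, 2, size=200),   # Sex in {0, 1}
    rng.randint(1, 6, size=200),   # familysize in {1, ..., 5}
])
# Made-up labels loosely tied to Sex, with roughly 20% flipped as noise.
y = np.where(rng.rand(200) < 0.8, 1 - X[:, 1], X[:, 1])

# Mean 5-fold accuracy for each candidate number of neighbours.
scores = {}
for k in (5, 20, 50):
    clf = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(clf, X, y, cv=5).mean()
print(scores)
```

With the real data, `X` and `y` would be replaced by `x` and `y` from above; the scores are only an estimate of the accuracy Kaggle will report.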
# Predict with KNN
z = clf.predict(test3.loc[:,['Pclass','Sex','familysize']])
z
array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1,
       0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0,
       0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0,
       1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1,
       0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0,
       1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1,
       0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1,
       1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0,
       1, 0, 0, 0], dtype=int64)
# Build the results table, indexed by the test-set PassengerId values 892..1309
s = np.arange(892, 1310)
s
results = pd.DataFrame(z, index=s)
results.head()
     0
892  0
893  0
894  0
895  0
896  1
# Write to csv for upload
results.to_csv('Titanic_knn.csv')
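
The competition's sample submission has two named columns, PassengerId and Survived, whereas the frames written above carry the default column name 0. A minimal sketch of building a file with those headers, using made-up values in place of `z` (`submission` is a hypothetical name):

```python
import numpy as np
import pandas as pd

passenger_ids = np.arange(892, 896)     # made-up slice of the test PassengerId values
predictions = np.array([0, 0, 1, 0])    # made-up stand-in for z

submission = pd.DataFrame({'PassengerId': passenger_ids, 'Survived': predictions})
# With no path, to_csv returns the CSV text; pass a filename to write a file instead.
csv_text = submission.to_csv(index=False)
print(csv_text)
```

`index=False` keeps the row index out of the file, so the first line is just the two column names.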