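Some of the calculation steps below are not written very concisely,
and there may be mistakes or places where my understanding falls short.
1. Is there a clear difference in survival rate between male and female passengers?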
result_survived = df.loc[:,['Sex','Survived']].groupby(['Sex','Survived']).size()
result_survived
Result:
Sex     Survived
female  0            81
        1           233
male    0           468
        1           109
dtype: int64
# Survival rate of female passengers
result_survived[(('female',1))]/(result_survived[(('female',1))] + result_survived[(('female',0))])
Result: 0.7420382165605095
# Survival rate of male passengers
result_survived[(('male',1))]/(result_survived[(('male',1))] + result_survived[(('male',0))])
Result: 0.18890814558058924
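The two rates above can also be computed in one pass by pivoting the counts into a table. The sketch below rebuilds the counts printed above instead of loading the dataset; the names `counts`, `table` and `survival_rate` are illustrative and not from the original post.

```python
import pandas as pd

# Counts copied from the groupby output above
counts = pd.Series(
    [81, 233, 468, 109],
    index=pd.MultiIndex.from_tuples(
        [("female", 0), ("female", 1), ("male", 0), ("male", 1)],
        names=["Sex", "Survived"],
    ),
)

# Move Survived into columns (0 and 1), then divide survivors by the row total
table = counts.unstack("Survived")
survival_rate = table[1] / table.sum(axis=1)
print(survival_rate)
```

On the original frame, `df.groupby('Sex')['Survived'].mean()` should give the same numbers, assuming `Survived` only holds 0 and 1.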
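2. Is there a clear wealth gap among the passengers?
# Compute the range of fares (maximum minus minimum)
# Scale every fare by 8.5085 first (presumably a currency conversion)
df['Fare'] = df['Fare'] * 8.5085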
df['Fare'].max() - df['Fare'].min()
Result: 4359.1529982
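The range depends only on the two most extreme fares, so a single very expensive ticket can dominate it. A hedged sketch of how the range could be compared with the interquartile range and the summary from `describe()`; the fares below are made-up values for illustration, not rows from the dataset.

```python
import pandas as pd

# Hypothetical fares for illustration only, not taken from the dataset
fares = pd.Series([7.25, 8.05, 13.0, 26.0, 30.0, 71.28, 512.33])

# Range: driven entirely by the cheapest and the most expensive ticket
fare_range = fares.max() - fares.min()

# Interquartile range: spread of the middle half, less sensitive to outliers
iqr = fares.quantile(0.75) - fares.quantile(0.25)

print(fare_range, iqr)
print(fares.describe())
```

If the range is far larger than the interquartile range, the gap comes from a few outliers rather than from the bulk of the passengers.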
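3. Does having relatives aboard affect the survival rate?
# SibSp is the number of siblings and spouses aboard; recode it as Y (any) or N (none)
def change_sib(x):
    if x > 0:
        return "Y"
    else:
        return "N"
df_copy['SibSp'] = df_copy['SibSp'].apply(change_sib)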
result_sibsp = df_copy.loc[:,['SibSp','Survived']].groupby(['SibSp','Survived']).size()
result_sibsp
Result:
SibSp  Survived
N      0           398
       1           210
Y      0           151
       1           132
dtype: int64
result_sibsp[('Y', 1)] / (result_sibsp[('N', 1)] + result_sibsp[('Y', 1)])  # share of all survivors who had relatives aboard
result_sibsp[('Y', 0)] / (result_sibsp[('N', 0)] + result_sibsp[('Y', 0)])  # share of all deaths who had relatives aboard
result_sibsp[('Y', 1)] / (result_sibsp[('Y', 0)] + result_sibsp[('Y', 1)])  # survival rate of passengers with relatives aboard
result_sibsp[('N', 1)] / (result_sibsp[('N', 0)] + result_sibsp[('N', 1)])  # survival rate of passengers without relatives aboard
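The four ratios above are the row-wise and column-wise proportions of one 2x2 table. A minimal sketch, rebuilt from the counts printed above; `table`, `rate_by_group` and `share_by_outcome` are illustrative names, not from the original post.

```python
import pandas as pd

# Counts copied from the groupby output above
counts = pd.Series(
    [398, 210, 151, 132],
    index=pd.MultiIndex.from_tuples(
        [("N", 0), ("N", 1), ("Y", 0), ("Y", 1)],
        names=["SibSp", "Survived"],
    ),
)
table = counts.unstack("Survived")

# Row-wise proportions: survival rate within each group (N, Y)
rate_by_group = table.div(table.sum(axis=1), axis=0)

# Column-wise proportions: share of each group among the dead (0) and the survivors (1)
share_by_outcome = table.div(table.sum(axis=0), axis=1)

print(rate_by_group)
print(share_by_outcome)
```

With these counts, passengers with a sibling or spouse aboard survived at roughly 47%, against roughly 35% for those without.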
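4. Distribution of male and female passengers by port of embarkation
# Data preprocessing
df['Embarked'].unique()  # list the distinct values in the column
df['Embarked'].value_counts()  # count the occurrences of each value
Result: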
S 644
C 168
Q 77
Name: Embarked, dtype: int64
df['Embarked'] = df['Embarked'].fillna('S')  # fill missing values with the mode, 'S'
result_embarked = df.loc[:,['Embarked','Sex']].groupby(['Embarked','Sex']).size()
result_embarked
Result:
Embarked  Sex
C         female     73
          male       95
Q         female     36
          male       41
S         female    205
          male      441
dtype: int64
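Raw counts are hard to compare because the ports differ a lot in size, so proportions per port may be easier to read. A sketch using the counts printed above; `table` and `share` are illustrative names, not from the original post.

```python
import pandas as pd

# Counts copied from the groupby output above
counts = pd.Series(
    [73, 95, 36, 41, 205, 441],
    index=pd.MultiIndex.from_product(
        [["C", "Q", "S"], ["female", "male"]], names=["Embarked", "Sex"]
    ),
)

# One row per port, one column per sex
table = counts.unstack("Sex")

# Divide each row by its total to get the female and male share per port
share = table.div(table.sum(axis=1), axis=0)
print(share.round(3))
```

On the original frame, `pd.crosstab(df['Embarked'], df['Sex'], normalize='index')` should produce the same table.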
5. Does the port of embarkation reflect the economic status of that region?
df.loc[:,['Embarked','Fare']].groupby('Embarked').mean()
Result:
                Fare
Embarked
C         510.119835
Q         112.959100
S         231.802608
df.loc[:,['Embarked','Fare']].groupby('Embarked').max()
Result:
                 Fare
Embarked
C         4359.152998
Q          765.765000
S         2237.735500
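The mean and the maximum are both pulled up by a few very expensive tickets, so the median and the group size are worth reporting alongside them. A hedged sketch with `agg`; the rows below are made up for illustration and are not from the dataset.

```python
import pandas as pd

# Hypothetical rows for illustration only, not taken from the dataset
toy = pd.DataFrame({
    "Embarked": ["C", "C", "C", "Q", "Q", "S", "S", "S"],
    "Fare": [10.0, 20.0, 300.0, 8.0, 9.0, 7.0, 15.0, 50.0],
})

# Several statistics per port in a single groupby call
summary = toy.groupby("Embarked")["Fare"].agg(["mean", "median", "max", "count"])
print(summary)
```

In the toy data, port C has a mean far above its median, which is the pattern a single outlier fare produces.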
6. Survival rate by age group
df1[df1['Age']>0]['Age'].mean()
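Result: 29.69911764705882
# Recode Age into five groups: 1 = (0, 12], 2 = (12, 24], 3 = (24, 45], 4 = (45, 60], 5 = over 60
# The codes overwrite the Age column itself; this is safe because every later
# condition starts above 12, so already-assigned codes 1 to 5 are never matched again
df1.loc[(df1['Age'] > 0) & (df1['Age'] <= 12), 'Age'] = 1
df1.loc[(df1['Age'] > 12) & (df1['Age'] <= 24), 'Age'] = 2
df1.loc[(df1['Age'] > 24) & (df1['Age'] <= 45), 'Age'] = 3
df1.loc[(df1['Age'] > 45) & (df1['Age'] <= 60), 'Age'] = 4
df1.loc[(df1['Age'] > 60), 'Age'] = 5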
df_age = df1.loc[:,['Age','Survived']].groupby(['Age','Survived']).size().to_dict()
df_age
Result:
{(1.0, 0): 29,
(1.0, 1): 40,
(2.0, 0): 130,
(2.0, 1): 78,
(3.0, 0): 325,
(3.0, 1): 186,
(4.0, 0): 48,
(4.0, 1): 33,
(5.0, 0): 17,
(5.0, 1): 5}
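The same five age groups could also be produced with `pd.cut`, which keeps the original ages untouched and leaves missing ages as NaN. This is an alternative sketch, not the method used in the post; the ages below are made up and `age_group` is an illustrative name.

```python
import pandas as pd

# Hypothetical ages for illustration only, not taken from the dataset
ages = pd.Series([0.5, 12, 13, 24, 30, 45, 46, 60, 61, 80])

# Right-closed bins matching the conditions above:
# (0, 12], (12, 24], (24, 45], (45, 60], (60, inf]
bins = [0, 12, 24, 45, 60, float("inf")]
age_group = pd.cut(ages, bins=bins, labels=[1, 2, 3, 4, 5])
print(age_group.tolist())
```

Writing the result to a new column, for example `df1['AgeGroup']`, would avoid mixing real ages and group codes in the same column.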
Then compute the survival rate for each age group.
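A minimal sketch of that last step, working directly from the dictionary printed above. The name `survival_by_age` is illustrative and not from the original post.

```python
# Counts copied from the dictionary printed above: (age group, Survived) -> count
df_age = {
    (1.0, 0): 29, (1.0, 1): 40,
    (2.0, 0): 130, (2.0, 1): 78,
    (3.0, 0): 325, (3.0, 1): 186,
    (4.0, 0): 48, (4.0, 1): 33,
    (5.0, 0): 17, (5.0, 1): 5,
}

# Survival rate per group = survivors / (survivors + deaths)
survival_by_age = {
    group: df_age[(group, 1)] / (df_age[(group, 0)] + df_age[(group, 1)])
    for group in sorted({g for g, _ in df_age})
}

for group, rate in survival_by_age.items():
    print(f"group {int(group)}: {rate:.3f}")
```

With these counts, group 1 (age 12 and under) has the highest survival rate, about 0.58, and group 5 (over 60) the lowest, about 0.23.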