Data Science Engineer Interview Guide Series, Part 2: Python Machine Learning Kaggle Case Study, Titanic Survival Prediction

Article link: https://blog.youkuaiyun.com/qq_36143300/article/details/54933662

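This article is the second part of the data science engineer interview series. It uses Python's NumPy, Pandas, and Scikit-learn libraries to run a machine learning analysis of the Titanic data. It describes the fields in the data set (passenger information), preprocesses the data, builds regression models using random sampling and decision trees, and improves model performance through feature selection.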


1. Python machine learning Kaggle case study

NumPy: Python's scientific computing library; Pandas: Python's data analysis and processing library; Scikit-learn: Python's machine learning library.

2. Overview of the Titanic data

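The fields are: passenger ID, whether the passenger survived, passenger class, name, sex, age, number of siblings/spouses aboard (SibSp), number of parents/children aboard (Parch), ticket number, fare, and port of embarkation.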
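As a quick reference, the fields above can be paired with the column names used in Kaggle's Titanic training file. This is a minimal sketch: the column names are those of the public Kaggle data set, while `COLUMNS` and `describe_columns` are hypothetical names introduced here for illustration.

```python
# Column names from Kaggle's Titanic training file, paired with the
# field descriptions listed above. COLUMNS is an illustrative name.
COLUMNS = {
    "PassengerId": "passenger ID",
    "Survived": "survived (0 = no, 1 = yes)",
    "Pclass": "passenger class (1, 2, or 3)",
    "Name": "name",
    "Sex": "sex",
    "Age": "age in years",
    "SibSp": "number of siblings/spouses aboard",
    "Parch": "number of parents/children aboard",
    "Ticket": "ticket number",
    "Fare": "fare paid",
    "Embarked": "port of embarkation (S, C, or Q)",
}


def describe_columns(names):
    """Return 'column: description' lines for the requested columns."""
    return [f"{name}: {COLUMNS[name]}" for name in names]


for line in describe_columns(["Survived", "SibSp", "Parch"]):
    print(line)
```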

3. Data preprocessing

import pandas  # run in an IPython notebook
titanic = pandas.read_csv("titanic_train.csv")
# print(titanic.head(3))  # print the first 3 rows
print(titanic.describe())  # summary statistics: count, mean, std, min, 25%, 50%, 75%, max

# Fill missing values in the Age column with the median age
titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())
print(titanic.describe())

print(titanic["Sex"].unique())  # encode male as 0 and female as 1
# Replace all the occurrences of male with the number 0.
titanic.loc[titanic["Sex"] == "male", "Sex"] = 0
titanic.loc[titanic["Sex"] == "female", "Sex"] = 1

print(titanic["Embarked"].unique())
titanic["Embarked"] = titanic["Embarked"].fillna("S")  # fill missing values with S, the most common port
titanic.loc[titanic["Embarked"] == "S", "Embarked"] = 0  # encode the ports as 0, 1, 2
titanic.loc[titanic["Embarked"] == "C", "Embarked"] = 1
titanic.loc[titanic["Embarked"] == "Q", "Embarked"] = 2
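The preprocessing steps above can also be collected into one function and tried on a small table. The sketch below is one possible arrangement, not the author's code: the `toy` frame holds made-up values standing in for `titanic_train.csv`, `preprocess` is a hypothetical helper, and `Series.map` is used in place of the `.loc` assignments to do the same encoding.

```python
import numpy as np
import pandas as pd

# Toy stand-in for titanic_train.csv; the values are made up for illustration.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, 26.0],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", np.nan, "Q"],
})


def preprocess(df):
    """Return a cleaned copy: median-filled Age, integer-coded Sex and Embarked."""
    out = df.copy()
    # Missing ages are replaced with the median of the known ages.
    out["Age"] = out["Age"].fillna(out["Age"].median())
    # male -> 0, female -> 1
    out["Sex"] = out["Sex"].map({"male": 0, "female": 1})
    # Missing ports become "S", then S/C/Q -> 0/1/2.
    out["Embarked"] = out["Embarked"].fillna("S").map({"S": 0, "C": 1, "Q": 2})
    return out


clean = preprocess(toy)
print(clean)
```

Working on a copy leaves the raw frame untouched, which makes it easy to compare the data before and after cleaning.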

4. Regression model

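#Import the linear regression class
from sklearn.linear_model import LinearRegression  # linear regression
#Sklearn also has a helper that makes it easy to do cross validation
# KFold lives in sklearn.model_selection (it was in sklearn.cross_validation before scikit-learn 0.20)
from sklearn.model_selection import KFold  # cross-validate on the training set and average the results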

#The columns we'll use to predict the target
predictors = ["Pclass","Sex","Age","SibSp","Parch","Fare","Embarked"]
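One way these imports and the predictor list might be combined is sketched below. It is an illustration under stated assumptions rather than the article's own continuation: the data is randomly generated with a made-up label rule, `cross_validated_accuracy` is a hypothetical helper, the 0.5 cut-off for turning regression outputs into survived / not survived labels is an assumption, and `KFold` is taken from `sklearn.model_selection`, where current scikit-learn releases provide it.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

# Synthetic stand-in for the preprocessed training data (random, not real passengers).
rng = np.random.default_rng(0)
n = 90
data = pd.DataFrame({
    "Pclass": rng.integers(1, 4, n),
    "Sex": rng.integers(0, 2, n),
    "Age": rng.uniform(1, 70, n),
    "SibSp": rng.integers(0, 4, n),
    "Parch": rng.integers(0, 3, n),
    "Fare": rng.uniform(5, 100, n),
    "Embarked": rng.integers(0, 3, n),
})
# Made-up label rule so that the model has a pattern to learn.
data["Survived"] = ((data["Sex"] == 1) | (data["Pclass"] == 1)).astype(int)


def cross_validated_accuracy(df, features, target="Survived", n_splits=3):
    """Fit LinearRegression on each training fold and score the held-out fold."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=1)
    predictions = np.zeros(len(df))
    for train_idx, test_idx in kf.split(df):
        model = LinearRegression()
        model.fit(df.iloc[train_idx][features], df.iloc[train_idx][target])
        predictions[test_idx] = model.predict(df.iloc[test_idx][features])
    # Assumed cut-off: outputs above 0.5 count as "survived".
    labels = (predictions > 0.5).astype(int)
    return float((labels == df[target].to_numpy()).mean())


accuracy = cross_validated_accuracy(data, predictors)
print(f"cross-validated accuracy: {accuracy:.3f}")
```

Every row is predicted exactly once by a model that never saw it during fitting, so the resulting accuracy is an out-of-sample estimate.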