Predicting Titanic Survival with a Random Forest in Python

Latest recommended article published on 2023-01-12 10:58:54

License: CC 4.0 BY-SA

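This article shows how to apply machine learning to the Kaggle Titanic dataset, using a random forest model to predict whether each passenger survived. The data is first preprocessed by filling in missing values (such as age and port of embarkation), and the non-numeric features (sex and port of embarkation) are then converted to numbers. GridSearchCV is used to tune the hyperparameters and find the best combination. Finally, the trained model predicts the outcomes for the test set and the results are written to a file.

Kaggle hosts a classic problem: given basic information about the passengers on board, predict whether each passenger in the test set survived. Let's get straight to it.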

Download the dataset

The data is available from Kaggle or via the author's dataset download link.

The download includes the source code, the training set, and the test set.
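
The code in the following sections calls a `load_data` helper that the article itself does not show. A minimal sketch of what such a helper might look like, assuming the CSV files use Kaggle's standard names `train.csv` and `test.csv`; the `data_dir` parameter is an assumption added for illustration and may not match the downloadable source:

```python
from pathlib import Path

import pandas as pd


def load_data(name="train", data_dir="."):
    """Read <data_dir>/<name>.csv into a DataFrame.

    Hypothetical helper: the file layout and the data_dir parameter are
    assumptions, not taken from the article's downloadable source.
    """
    return pd.read_csv(Path(data_dir) / f"{name}.csv")
```

With this version, `load_data()` reads the training set and `load_data("test")` reads the test set, which matches how the helper is called later on.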

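Clean the data

This step mainly deals with missing values:
Numeric fields such as age are filled with the column mean.
The port of embarkation is filled with the mode (the most frequent value).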

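def select_data():
    # load_data() is not shown in this article; it returns the train or test DataFrame
    selected_features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
    train_data = load_data()
    test_data = load_data("test")
    # .copy() so the fills below modify independent frames, not views of the raw data
    train_x = train_data[selected_features].copy()
    train_y = train_data["Survived"]
    test_x = test_data[selected_features].copy()

    # Fill missing ages with the mean age of the training set
    train_x["Age"] = train_x["Age"].fillna(train_x["Age"].mean())
    # 'S' is the most frequent port of embarkation, so use it for missing values
    train_x["Embarked"] = train_x["Embarked"].fillna('S')
    # Fill missing ages and fares in the test set with the test-set means
    test_x["Age"] = test_x["Age"].fillna(test_x["Age"].mean())
    test_x["Fare"] = test_x["Fare"].fillna(test_x["Fare"].mean())

    train_x = format_data(train_x)
    test_x = format_data(test_x)
    test_x.info()  # info() prints its summary itself and returns None
    return train_x, train_y, test_x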
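
The fill-in logic can be tried on a toy frame first. The sketch below uses made-up values and derives the mode from the data instead of hard-coding `'S'`; that is a variation for illustration, not the article's exact code:

```python
import numpy as np
import pandas as pd

# Toy frame with gaps; the values are invented for illustration.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0],
    "Embarked": ["S", None, "S"],
})

# Numeric column: replace missing values with the column mean.
# mean() skips NaN by default, so only the observed ages are averaged.
df["Age"] = df["Age"].fillna(df["Age"].mean())

# Categorical column: replace missing values with the most frequent value.
# mode() returns a Series because ties are possible; take the first entry.
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

print(df)
```

Assigning the result of `fillna` back to the column avoids the chained `inplace=True` pattern, which can trigger pandas' `SettingWithCopyWarning` when the frame is a slice of another frame.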

Encode the data numerically

Sex and port of embarkation are converted to numbers so the model can train on them.

def format_data(data):
    # Work on a copy so the caller's DataFrame is left unchanged
    data = data.copy()
    # Encode sex: male -> 0, female -> 1
    data["Sex"] = data["Sex"].map({"male": 0, "female": 1})
    # Encode port of embarkation: S -> 0, C -> 1, Q -> 2
    data["Embarked"] = data["Embarked"].map({"S": 0, "C": 1, "Q": 2})
    return data
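
Integer codes such as S=0, C=1, Q=2 imply an ordering that the ports do not really have. Tree-based models usually cope with this, but one-hot encoding is a common alternative. The sketch below shows both on invented toy data; the one-hot variant is offered as an option and is not what the article's code does:

```python
import pandas as pd

# Toy data; the values are invented for illustration.
df = pd.DataFrame({
    "Sex": ["male", "female", "male"],
    "Embarked": ["S", "C", "Q"],
})

# Integer encoding, equivalent to the mapping used in the article.
encoded = df.assign(
    Sex=df["Sex"].map({"male": 0, "female": 1}),
    Embarked=df["Embarked"].map({"S": 0, "C": 1, "Q": 2}),
)

# One-hot alternative: one indicator column per port, no implied order.
one_hot = pd.get_dummies(df, columns=["Embarked"])

print(encoded)
print(one_hot.columns.tolist())
```

Note that `map` turns any value missing from the dictionary into NaN, so missing values should be filled before encoding.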

Train the model, predict, and write the results to a file

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

def random_forest():
    test_data = load_data('test')  # reloaded to get the PassengerId column
    x_train, y_train, x_test = select_data()
    model = RandomForestClassifier()
    # Hyperparameter grid: 9 x 2 x 9 = 162 combinations
    paras = {'n_estimators': np.arange(10, 100, 10), 'criterion': ['gini', 'entropy'], 'max_depth': np.arange(5, 50, 5)}
    # 5-fold cross-validation, using all CPU cores
    gs = GridSearchCV(model, paras, cv=5, verbose=1, n_jobs=-1)
    gs.fit(x_train, y_train)
    y_pre = gs.predict(x_test)
    print('best score:', gs.best_score_)
    print('best parameters:', gs.best_params_)
    # Write one "PassengerId,Survived" row per test passenger
    with open('./result.csv', 'w', encoding="utf-8") as f:
        f.write("PassengerId,Survived\n")
        for passenger_id, survived in zip(test_data["PassengerId"], y_pre):
            f.write(f"{passenger_id},{survived}\n")
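
The grid above has 9 × 2 × 9 = 162 parameter combinations, so with 5-fold cross-validation the search fits 810 models before the final refit. To see how `GridSearchCV` behaves without waiting for that, here is a minimal sketch on synthetic data. The dataset, the reduced grid, and the fixed `random_state` are assumptions chosen for a quick run, not the article's settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the Titanic features (7 columns, binary target).
X, y = make_classification(n_samples=200, n_features=7, random_state=0)

# A deliberately small grid (2 x 2 = 4 combinations) so the search is fast.
param_grid = {"n_estimators": [10, 30], "max_depth": [3, 5]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),  # fixed seed for repeatable runs
    param_grid,
    cv=5,  # 5-fold cross-validation, as in the article
)
search.fit(X, y)

# best_score_ is the mean cross-validated accuracy of the best combination.
print("best score:", search.best_score_)
print("best parameters:", search.best_params_)

# With the default refit=True, the best combination is retrained on all of
# X and y, and predict() uses that refitted model.
predictions = search.predict(X[:5])
print(predictions)
```

Because `best_score_` is measured on held-out folds of the training data, it only estimates how the model will do on Kaggle's test set; the leaderboard score can differ.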