[Kaggle Practice Log 01] Titanic - Machine Learning

The author shares a first pass at the Kaggle Titanic survival-prediction competition, covering data exploration, feature engineering, and model training and comparison. The post walks through handling missing values, creating new features, examining how ticket class, sex, and age relate to survival, and applying several machine-learning models (logistic regression, support vector machines, random forests, and more), ending with a comparison of their prediction accuracies.



title : [Kaggle Practice Log 01] Titanic - Machine Learning from Disaster
date : 2023/4/7
tags : machine learning, practice log
author : Linno


[Kaggle Practice Log 01] Titanic - Machine Learning

It's been ages since I last blogged. This is a new series built from the code I wrote while working through Kaggle (updated irregularly).

Disclaimer for self-preservation: I'm still a beginner, so treat this post as reference only.

Link: https://www.kaggle.com/competitions/titanic

Problem description

This is a binary classification problem. Both the training and test sets provide the following features:

PassengerId : passenger ID
Pclass : ticket class
Name : passenger name
Sex : passenger sex
Age : passenger age
SibSp : number of siblings and spouses aboard
Parch : number of parents and children aboard
Ticket : ticket number
Fare : ticket fare
Cabin : cabin number (mostly missing)

The training set additionally provides the label Survived, indicating whether the passenger survived.

**Prediction task:** use your trained model to predict survival for the test passengers and submit the predictions as a CSV file; submissions are scored by accuracy.
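
For reference, the expected upload is a two-column CSV of PassengerId and Survived, which is exactly the shape the submission code at the end of this post produces. For example:

PassengerId,Survived
892,0
893,1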

Importing libraries and data

We'll need a few visualization libraries, and all of the common machine-learning models are worth trying; here I use scikit-learn as the example framework.

# Import libraries
import pandas as pd
import numpy as np
import random as rnd

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Machine-learning models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC,LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier

Read the training and test sets.

# Load the data
train_df = pd.read_csv('D:/Documents/titanic/train.csv')
test_df = pd.read_csv('D:/Documents/titanic/test.csv')

train_df.head() # preview the column structure and dtypes
# test_df.head() works the same way

Check the non-null count and dtype of each feature:

train_df.info()
print("*"*40)
test_df.info()

Look at the distribution of numeric features (int and float):

train_df.describe()

Look at the distribution of non-numeric (object-type) features:

train_df.describe(include=["O"])

The steps above are just to get a rough overview; a general sense of the data is enough.
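
If you want the missing-value counts directly instead of reading them off info(), a quick tabulation (a small sketch, assuming train_df and test_df are loaded as above) is:

# Count missing values per column in both sets
print(train_df.isnull().sum())
print(test_df.isnull().sum())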

Data preprocessing

What follows covers both data cleaning and data visualization.

Ticket class vs. survival

It's natural to suspect that the higher classes had better survival rates, and the plot makes the relationship obvious.

# Crosstab of Pclass vs. survival counts
Pclass_Survived = pd.crosstab(train_df['Pclass'],train_df['Survived'])
print(Pclass_Survived)
# Bar chart of survival counts by Pclass
Pclass_Survived.plot(kind='bar')
plt.xticks(rotation=360)
plt.show()
# Line chart of survival rate by Pclass
Pclass_Survived_rate = train_df[["Pclass","Survived"]].groupby(["Pclass"],as_index=True).mean().sort_values(by="Pclass",ascending=False)
Pclass_Survived_rate.plot(kind='line')
plt.xticks(range(1,4))
plt.show()

Sex vs. survival

Applying the same treatment to sex confirms that women survived at a much higher rate.

# Crosstab of sex vs. survival counts
Sex_Survived = pd.crosstab(train_df['Sex'],train_df['Survived'])
Sex_Survived
Sex_Survived.plot(kind='bar')
plt.xticks(rotation=360)
plt.show() # the x-axis shows female and male

# Survival rate by sex
train_df[["Sex","Survived"]].groupby(["Sex"],as_index=False).mean().sort_values(by="Survived",ascending=False)

Handling missing Age values

Missing ages are filled with the median.

# Handle missing ages (fill with the median)
Agemedian=train_df['Age'].median() # median age as the fill value
train_df.Age.fillna(Agemedian,inplace=True) # fill in place
train_df.reset_index(inplace = True) # reset the index

Age should be binned first; survival statistics for age bands with too few passengers aren't meaningful.

# Bin the ages and plot a bar chart of survival counts per age band
bins = [0,9,18,27,36,45,54,63,72,81,90]
train_df['GroupAge'] = pd.cut(train_df.Age,bins)

GroupAge_Survived = pd.crosstab(train_df['GroupAge'],train_df['Survived'])
GroupAge_Survived.plot(kind='bar',figsize=(10,6))

plt.xticks(rotation=360)
plt.title('Survived status by GroupAge')

# Line chart of survival rate by age band

# Survivors per age band
GroupAge_Survived_1 = GroupAge_Survived[1]
# Total passengers per age band
GroupAge_all = GroupAge_Survived.sum(axis=1)
# Survival rate per age band
GroupAge_Survived_rate = round(GroupAge_Survived_1/GroupAge_all,2)
GroupAge_Survived_rate.plot(figsize=(10,6))
plt.show()

You can see that children and the elderly survived at higher rates (one band has so few passengers that it drags the curve down).
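
To confirm which band is sparse, the per-band counts can be printed next to the rates (a quick check reusing GroupAge_all and GroupAge_Survived_rate from the block above):

# Sample count vs. survival rate per age band
print(pd.concat([GroupAge_all.rename('Count'), GroupAge_Survived_rate.rename('SurvivalRate')], axis=1))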

Repeating for the other numeric features

The code follows exactly the same pattern, so no further commentary; it's pasted below.

# Crosstab of SibSp vs. survival counts
SibSp_Survived = pd.crosstab(train_df['SibSp'],train_df['Survived'])
SibSp_Survived
SibSp_Survived.plot(kind='bar')
plt.xticks(rotation=360)
plt.show()
# Survival rate by SibSp
SibSp_Survived = train_df[["SibSp","Survived"]].groupby(["SibSp"],as_index=True).mean().sort_values(by="SibSp")
SibSp_Survived.plot(kind='line')
plt.show()
# Crosstab of Parch vs. survival counts
Parch_Survived = pd.crosstab(train_df['Parch'],train_df['Survived'])
Parch_Survived
Parch_Survived.plot(kind='bar')
plt.xticks(rotation=360)
plt.show()
# Survival rate by Parch
Parch_Survived = train_df[["Parch","Survived"]].groupby(["Parch"],as_index=True).mean().sort_values(by="Parch")
Parch_Survived.plot(kind='line')
plt.show()
# Relationship between fare and survival
# Bin the ticket fares and build a crosstab of fare band vs. survival counts
bins = [0,60,120,180,240,300,360,420,480,540,600]
train_df['GroupFare']=pd.cut(train_df.Fare,bins,right=False)
GroupFare_Survived = pd.crosstab(train_df['GroupFare'],train_df['Survived'])
GroupFare_Survived
# Clustered bar chart of survival counts by fare band
GroupFare_Survived.plot(kind='bar',figsize=(10,6))
plt.xticks(rotation=360) # adjust tick rotation
plt.title('Survived status by GroupFare')

GroupFare_Survived.iloc[2:].plot(kind='bar',figsize=(10,6))
plt.xticks(rotation=360)
plt.title('Survived status by GroupFare(Fare>=120)')
# Line chart of survival rate by fare band

GroupFare_Survived_1 = GroupFare_Survived[1] # survivors per fare band
GroupFare_all = GroupFare_Survived.sum(axis=1)
GroupFare_Survived_rate = round(GroupFare_Survived_1/GroupFare_all,2)
GroupFare_Survived_rate.plot()
plt.show()

With the analysis done, the derived GroupAge and GroupFare columns (plus the index column) can be dropped.

train_df=train_df.drop(["index","GroupAge","GroupFare"],axis=1)
train_df.head()
其他无关字段

Ticket is just a ticket identifier, so it has no bearing on survival. Cabin is dropped because the binary HasCabin feature replaces it.

# Split passengers into two classes by whether they had a cabin
train_df['HasCabin'] = train_df["Cabin"].apply(lambda x: 0 if type(x)==float else 1)
test_df['HasCabin'] = test_df["Cabin"].apply(lambda x: 0 if type(x)==float else 1)
train_df.head()
train_df = train_df.drop(["Ticket","Cabin"],axis=1)
test_df = test_df.drop(["Ticket","Cabin"],axis=1)
combine = [train_df,test_df]
print(train_df.shape,test_df.shape,combine[0].shape,combine[1].shape)
Names carry some signal too!!!

Notice the Mr, Mrs, and so on inside the names? Titles are clearly a relevant factor. The Name field is also worth using because it carries the passenger's title, and a longer name tends to mean a longer title, which may indicate higher social standing.

# Create a name-length feature in both train and test sets
train_df['NameLength'] = train_df['Name'].apply(len)
test_df['NameLength'] = test_df['Name'].apply(len)

Create a Title feature from the name; it captures both sex and social class.

dataset.Name.str.extract(r'([A-Za-z]+)\.') extracts the run of letters that ends with a period (i.e., the title), which we then cross-tabulate against sex to see which titles belong to men vs. women, making the later grouping easier.

for dataset in combine:
    dataset['Title'] = dataset.Name.str.extract(r'([A-Za-z]+)\.',expand=False)
pd.crosstab(train_df['Title'],train_df['Sex']).sort_values(by=["male","female"],ascending=False)

Group the titles into Mr, Miss, Mrs, Master, Rare_Male, and Rare_Female (rare titles are split by sex).

for dataset in combine:
    dataset["Title"] = dataset["Title"].replace(['Lady','Countess','Dona'],"Rare_Female")
    dataset["Title"] = dataset["Title"].replace(['Capt','Col','Don','Dr','Major','Rev','Sir','Jonkheer',],'Rare_Male')
    dataset["Title"] = dataset["Title"].replace('Mlle','Miss')
    dataset["Title"] = dataset["Title"].replace('Ms','Miss')
    dataset["Title"] = dataset["Title"].replace('Mme',"Miss")

Plot survival rate by title

# Mean of Survived grouped by Title, to check the relationship
T_S = train_df[["Title","Survived"]].groupby(["Title"],as_index=False).mean().sort_values(by='Survived',ascending=True)
plt.figure(figsize=(10,6))
plt.bar(T_S['Title'],T_S['Survived'])

PS: Passengers titled Miss, Mrs, or Rare_Female survived at noticeably higher rates; it seems "women first" held during the evacuation.

# Map the Title feature to numeric values
title_mapping = {"Mr":1,"Miss":2,"Mrs":3,"Master":4,"Rare_Female":5,"Rare_Male":6}
for dataset in combine :
    dataset["Title"] = dataset["Title"].map(title_mapping)
    dataset["Title"] = dataset["Title"].fillna(0)
    # routine guard against missing values
train_df.head()

With the information extracted, drop the Name column (and PassengerId from the training set).

train_df = train_df.drop(["Name","PassengerId"],axis=1)
test_df = test_df.drop(['Name'],axis=1)
train_df.head()

# Rebuild combine every time columns are dropped
combine = [train_df,test_df]
combine[0].shape,combine[1].shape

Convert Sex to numeric: female becomes 1 and male becomes 0.

for dataset in combine:
    dataset["Sex"] = dataset["Sex"].map({"female":1,"male":0}).astype(int) # astype(int) avoids dtype issues
train_df.head()

guess_ages = np.zeros((6,3)) # placeholder array for the Age imputation below
guess_ages

# Fill missing Age values
# using the median Age of passengers sharing the same Pclass and Title (if a group's median is empty, fall back to the Title-wide median)
for dataset in combine:
    # median for each of the 6 Title x 3 Pclass combinations
    for i in range(0,6):
        for j in range(0,3):
            guess_title_df = dataset[dataset["Title"]==i+1]["Age"].dropna()
            guess_df = dataset[(dataset["Title"]==i+1)&(dataset["Pclass"]==j+1)]['Age'].dropna()
            # age_mean = guess_df.mean()
            # age_std = guess_df.mean()
            # age_guess = rnd.uniform(age_mean - age_std,age_mean + age_std)
            age_guess = guess_df.median() if ~np.isnan(guess_df.median()) else guess_title_df.median()
            #print(i,j,guess_df.median(),guess_title_df.median(),age_guess)
            guess_ages[i,j]=int(age_guess/0.5+0.5)*0.5
    # fill Age for each of the 6 x 3 combinations
    for i in range(0,6):
        for j in range(0,3):
            dataset.loc[(dataset.Age.isnull()) & (dataset.Title == i+1) & (dataset.Pclass == j+1),'Age'] = guess_ages[i,j]
    dataset['Age'] = dataset['Age'].astype(int)
    
    
train_df.head()
Dividing Age into bands

Create an age-band feature.

pd.cut splits by value into equal-width bins, so every bin spans the same range but may hold different numbers of samples; pd.qcut splits by sample quantiles, so every bin holds roughly the same number of samples.
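
A minimal illustration of the difference, on toy data (not part of the original notebook):

s = pd.Series([1, 2, 3, 4, 5, 100])
print(pd.cut(s, 2).value_counts())   # equal-width bins: five values land in the lower bin
print(pd.qcut(s, 2).value_counts())  # equal-frequency bins: three values in each bin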

train_df['AgeBand'] = pd.qcut(train_df['Age'],8)
print(train_df.head())
train_df[['AgeBand','Survived']].groupby(['AgeBand'],as_index=False).mean().sort_values(by='AgeBand',ascending=True)

Convert the age bands to ordinal values
for dataset in combine:
    dataset.loc[dataset['Age'] <= 17,'Age'] = 0
    dataset.loc[(dataset['Age'] > 17) & (dataset['Age'] <= 21),'Age'] = 1
    dataset.loc[(dataset['Age'] > 21) & (dataset['Age'] <= 25),'Age'] = 2
    dataset.loc[(dataset['Age'] > 25) & (dataset['Age'] <= 26),'Age'] = 3
    dataset.loc[(dataset['Age'] > 26) & (dataset['Age'] <= 31),'Age'] = 4
    dataset.loc[(dataset['Age'] > 31) & (dataset['Age'] <= 36.5),'Age'] = 5
    dataset.loc[(dataset['Age'] > 36.5) & (dataset['Age'] <= 45),'Age'] = 6
    dataset.loc[(dataset['Age'] > 45),'Age'] = 7
train_df.head()

Remove the AgeBand feature
train_df = train_df.drop('AgeBand',axis=1)
combine = [train_df,test_df]
train_df.head()

Create a FamilySize feature by combining Parch and SibSp.
for dataset in combine:
    dataset["FamilySize"] = dataset["Parch"]+dataset["SibSp"] + 1
FamilySize_Survived = train_df[["FamilySize","Survived"]].groupby(["FamilySize"],as_index = True).mean().sort_values(by="FamilySize")
FamilySize_Survived.plot(kind="line")
plt.xticks(range(12))
plt.show()

Create an IsAlone feature marking passengers who traveled alone
for dataset in combine : 
    dataset["IsAlone"] = 0
    dataset.loc[dataset["FamilySize"]==1,"IsAlone"]=1
IsAlone_Survived = train_df[["IsAlone","Survived"]].groupby(["IsAlone"],as_index=True).mean().sort_values(by="IsAlone")
IsAlone_Survived.plot(kind="bar")
plt.xticks(rotation=360)
plt.show()

Remove the Parch and SibSp columns
train_df = train_df.drop(["Parch","SibSp"],axis=1)
test_df = test_df.drop(["Parch","SibSp"],axis=1)
combine = [train_df,test_df]
train_df.head()

Embarkation port

# Fill missing Embarked values: first find the most common embarkation port
freq_port = train_df["Embarked"].dropna().mode()[0]
freq_port

Handle the missing values (fill with the mode)!!!
for dataset in combine :
    dataset["Embarked"] = dataset["Embarked"].fillna(freq_port)

# Build the crosstab
Embarked_Survived = pd.crosstab(train_df['Embarked'],train_df['Survived'])
Embarked_Survived

Bar chart of survival counts by port

Embarked_Survived.plot(kind='bar')
plt.xticks(rotation=360)
plt.show()

Survival rate by port

Embarked_Survived = train_df[["Embarked","Survived"]].groupby(["Embarked"],as_index=True).mean().sort_values(by="Embarked",ascending=True)
Embarked_Survived.plot(kind="bar")
plt.xticks(rotation=360)
plt.show()

Convert Embarked to numeric

for dataset in combine:
    dataset["Embarked"] = dataset["Embarked"].map({"S":0,"C":1,"Q":2}).astype(int)
train_df.head()

Fill the missing Fare value in the test set with the median

test_df["Fare"].fillna(test_df["Fare"].dropna().median(),inplace=True)
test_df.info()

Create a FareBand interval feature

train_df["FareBand"] = pd.qcut(train_df["Fare"],4)
train_df[["FareBand","Survived"]].groupby(["FareBand"],as_index=False).mean().sort_values(by="FareBand",ascending=True)

Convert fares to the index of their band

# Convert Fare to ordinal values based on FareBand
for dataset in combine:
    dataset.loc[dataset['Fare'] <=7.91,'Fare'] = 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454),'Fare'] = 1
    dataset.loc[(dataset['Fare'] >14.454) & (dataset['Fare'] <=31),'Fare'] = 2
    dataset.loc[dataset['Fare']>31,'Fare'] = 3
    dataset['Fare'] = dataset['Fare'].astype(int)

# Remove FareBand
train_df = train_df.drop(['FareBand'],axis=1)
combine = [train_df,test_df]

train_df.head(10)

Visualizing feature correlations

# Use a seaborn heatmap to visualize pairwise feature correlations
colormap = plt.cm.RdBu
plt.figure(figsize=(14,12))
plt.title('Pearson Correlation of Features',y=1.05,size=15)
sns.heatmap(train_df.astype(float).corr(),linewidths=0.1,vmax=1.0,
            square=True,cmap=colormap,linecolor='white',annot=True)
plt.show()

Model training and comparison

Prepare the training data.

X_train = train_df.drop(['Survived'],axis=1)
Y_train = train_df["Survived"]
X_test = test_df.drop("PassengerId",axis=1).copy()
X_train

# X_train.shape,Y_train.shape,X_test.shape

Logistic Regression

logreg = LogisticRegression()
logreg.fit(X_train,Y_train)
Y_pred_logreg = logreg.predict(X_test)
acc_log = round(logreg.score(X_train,Y_train)*100,2)

# Y_pred_logreg # predictions
acc_log # training accuracy

Inspect the feature coefficients (the code labels the column "Correlation")

coeff_df = pd.DataFrame(train_df.columns.delete(0))
coeff_df.columns = ['Feature']
coeff_df["Correlation"] = pd.Series(logreg.coef_[0])

coeff_df.sort_values(by='Correlation',ascending=False)
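
Since these are logistic-regression coefficients (log-odds) rather than true correlations, one optional reading (a sketch, not in the original notebook) is to exponentiate them into odds ratios:

# exp() turns log-odds coefficients into odds ratios; values > 1 raise the odds of survival
coeff_df['OddsRatio'] = np.exp(coeff_df['Correlation'])
print(coeff_df.sort_values(by='OddsRatio', ascending=False))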

Support Vector Machines

svc = SVC()
svc.fit(X_train,Y_train)
Y_pred_svc = svc.predict(X_test)
acc_svc = round(svc.score(X_train,Y_train)*100,2)
# Y_pred_svc
acc_svc

KNN (k-nearest neighbors)

knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train,Y_train)
Y_pred_knn = knn.predict(X_test)
acc_knn = round(knn.score(X_train,Y_train)*100,2)
# Y_pred_knn
acc_knn

Gaussian Naive Bayes

gaussian = GaussianNB()
gaussian.fit(X_train,Y_train)
Y_pred_gaussian = gaussian.predict(X_test)
acc_gaussian = round(gaussian.score(X_train,Y_train)*100,2)
print(Y_pred_gaussian)
acc_gaussian

Perceptron

perceptron = Perceptron()
perceptron.fit(X_train,Y_train)
Y_pred_perceptron = perceptron.predict(X_test)
acc_perceptron = round(perceptron.score(X_train,Y_train)*100,2)
acc_perceptron

Linear SVC

linear_svc = LinearSVC()
linear_svc.fit(X_train,Y_train)
Y_pred_linear_svc = linear_svc.predict(X_test)
acc_linear_svc = round(linear_svc.score(X_train,Y_train)*100,2)
print(Y_pred_linear_svc)
acc_linear_svc

Stochastic Gradient Descent

sgd = SGDClassifier()
sgd.fit(X_train,Y_train)
Y_pred_sgd = sgd.predict(X_test)
acc_sgd = round(sgd.score(X_train,Y_train)*100,2)
print(Y_pred_sgd)
acc_sgd

Decision Tree

decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train,Y_train)
Y_pred_decision_tree = decision_tree.predict(X_test)
acc_decision_tree = round(decision_tree.score(X_train,Y_train)*100,2)
print(Y_pred_decision_tree)
acc_decision_tree

Random Forest (with a held-out split and grid search)

from sklearn.model_selection import train_test_split
X_all = train_df.drop(['Survived'],axis=1)
y_all = train_df['Survived']

num_test = 0.20
X_train,X_test,y_train,y_test = train_test_split(X_all,y_all,test_size=num_test,random_state=23)

# Random Forest
from sklearn.metrics import make_scorer,accuracy_score
from sklearn.model_selection import GridSearchCV
random_forest = RandomForestClassifier()

parameters = {'n_estimators':[4,6,9],
              'max_features':['log2','sqrt','auto'],
              'criterion':['entropy','gini'],
              'max_depth':[2,3,5,10],
              'min_samples_split':[2,3,5],
              'min_samples_leaf':[1,5,8]
             }
acc_scorer = make_scorer(accuracy_score)
grid_obj = GridSearchCV(random_forest,parameters,scoring=acc_scorer)
grid_obj = grid_obj.fit(X_train,y_train)
clf = grid_obj.best_estimator_
clf.fit(X_train,y_train)
pred = clf.predict(X_test)
acc_random_forest_split = round(accuracy_score(y_test,pred)*100,2) # scale to a percentage, comparable with the other scores
print(pred)
acc_random_forest_split
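
It's also worth printing which hyperparameters the grid search actually chose (a small addition, using the grid_obj fitted above):

# Inspect the winning hyperparameter combination and its cross-validated score
print(grid_obj.best_params_)
print(grid_obj.best_score_)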

k-fold cross-validation

from sklearn.model_selection import KFold 

def run_kfold(clf):
    kf = KFold(n_splits=10, random_state=233, shuffle=True)
    outcomes = []
    fold = 0
    for train_index,test_index in kf.split(train_df):
        fold += 1
        X_train,X_test = X_all.values[train_index],X_all.values[test_index]
        y_train,y_test = y_all.values[train_index],y_all.values[test_index]
        clf.fit(X_train,y_train)
        predictions = clf.predict(X_test)
        accuracy = accuracy_score(y_test,predictions)
        outcomes.append(accuracy)
    mean_outcome = np.mean(outcomes)
    print("Mean Accuracy:{0}".format(mean_outcome))

run_kfold(clf)
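
For comparison, sklearn's cross_val_score condenses the same loop into a single call (a sketch under the same KFold settings):

from sklearn.model_selection import cross_val_score
# Equivalent 10-fold accuracy estimate in one line
scores = cross_val_score(clf, X_all, y_all,
                         cv=KFold(n_splits=10, random_state=233, shuffle=True),
                         scoring='accuracy')
print("Mean Accuracy:{0}".format(scores.mean()))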

Final results

Model comparison (note: the Random Forest score below is held-out validation accuracy, while the rest are training accuracy)
Y_pred_random_forest_split = clf.predict(test_df.drop("PassengerId",axis=1))

models = pd.DataFrame({
    'Model':['SVM','KNN','Logistic Regression','Random Forest',
             'Naive Bayes','Perceptron','SGD','Linear SVC','Decision Tree'],
    'Score':[acc_svc,acc_knn,acc_log,acc_random_forest_split,
             acc_gaussian,acc_perceptron,acc_sgd,acc_linear_svc,acc_decision_tree]
})
M_s = models.sort_values(by='Score',ascending=False)
M_s

Make a bar chart of the scores.

plt.figure(figsize=(20,8),dpi=80)
plt.bar(M_s['Model'],M_s['Score'])
plt.show()

Saving the results

# Import the time module; a timestamp goes into the file name
import time

# Use the random-forest predictions for the final submission
tim = time.strftime('%Y%m%d%H%M',time.localtime(time.time()))
submission = pd.DataFrame({
    "PassengerId":test_df['PassengerId'],
    "Survived":Y_pred_random_forest_split
})
submission.to_csv('C:/Users/12645/Kaggle/submission_random_forest_' + tim + '.csv',index=False)

# Save the decision-tree predictions
tim = time.strftime('%Y%m%d%H%M',time.localtime(time.time()))
submission = pd.DataFrame({
    "PassengerId":test_df['PassengerId'],
    "Survived":Y_pred_decision_tree
})
submission.to_csv('C:/Users/12645/Kaggle/submission_decision_tree_' + tim + '.csv',index=False)

# Save the KNN predictions
tim = time.strftime('%Y%m%d%H%M',time.localtime(time.time()))
submission = pd.DataFrame({
    "PassengerId":test_df['PassengerId'],
    "Survived":Y_pred_knn
})
submission.to_csv('C:/Users/12645/Kaggle/submission_knn_' + tim + '.csv',index=False)

# Save the SVC predictions
tim = time.strftime('%Y%m%d%H%M',time.localtime(time.time()))
submission = pd.DataFrame({
    "PassengerId":test_df['PassengerId'],
    "Survived":Y_pred_svc
})
submission.to_csv('C:/Users/12645/Kaggle/submission_svc_' + tim + '.csv',index=False)

# Save the SGD predictions
tim = time.strftime('%Y%m%d%H%M',time.localtime(time.time()))
submission = pd.DataFrame({
    "PassengerId":test_df['PassengerId'],
    "Survived":Y_pred_sgd
})
submission.to_csv('C:/Users/12645/Kaggle/submission_sgd_' + tim + '.csv',index=False)

# Save the Linear SVC predictions
tim = time.strftime('%Y%m%d%H%M',time.localtime(time.time()))
submission = pd.DataFrame({
    "PassengerId":test_df['PassengerId'],
    "Survived":Y_pred_linear_svc
})
submission.to_csv('C:/Users/12645/Kaggle/submission_linear_svc_' + tim + '.csv',index=False)

# Save the logistic-regression predictions
tim = time.strftime('%Y%m%d%H%M',time.localtime(time.time()))
submission = pd.DataFrame({
    "PassengerId":test_df['PassengerId'],
    "Survived":Y_pred_logreg
})
submission.to_csv('C:/Users/12645/Kaggle/submission_logreg_' + tim + '.csv',index=False)

Submission score

0.74401
