Kaggle Beginner Demo: Titanic - Machine Learning from Disaster

Articles in this series
Classification study notes: KNN
Classification study notes: decision trees



Predict who survived the Titanic and get familiar with machine learning basics.

1. Download the dataset and understand the fields

First, download the dataset from the competition page: https://www.kaggle.com/c/titanic/data

import pandas as pd
import numpy as np
import random as rnd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')


train = pd.read_csv('../data/titanic/train.csv')
test = pd.read_csv('../data/titanic/test.csv')
print('Train data shape:',train.shape)
print('Test data shape:',test.shape)

Train data shape: (891, 12)
Test data shape: (418, 11)
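The training set has one more column than the test set (12 versus 11). A minimal sketch of how to check which column that is, using two tiny made-up frames in place of the real CSV files; the helper name `column_difference` and the toy rows are assumptions for illustration only:

```python
import pandas as pd

def column_difference(train_df, test_df):
    """Return the columns present in train_df but absent from test_df."""
    return [c for c in train_df.columns if c not in test_df.columns]

# Made-up stand-in rows carrying only a subset of the Titanic columns.
toy_train = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1], "Pclass": [3, 1]})
toy_test = pd.DataFrame({"PassengerId": [3, 4], "Pclass": [2, 3]})

only_in_train = column_difference(toy_train, toy_test)
print(only_in_train)
```

Calling `column_difference(train, test)` on the real frames should point at the label column, since the test set is what we have to predict.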

# Start by understanding what each field means. The info() output below shows that Age, Cabin, and Embarked contain missing values in the training set.
# survival	Survival	0 = No, 1 = Yes (the label we predict)
# pclass	Ticket class	1 = 1st, 2 = 2nd, 3 = 3rd (passenger class)
# sex	Sex	
# Age	Age in years	
# sibsp	# of siblings / spouses aboard the Titanic	 (a count, not a yes/no flag)
# parch	# of parents / children aboard the Titanic	 (a count covering both parents and children)
# ticket	Ticket number	
# fare	Passenger fare	
# cabin	Cabin number	
# embarked	Port of Embarkation	C = Cherbourg, Q = Queenstown, S = Southampton
train.info()


""" Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
"""
test.info()
"""
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB"""

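In the training data (train.csv), the fields with missing values are Age, Cabin, and Embarked.
In the test data, the fields with missing values are Age, Cabin, and Fare.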
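Reading the missing fields off the `info()` output works, but the same result can be computed directly. The following is a minimal sketch under stated assumptions: `missing_summary` is a hypothetical helper, and the toy frame holds made-up values rather than real Titanic rows.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return missing-value counts and ratios for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({"missing": counts, "ratio": counts / len(df)})
    # Keep only columns with at least one missing value, largest first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Made-up stand-in data: Age and Cabin have gaps, Fare is complete.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

toy_summary = missing_summary(toy)
print(toy_summary)
```

Running `missing_summary(train)` and `missing_summary(test)` on the real frames should list the same fields as the two sentences above, together with how many values each one is missing.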

train.describe() # distribution of the numeric features


train.describe(include=['O']) # distribution of the categorical (object-typed) features


pd.concat([train.head(3), train.tail(3)]) # first and last 3 rows; DataFrame.append was removed in pandas 2.0


# Visualize the share of missing values per field
missing = train.isnull().sum()/len(train)
missing = missing[missing > 0]
missing.sort_values(inplace=True)
missing.plot.bar()


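That completes the first step, inspecting the samples. The main takeaway is which fields have missing values: Age, Cabin, and Embarked in the training set, and Age, Cabin, and Fare in the test set.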
