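Articles in this series
Classification study notes: KNN
Classification study notes: decision trees
Predicting survival on the Titanic to get familiar with ML basics
1. Download the dataset and understand the fields
First, download the dataset from the competition page: https://www.kaggle.com/c/titanic/data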
import pandas as pd
import numpy as np
import random as rnd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
train = pd.read_csv('../data/titanic/train.csv')
test = pd.read_csv('../data/titanic/test.csv')
print('Train data shape:',train.shape)
print('Test data shape:',test.shape)
Train data shape: (891, 12)
Test data shape: (418, 11)
# First, understand what each field means. In the training set, Age, Cabin and Embarked contain missing values.
# survival Survival 0 = No, 1 = Yes (our label)
# pclass Ticket class 1 = 1st, 2 = 2nd, 3 = 3rd
# sex Sex
# Age Age in years
# sibsp # of siblings / spouses aboard the Titanic (a count, not a yes/no flag)
# parch # of parents / children aboard the Titanic (counts parents as well as children)
# ticket Ticket number
# fare Passenger fare
# cabin Cabin number
# embarked Port of Embarkation C = Cherbourg, Q = Queenstown, S = Southampton
train.info()
""" Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
"""
test.info()
"""
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 418 non-null int64
1 Pclass 418 non-null int64
2 Name 418 non-null object
3 Sex 418 non-null object
4 Age 332 non-null float64
5 SibSp 418 non-null int64
6 Parch 418 non-null int64
7 Ticket 418 non-null object
8 Fare 417 non-null float64
9 Cabin 91 non-null object
10 Embarked 418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB"""
The training data (train.csv) has missing values in Age, Cabin and Embarked.
The test data (test.csv) has missing values in Age, Cabin and Fare.
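The same lists can be computed instead of read off the `info()` output by eye. Below is a minimal sketch, assuming a helper named `missing_summary` (hypothetical, not part of the original notebook). A small hand-made frame with made-up values stands in for the real CSV files so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Count and fraction of missing values, only for columns that have any."""
    counts = df.isnull().sum()
    counts = counts[counts > 0]
    summary = pd.DataFrame({
        "missing": counts,
        "ratio": counts / len(df),
    })
    return summary.sort_values("missing", ascending=False)

# Toy frame (made-up values) standing in for train.csv / test.csv
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})
summary = missing_summary(toy)
print(summary)
```

With the real data loaded, `missing_summary(train)` and `missing_summary(test)` should list the same fields named above.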
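train.describe() # distribution of the numeric features
train.describe(include=['O']) # distribution of the categorical (object) features
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same first-3 / last-3 view
pd.concat([train.head(3), train.tail(3)])
# Plot the fraction of missing values per field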
missing = train.isnull().sum()/len(train)
missing = missing[missing > 0]
missing.sort_values(inplace=True)
missing.plot.bar()
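That completes the first step, inspecting the samples. The main takeaway so far is which fields have missing values: Age and Cabin in both sets, Embarked in the training set only, and Fare in the test set only.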
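To see this combined picture in one table, the missing ratios of the two sets can be placed side by side. This is a rough sketch, assuming a helper named `compare_missing` (hypothetical) and two toy frames with made-up values in place of `train` and `test`.

```python
import numpy as np
import pandas as pd

def compare_missing(train_df, test_df):
    """Fraction of missing values per column, train vs. test, side by side."""
    ratios = pd.DataFrame({
        "train": train_df.isnull().mean(),
        "test": test_df.isnull().mean(),
    })
    # A column that exists in only one frame (e.g. Survived) is NaN on the
    # other side; treat that as "no missing values" when filtering rows.
    has_missing = (ratios.fillna(0) > 0).any(axis=1)
    return ratios[has_missing]

# Toy frames (made-up values) standing in for the real train / test sets
train_toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", np.nan, "S"],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})
test_toy = pd.DataFrame({
    "Age": [np.nan, 47.0, np.nan, 27.0],
    "Embarked": ["Q", "S", "Q", "S"],
    "Fare": [7.83, np.nan, 9.69, 8.66],
})
result = compare_missing(train_toy, test_toy)
print(result)
```

Calling `compare_missing(train, test)` on the real data should keep only the rows for Age, Cabin, Embarked and Fare.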