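Preface
This post runs a decision tree analysis on the Titanic data. It covers extracting the features and the target, splitting the samples, feature engineering, and model training.
I. Introduction to the Titanic data
The data is mainly used to predict whether a passenger survived (the Survived column) from features such as age, sex, and passenger class. In the excerpt below, Pclass carries the 'st' suffix that step 2 appends, which turns the class into a string category.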
PassengerId | Survived | Pclass | … | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|
1 | 0 | 3st | … | 7.2500 | NaN | S |
2 | 1 | 1st | … | 71.2833 | C85 | C |
3 | 1 | 3st | … | 7.9250 | NaN | S |
4 | 1 | 1st | … | 53.1000 | C123 | S |
5 | 0 | 3st | … | 8.0500 | NaN | S |
II. Steps
1. Import libraries
The code is as follows (example):
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier
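2. Load the data
The code is as follows (example):
# Path to the training CSV on the author's machine; adjust it to your own location
data = pd.read_csv('E://titanic//train.csv')
# Turn the numeric class into a string so DictVectorizer one-hot encodes it later
data['Pclass'] = data['Pclass'].astype(str) + 'st'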
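To see what the string conversion does without needing the full file, here is a minimal sketch on an in-memory CSV. The three rows reuse the values from the table above; the raw integer Pclass values and the reduced column set are assumptions made for illustration, not a copy of the real file.

```python
import io

import pandas as pd

# Hypothetical excerpt: only the columns visible in the table, with Pclass
# assumed to be stored as a plain integer before the suffix is appended.
csv_text = """PassengerId,Survived,Pclass,Fare,Cabin,Embarked
1,0,3,7.2500,,S
2,1,1,71.2833,C85,C
3,1,3,7.9250,,S
"""

sample = pd.read_csv(io.StringIO(csv_text))
pclass_dtype_before = sample["Pclass"].dtype  # an integer dtype

# Same conversion as in the post: integer -> string with an 'st' suffix.
sample["Pclass"] = sample["Pclass"].astype(str) + "st"

print(pclass_dtype_before)
print(sample["Pclass"].tolist())
print(sample["Cabin"].isna().sum())  # empty CSV fields are read as NaN
```

Once Pclass is a string, DictVectorizer in step 6 treats it as a categorical feature instead of a number.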
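3. Extract the features and the target
# .copy() gives an independent DataFrame, so later edits do not act on a slice of `data`
x = data[['Pclass','Age','Sex']].copy()
y = data['Survived']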
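4. Handle missing values
# Fill missing ages with the mean age; assign() returns a new DataFrame instead of modifying a slice in place
x = x.assign(Age=x['Age'].fillna(data['Age'].mean()))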
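The mean-fill step can be checked on a tiny frame. This is a sketch on made-up toy rows (the `toy` DataFrame is hypothetical), showing that the missing age is replaced by the mean of the observed ages and that the original frame is left unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows standing in for the three Titanic features.
toy = pd.DataFrame({
    "Pclass": ["3st", "1st", "3st", "1st"],
    "Age": [22.0, np.nan, 26.0, 36.0],
    "Sex": ["male", "female", "female", "female"],
})

# Count missing values per column before filling.
missing_before = toy.isna().sum()

# Series.mean() skips NaN by default, so this is the mean of 22, 26 and 36.
mean_age = toy["Age"].mean()

# assign() builds a new DataFrame; `toy` itself keeps its NaN.
filled = toy.assign(Age=toy["Age"].fillna(mean_age))

print(missing_before)
print(filled)
```

Checking `isna().sum()` before and after is a cheap way to confirm that the fill actually took effect.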
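5. Split the dataset
# Hold out 20% of the samples for testing; random_state makes the split reproducible
x_train,x_test,y_train,y_test = train_test_split(x,y,random_state=22,test_size=0.2)
# DictVectorizer expects a list of dicts, one dict per sample
x_train = x_train.to_dict(orient='records')
x_test = x_test.to_dict(orient='records')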
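The split and the conversion to records can be tried on a few rows. A minimal sketch, assuming hypothetical toy data (`toy_x`, `toy_y`) in place of the real features and target:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy features and target.
toy_x = pd.DataFrame({
    "Pclass": ["3st", "1st", "3st", "1st", "3st"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Sex": ["male", "female", "female", "female", "male"],
})
toy_y = pd.Series([0, 1, 1, 1, 0], name="Survived")

# Same call shape as in the post: 20% test, fixed seed.
x_tr, x_te, y_tr, y_te = train_test_split(
    toy_x, toy_y, test_size=0.2, random_state=22
)

# Each row becomes one {column: value} dict.
records = x_tr.to_dict(orient="records")

print(len(x_tr), len(x_te))
print(records[0])
```

Features and target are split together, so the row labels of `x_tr` and `y_tr` stay aligned.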
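6. Feature engineering: convert categorical values into a one-hot matrix
transfer = DictVectorizer(sparse=False)
# Learn the feature columns from the training set only
x_train = transfer.fit_transform(x_train)
# Reuse the fitted mapping on the test set; refitting here could produce different columns
x_test = transfer.transform(x_test)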
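The reason the test set should only be transformed, not refitted, is easier to see on a small example. This sketch uses hypothetical records, including a made-up '2st' class, and compares reusing the fitted vectorizer with fitting a fresh one on the test records.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical records in the same shape that to_dict(orient='records') produces.
train_records = [
    {"Pclass": "3st", "Age": 22.0, "Sex": "male"},
    {"Pclass": "1st", "Age": 38.0, "Sex": "female"},
    {"Pclass": "2st", "Age": 30.0, "Sex": "female"},
]
test_records = [{"Pclass": "1st", "Age": 35.0, "Sex": "female"}]

vec = DictVectorizer(sparse=False)
train_matrix = vec.fit_transform(train_records)  # learns the column layout
test_matrix = vec.transform(test_records)        # reuses that layout

# String values become one 0/1 column per category; numbers pass through.
print(vec.feature_names_)
print(test_matrix)

# Fitting again on the test records alone yields fewer columns,
# which no longer line up with what the model was trained on.
refit_matrix = DictVectorizer(sparse=False).fit_transform(test_records)
print(train_matrix.shape, refit_matrix.shape)
```

With a large test set the column counts often happen to match, which can hide the problem, so calling `transform` on test data is the safer habit.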
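7. Train the model and score it
# Limit the tree depth to 4 to keep the model small and reduce overfitting
estimator = DecisionTreeClassifier(max_depth=4)
estimator.fit(x_train,y_train)
# Predict survival for the held-out samples
y_pre = estimator.predict(x_test)
print(y_pre)
# score() returns the mean accuracy on the test set
print(estimator.score(x_test,y_test))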
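Beyond the accuracy score, it can help to look at the rules the tree has learned. The sketch below is one way to do that with scikit-learn's `export_text`. The toy records and labels are made up for illustration (labels simply follow sex), so the printed rules say nothing about the real dataset.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy samples; labels are invented and follow the Sex feature.
records = [
    {"Pclass": "3st", "Age": 22.0, "Sex": "male"},
    {"Pclass": "1st", "Age": 38.0, "Sex": "female"},
    {"Pclass": "3st", "Age": 26.0, "Sex": "female"},
    {"Pclass": "1st", "Age": 35.0, "Sex": "female"},
    {"Pclass": "3st", "Age": 35.0, "Sex": "male"},
    {"Pclass": "1st", "Age": 54.0, "Sex": "male"},
]
labels = [0, 1, 1, 1, 0, 0]

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(records)

clf = DecisionTreeClassifier(max_depth=4, random_state=22)
clf.fit(X, labels)

# Render the fitted tree as indented if/else rules using the one-hot column names.
rules = export_text(clf, feature_names=list(vec.feature_names_))
print(rules)

# Accuracy on the training rows themselves (not a test score).
train_accuracy = clf.score(X, labels)
print(train_accuracy)
```

Applied to the model in this post, the same call would be `export_text(estimator, feature_names=list(transfer.feature_names_))`.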