Porto Seguro’s Safe Driver Prediction
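4. Feature Engineering
The previous part covered data cleaning and exploratory analysis. With that groundwork done, the next step is feature engineering: inspecting, transforming, and selecting the original features, and constructing new features that may help the model. It is an important step in machine learning.
First, load the required modules and data: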
import pandas as pd
import numpy as np
train = pd.read_csv('D:\\CZM\\Kaggle\\FinalWork\\train_after_EDA.csv')
test = pd.read_csv('D:\\CZM\\Kaggle\\FinalWork\\test_after_EDA.csv')
len_train = len(train)
len_test = len(test)
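4.1 Feature engineering based on missing values
While analyzing the missing values, we found that the missingness patterns of some features are correlated, as shown in the figure below:
As analyzed in Section 2.2, there are three groups of features whose missing values tend to occur together within each group: ps_ind_02_cat, ps_ind_04_cat and ps_car_01_cat; ps_car_07_cat and ps_ind_05_cat; ps_car_03_cat and ps_car_05_cat. Based on this, we try to construct the following three features:
MissingTotal_1: the number of missing values a sample has across the three features ps_ind_02_cat, ps_ind_04_cat and ps_car_01_cat.
MissingTotal_2: the number of missing values a sample has across the two features ps_car_07_cat and ps_ind_05_cat.
MissingTotal_3: the number of missing values a sample has across the two features ps_car_03_cat and ps_car_05_cat.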
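The grouping above rests on missing values co-occurring within each group. A minimal sketch of how that co-occurrence could be checked is given here; the helper name `missing_indicator_corr` and the toy data are assumptions made for illustration, not part of the original analysis. It converts each column into a 0/1 "is missing" indicator (missing values are encoded as -1) and computes the correlation between the indicators.

```python
import pandas as pd

def missing_indicator_corr(df, cols, missing_value=-1):
    """Correlation matrix of the 0/1 'is missing' indicators of `cols`.

    Hypothetical helper for illustration; assumes missing values are
    encoded as `missing_value` (here -1).
    """
    indicators = (df[cols] == missing_value).astype(int)
    return indicators.corr()

# Toy data made up for illustration: a and b are always missing together,
# while c is missing on different rows.
toy = pd.DataFrame({
    "a": [-1, 0, 1, -1, 2, 0],
    "b": [-1, 3, 1, -1, 0, 2],
    "c": [0, -1, 1, 2, -1, 0],
})
corr = missing_indicator_corr(toy, ["a", "b", "c"])
print(corr.round(2))
```

On the real data, the same call could be applied to the seven ps_*_cat columns listed above; a high correlation between two indicators suggests that their missing values tend to appear on the same samples.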
The code that builds these three features is as follows:
# Build features from the missing values (missing values are encoded as -1):
train['MissingTotal_1'] = np.zeros(len_train)
train['MissingTotal_2'] = np.zeros(len_train)
train['MissingTotal_3'] = np.zeros(len_train)
## MissingTotal_1: ps_ind_02_cat, ps_ind_04_cat, ps_car_01_cat
## (.loc is used instead of chained indexing so the update is applied to train itself)
train.loc[train.ps_ind_02_cat == -1, 'MissingTotal_1'] += 1
train.loc[train.ps_ind_04_cat == -1, 'MissingTotal_1'] += 1
train.loc[train.ps_car_01_cat == -1, 'MissingTotal_1'] += 1
## MissingTotal_2: ps_car_07_cat and ps_ind_05_cat
train.loc[train.ps_car_07_cat == -1, 'MissingTotal_2'] += 1
train.loc[train.ps_ind_05_cat == -1, 'MissingTotal_2'] += 1
## MissingTotal_3: ps_car_03_cat and ps_car_05_cat
train.loc[train.ps_car_03_cat == -1, 'MissingTotal_3'] += 1
train.loc[train.ps_car_05_cat == -1, 'MissingTotal_3'] += 1
## Apply the same processing to the test set:
test['MissingTotal_1'] = np.zeros(len_test)
test['MissingTotal_2'] = np.zeros(len_test)
test['MissingTotal_3'] = np.zeros(len_test)
## MissingTotal_1: ps_ind_02_cat, ps_ind_04_cat, ps_car_01_cat
test.loc[test.ps_ind_02_cat == -1, 'MissingTotal_1'] += 1
test.loc[test.ps_ind_04_cat == -1, 'MissingTotal_1'] += 1
test.loc[test.ps_car_01_cat == -1, 'MissingTotal_1'] += 1
## MissingTotal_2: ps_car_07_cat and ps_ind_05_cat
test.loc[test.ps_car_07_cat == -1, 'MissingTotal_2'] += 1
test.loc[test.ps_ind_05_cat == -1, 'MissingTotal_2'] += 1
## MissingTotal_3: ps_car_03_cat and ps_car_05_cat
test.loc[test.ps_car_03_cat == -1, 'MissingTotal_3'] += 1
test.loc[test.ps_car_05_cat == -1, 'MissingTotal_3'] += 1
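The same counts can also be computed without repeating one statement per column. The following is a minimal sketch, not the original method: the helper name `add_missing_totals` and the toy data are assumptions introduced for illustration, while the column groups come from the three feature definitions. It compares each group of columns with -1 and sums the resulting booleans row by row.

```python
import pandas as pd

# Column groups taken from the definitions of the three features.
MISSING_GROUPS = {
    "MissingTotal_1": ["ps_ind_02_cat", "ps_ind_04_cat", "ps_car_01_cat"],
    "MissingTotal_2": ["ps_car_07_cat", "ps_ind_05_cat"],
    "MissingTotal_3": ["ps_car_03_cat", "ps_car_05_cat"],
}

def add_missing_totals(df, groups=MISSING_GROUPS, missing_value=-1):
    """Return a copy of df with one missing-count column per group.

    Hypothetical helper for illustration; assumes missing values are
    encoded as `missing_value` (here -1).
    """
    out = df.copy()
    for name, cols in groups.items():
        # Boolean frame (True where missing), summed across columns per row.
        out[name] = (out[cols] == missing_value).sum(axis=1)
    return out

# Toy data made up for illustration (three samples).
toy = pd.DataFrame({
    "ps_ind_02_cat": [-1, 1, 2],
    "ps_ind_04_cat": [-1, 0, 1],
    "ps_car_01_cat": [3, -1, 5],
    "ps_car_07_cat": [-1, 1, 1],
    "ps_ind_05_cat": [-1, 0, 0],
    "ps_car_03_cat": [-1, -1, 1],
    "ps_car_05_cat": [0, -1, 1],
})
result = add_missing_totals(toy)
print(result[list(MISSING_GROUPS)])
```

Calling the helper once on train and once on test would keep the two sets consistent. Note that the resulting columns are integers, whereas the step-by-step version initialised with np.zeros produces floats.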
Next, we analyze the relationship between the new variables and the target variable:
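## Plot the relationship between the three features and the target variable:
import matplotlib.pyplot as plt
# Assumed definition: Missing_Total is not defined in the original snippet
Missing_Total = ['MissingTotal_1', 'MissingTotal_2', 'MissingTotal_3']
k = 0
plt.figure(figsize=(20,10))
for x in Missing_Total:
    k = k+1
    plt.subplot(1,3,k)
    names = []; prop = []
    va = train[x].value_counts().index
    for name in va:
        names.append('feature:'+str(name))
        # Count target == 1 and target == 0 among samples with this feature value
        props = train[train[x]==name].target.value_counts()
        prop_1 = 0; prop_0 = 0
        if 1 in props.index: prop_1 = props[1]
        if 0 in props.index: prop_0 = props[0]
        prop.append(prop_1/(prop_1+prop_0))
    plt.ylabel('Proportion Of target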