Concatenate train and test into one dataframe and do all feature engineering on it.
traintest = pd.concat([train, test], axis = 0)
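A minimal sketch of the concatenation on toy data, including one way to split the combined frame back apart afterwards. The toy frames, the is_train marker column, and the target column are assumptions made for illustration; none of them come from the original notes.

```python
import pandas as pd

# Toy stand-ins for the real train/test frames (assumed for illustration).
train = pd.DataFrame({"f1": [1, 2, 3], "target": [0, 1, 0]})
test = pd.DataFrame({"f1": [4, 5]})

# Hypothetical marker column so the rows can be separated again later.
train["is_train"] = 1
test["is_train"] = 0

# ignore_index=True gives the combined frame a fresh 0..n-1 index,
# which avoids duplicate index labels coming from the two sources.
traintest = pd.concat([train, test], axis=0, ignore_index=True)

# ... feature engineering on traintest would go here ...

# Split back into the two original parts.
train_fe = traintest[traintest["is_train"] == 1].drop(columns="is_train")
test_fe = traintest[traintest["is_train"] == 0].drop(columns=["is_train", "target"])
```

Working on the combined frame keeps every transformation identical for both parts, at the cost of having to track which rows came from where.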
Dataset cleaning
Remove constant features
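If a feature has the same value in every row, it carries no information.
nunique() returns the number of distinct values in each column.
A count of 1 means the column holds a single constant value.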
# `dropna = False` makes nunique treat NaNs as a distinct value
feats_counts = train.nunique(dropna = False)
Print a few to take a look:
feats_counts.sort_values()[:10]
VAR_0213 1
VAR_0207 1
VAR_0840 1
VAR_0847 1
VAR_1428 1
VAR_1165 2
VAR_0438 2
VAR_1164 2
VAR_1163 2
VAR_1162 2
dtype: int64
We found 5 constant features. Let’s remove them.
The key calls are .loc[feats_counts==1].index.tolist() to collect the column names and .drop(constant_features, axis=1, inplace=True) to remove them:
constant_features = feats_counts.loc[feats_counts==1].index.tolist()
print(constant_features)
traintest.drop(constant_features,axis = 1,inplace=True)
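To see why dropna=False matters when looking for constant features, here is a small sketch on a toy frame. The frame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame (assumed for illustration): 'const' is truly constant,
# 'const_or_nan' has one value plus missing entries, 'varied' is not constant.
df = pd.DataFrame({
    "const": [7, 7, 7, 7],
    "const_or_nan": [1.0, np.nan, 1.0, np.nan],
    "varied": [1, 2, 3, 4],
})

# Default behaviour ignores NaNs, so 'const_or_nan' looks constant.
counts_default = df.nunique()
# With dropna=False, NaN is counted as its own distinct value.
counts_with_nan = df.nunique(dropna=False)

# Same selection and drop pattern as in the notes, applied to the toy frame.
constant_features = counts_with_nan.loc[counts_with_nan == 1].index.tolist()
df_clean = df.drop(columns=constant_features)
```

With the default setting, a column that is "one value or missing" would be dropped even though the missingness pattern itself may be informative.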
Remove duplicated features
Fill NaNs with something we can find later if needed.
traintest.fillna('NaN', inplace=True)
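Part of the code is missing here. Roughly, the steps are: first encode every column (the code below uses the result as train_enc), then create an empty dictionary, dup_cols = {}. Two nested for loops compare the columns pairwise; if two columns are identical, the later one is added to the dictionary.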
dup_cols = {}
for i, c1 in enumerate(tqdm_notebook(train_enc.columns)):
    for c2 in train_enc.columns[i + 1:]:
        if c2 not in dup_cols and np.all(train_enc[c1] == train_enc[c2]):
            dup_cols[c2] = c1
Then drop them:
traintest.drop(dup_cols.keys(), axis = 1,inplace=True)
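The encoding step is not shown in these notes. One plausible way to do it, sketched here on a toy frame, is pd.factorize, which replaces each value with an integer code in order of first appearance. Using factorize is an assumption, as are the toy frame and the names df_enc and df_dedup.

```python
import numpy as np
import pandas as pd

# Toy frame (assumed for illustration): 'b' repeats 'a',
# and 'c' is 'a' with the values relabelled.
df = pd.DataFrame({
    "a": [1, 2, 1, 3],
    "b": [1, 2, 1, 3],
    "c": ["x", "y", "x", "z"],
    "d": [1, 1, 2, 3],
})

# Encode every column as integer codes in order of first appearance.
df_enc = pd.DataFrame(index=df.index)
for col in df.columns:
    df_enc[col] = pd.factorize(df[col])[0]

# Map each duplicate column to the first column it matches.
dup_cols = {}
for i, c1 in enumerate(df_enc.columns):
    for c2 in df_enc.columns[i + 1:]:
        if c2 not in dup_cols and np.all(df_enc[c1] == df_enc[c2]):
            dup_cols[c2] = c1

df_dedup = df.drop(columns=list(dup_cols.keys()))
```

Comparing encoded columns also catches features that are identical up to a renaming of their values, which a direct comparison of the raw columns would miss.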
Determine types
Features with a large number of distinct values:
nunique = train.nunique(dropna=False)
mask = (nunique.astype(float)/train.shape[0] > 0.8)
train.loc[:, mask]
mask = (nunique.astype(float)/train.shape[0] < 0.8) & (nunique.astype(float)/train.shape[0] > 0.4)
train.loc[:25, mask]
Printing a sample shows that, although the values are large, they are all integers, not floats.
There are no floating-point variables. There are some count variables, which we will treat as numeric.
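The cardinality-ratio check above can be sketched on a toy frame as follows. The frame, the column names, and the whole-number test at the end are assumptions for illustration; the 0.8 and 0.4 thresholds are the ones used in the notes.

```python
import pandas as pd

# Toy frame with 10 rows (assumed for illustration).
df = pd.DataFrame({
    "id_like": list(range(100, 110)),               # 10 distinct values
    "count_like": [0, 0, 0, 1, 1, 2, 3, 4, 5, 0],   # 6 distinct values
    "flag": [0, 1] * 5,                             # 2 distinct values
})

# Share of distinct values relative to the number of rows.
ratio = df.nunique(dropna=False) / df.shape[0]

high_card = ratio[ratio > 0.8].index.tolist()
mid_card = ratio[(ratio > 0.4) & (ratio < 0.8)].index.tolist()

# One way to confirm "integers, not floats": every non-missing value
# has no fractional part.
is_whole = {c: bool((df[c].dropna() % 1 == 0).all()) for c in df.columns}
```

A ratio close to 1 usually points to an identifier or a continuous variable, while a middling ratio with whole-number values is typical of counts.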
Pick one feature at random and look at its value counts:
train['VAR_0015'].value_counts()
0.0 102382
1.0 28045
2.0 8981
3.0 3199
4.0 1274
5.0 588
6.0 275
7.0 166
8.0 97
Collect the object-dtype columns into a list:
cat_cols = list(train.select_dtypes(include=['object']).columns)
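The list num_cols is used later in these notes, but its definition is not shown. A minimal sketch, assuming it simply holds the numeric-dtype columns; the toy frame is made up for illustration.

```python
import pandas as pd

# Toy frame (assumed for illustration). The string column is given an
# explicit object dtype so the selection behaves the same across pandas versions.
df = pd.DataFrame({
    "city": pd.Series(["A", "B", "A"], dtype=object),
    "amount": [1.5, 2.0, 3.5],
    "visits": [1, 2, 3],
})

# Categorical candidates: object-dtype columns, as in the notes.
cat_cols = list(df.select_dtypes(include=["object"]).columns)
# Assumed definition of num_cols: all numeric-dtype columns.
num_cols = list(df.select_dtypes(include=["number"]).columns)
```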
Go through
Replace the NaNs:
train.replace('NaN', -999, inplace=True)
Three helper functions that will be used:
import numpy as np
import matplotlib.pyplot as plt

def autolabel(arrayA):
    ''' Label each colored square with the corresponding data value.
    The text is always drawn in white.
    '''
    arrayA = np.array(arrayA)
    for i in range(arrayA.shape[0]):
        for j in range(arrayA.shape[1]):
            plt.text(j, i, "%.2f" % arrayA[i, j], ha='center', va='bottom', color='w')
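def hist_it(feat):
    # Y is assumed to be the binary target, aligned with feat and defined elsewhere.
    # density=True replaces the normed=True argument removed from newer matplotlib.
    plt.figure(figsize=(16, 4))
    feat[Y == 0].hist(bins=range(int(feat.min()), int(feat.max() + 2)), density=True, alpha=0.8)
    feat[Y == 1].hist(bins=range(int(feat.min()), int(feat.max() + 2)), density=True, alpha=0.5)
    plt.ylim((0, 1))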
def gt_matrix(feats, sz=16):
    a = []
    for i, c1 in enumerate(feats):
        b = []
        for j, c2 in enumerate(feats):
            mask = (~train[c1].isnull()) & (~train[c2].isnull())
            if i >= j:
                b.append((train.loc[mask, c1].values >= train.loc[mask, c2].values).mean())
            else:
                b.append((train.loc[mask, c1].values > train.loc[mask, c2].values).mean())
        a.append(b)
    plt.figure(figsize=(sz, sz))
    plt.imshow(a, interpolation='None')
    _ = plt.xticks(range(len(feats)), feats, rotation=90)
    _ = plt.yticks(range(len(feats)), feats, rotation=0)
    autolabel(a)
The fraction of rows in which one feature is greater than another:
# select first 42 numeric features
feats = num_cols[:42]
# build 'mean(feat1 > feat2)' plot
gt_matrix(feats,16)
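To make the numbers in the plot concrete, here is a sketch that computes the same matrix on a toy frame without any plotting. The frame and the name gt are assumptions for illustration; the comparison rule follows gt_matrix.

```python
import numpy as np
import pandas as pd

# Toy frame (assumed for illustration).
df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 2, 2, 2],
    "c": [5, 6, 7, 8],
})

feats = list(df.columns)
n = len(feats)
gt = np.zeros((n, n))
for i, c1 in enumerate(feats):
    for j, c2 in enumerate(feats):
        # Compare only rows where both features are present.
        mask = df[c1].notnull() & df[c2].notnull()
        x = df.loc[mask, c1].values
        y = df.loc[mask, c2].values
        # On and below the diagonal use >=, above it use >, as in gt_matrix.
        gt[i, j] = (x >= y).mean() if i >= j else (x > y).mean()
```

An entry of exactly 0 or 1 means one feature is never, or always, larger than the other on the compared rows. Here c is larger than a and b in every row, so the c row is all ones.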
Categorical features
Timestamp handling
Find the timestamp columns, date_cols, then parse them:
for c in date_cols:
    train[c] = pd.to_datetime(train[c], format='%d%b%y:%H:%M:%S')
    test[c] = pd.to_datetime(test[c], format='%d%b%y:%H:%M:%S')
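A short sketch of what that format string parses. The sample strings are hypothetical values written to match the layout (day, abbreviated month name, two-digit year, then the time), and the derived columns are just one possible follow-up. Parsing %b assumes English month abbreviations.

```python
import pandas as pd

# Hypothetical raw values in the layout the format string expects.
raw = pd.Series(["29JAN14:21:16:00", "01JUL12:00:00:00"])

# %d = day, %b = abbreviated month name, %y = two-digit year.
parsed = pd.to_datetime(raw, format="%d%b%y:%H:%M:%S")

# Once the column is datetime-typed, calendar parts are available via .dt.
parts = pd.DataFrame({
    "year": parsed.dt.year,
    "month": parsed.dt.month,
    "weekday": parsed.dt.dayofweek,  # Monday = 0
})
```

Passing an explicit format avoids per-value format guessing and makes a mismatched value raise an error instead of being silently misread.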