Exploratory Data Analysis (Example 1)


Concatenate train and test into one dataframe and do all feature engineering on it.

traintest = pd.concat([train, test], axis = 0)
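This concat-then-split pattern can be checked on toy data (a minimal sketch; the column names and the square feature are illustrative, not from the post):

```python
import pandas as pd

# Toy train/test frames standing in for the competition data
train = pd.DataFrame({'VAR_A': [1, 2, 3], 'target': [0, 1, 0]})
test = pd.DataFrame({'VAR_A': [4, 5]})

# Stack rows so every transform is applied identically to both sets
traintest = pd.concat([train.drop('target', axis=1), test],
                      axis=0, ignore_index=True)

# Example feature engineered once on the combined frame
traintest['VAR_A_sq'] = traintest['VAR_A'] ** 2

# Split back using the original row counts
train_fe = traintest.iloc[:len(train)]
test_fe = traintest.iloc[len(train):]
```

Splitting by row count only works because `concat` with `axis=0` preserves row order, which is why the split uses `iloc` rather than the index.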

Dataset cleaning

Remove constant features

If a feature takes the same value in every row, it carries no information.

What does `.nunique()` do?

It returns the number of distinct values in a column; if that number is 1, the column holds a single constant value.

# `dropna = False` makes nunique treat NaNs as a distinct value
feats_counts = train.nunique(dropna = False)

Print the smallest counts to check:

feats_counts.sort_values()[:10]
VAR_0213    1
VAR_0207    1
VAR_0840    1
VAR_0847    1
VAR_1428    1
VAR_1165    2
VAR_0438    2
VAR_1164    2
VAR_1163    2
VAR_1162    2
dtype: int64

We found 5 constant features. Let’s remove them.


constant_features = feats_counts.loc[feats_counts == 1].index.tolist()
print(constant_features)


traintest.drop(constant_features,axis = 1,inplace=True)

Remove duplicated features

Fill NaNs with something we can find later if needed.

  traintest.fillna('NaN', inplace=True)

Part of the code is missing here. Roughly, the steps are: label-encode every column, create an empty dictionary `dup_cols = {}`, then compare columns pairwise with two nested for loops, recording a column in the dictionary whenever it is identical to an earlier one.
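The missing encoding step might look like the following sketch: label-encode every column with `pd.factorize` so that string and numeric columns alike become comparable integer arrays (the toy frame and the name `train_enc` are assumptions matching the loop that follows):

```python
import pandas as pd

# Toy frame; in the post this would be the NaN-filled `traintest`
traintest = pd.DataFrame({'a': ['x', 'y', 'x'],
                          'b': ['x', 'y', 'x'],   # same values as 'a'
                          'c': [1, 2, 3]})

# Encode each column to integer codes; identical columns get identical codes
train_enc = pd.DataFrame(index=traintest.index)
for col in traintest.columns:
    train_enc[col] = pd.factorize(traintest[col])[0]
```

Encoding first makes the pairwise comparison below cheap and type-agnostic: two columns are duplicates exactly when their code arrays match.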

from tqdm import tqdm_notebook  # progress bar over the outer loop

dup_cols = {}

for i, c1 in enumerate(tqdm_notebook(train_enc.columns)):
    for c2 in train_enc.columns[i + 1:]:
        # record c2 -> c1 the first time c2 matches an earlier column
        if c2 not in dup_cols and np.all(train_enc[c1] == train_enc[c2]):
            dup_cols[c2] = c1

Then drop them:

traintest.drop(list(dup_cols.keys()), axis=1, inplace=True)
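Put together, the duplicate-column scan can be verified on a toy frame (a sketch; the progress bar is omitted):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3],
                   'b': [1, 2, 3],   # exact duplicate of 'a'
                   'c': [9, 8, 7]})

dup_cols = {}
cols = df.columns
for i, c1 in enumerate(cols):
    for c2 in cols[i + 1:]:
        # keep the first column, map each duplicate to it
        if c2 not in dup_cols and np.all(df[c1] == df[c2]):
            dup_cols[c2] = c1

df = df.drop(list(dup_cols.keys()), axis=1)
```

Note the scan is O(n²) in the number of columns, which is why the post encodes columns to integers first: each comparison is then a fast element-wise array check.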

Determine types

Look at features with a large number of distinct values:

nunique = train.nunique(dropna=False)

# features where more than 80% of the values are distinct
mask = (nunique.astype(float) / train.shape[0] > 0.8)
train.loc[:, mask]

# features where 40%-80% of the values are distinct
mask = (nunique.astype(float) / train.shape[0] < 0.8) & (nunique.astype(float) / train.shape[0] > 0.4)
train.loc[:25, mask]

Printing a sample shows that although the values are large, they are all integers, not floats.
there are no floating point variables, there are some counts variables, which we will treat as numeric.
Pick one feature at random and look at its value counts:

train['VAR_0015'].value_counts()
 0.0      102382
 1.0       28045
 2.0        8981
 3.0        3199
 4.0        1274
 5.0         588
 6.0         275
 7.0         166
 8.0          97

Collect the object-dtype columns into a list:

cat_cols = list(train.select_dtypes(include=['object']).columns)

Go through the remaining features one by one.

Replace the `'NaN'` placeholder strings with -999:

train.replace('NaN', -999, inplace=True)

Three helper functions we will use:

def autolabel(arrayA):
    ''' Label each colored square with the corresponding data value, in white text. '''
    arrayA = np.array(arrayA)
    for i in range(arrayA.shape[0]):
        for j in range(arrayA.shape[1]):
            plt.text(j, i, "%.2f" % arrayA[i, j], ha='center', va='bottom', color='w')

def hist_it(feat):
    plt.figure(figsize=(16, 4))
    # `density=True` replaces the `normed=True` argument removed from matplotlib
    feat[Y == 0].hist(bins=range(int(feat.min()), int(feat.max() + 2)), density=True, alpha=0.8)
    feat[Y == 1].hist(bins=range(int(feat.min()), int(feat.max() + 2)), density=True, alpha=0.5)
    plt.ylim((0, 1))
    
def gt_matrix(feats,sz=16):
    a = []
    for i,c1 in enumerate(feats):
        b = [] 
        for j,c2 in enumerate(feats):
            mask = (~train[c1].isnull()) & (~train[c2].isnull())
            if i>=j:
                b.append((train.loc[mask,c1].values>=train.loc[mask,c2].values).mean())
            else:
                b.append((train.loc[mask,c1].values>train.loc[mask,c2].values).mean())

        a.append(b)

    plt.figure(figsize = (sz,sz))
    plt.imshow(a, interpolation = 'None')
    _ = plt.xticks(range(len(feats)),feats,rotation = 90)
    _ = plt.yticks(range(len(feats)),feats,rotation = 0)
    autolabel(a)

The fraction of rows in which one feature is greater than (or equal to) another:

# select first 42 numeric features
feats = num_cols[:42]

# build 'mean(feat1 > feat2)' plot
gt_matrix(feats,16)

Categorical features
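The post leaves this section without code. A typical first look at the categorical columns is a frequency table per column (a sketch on toy data; the column names and values are illustrative):

```python
import pandas as pd

# Toy stand-in for the object-dtype columns collected in cat_cols
train = pd.DataFrame({'VAR_0001': ['H', 'R', 'H', 'H', 'Q'],
                      'VAR_0005': ['C', 'C', 'B', 'C', 'N']})

# Frequency table for each object-dtype column
for c in train.select_dtypes(include=['object']).columns:
    print(train[c].value_counts())
```

Value counts quickly reveal dominant categories, rare levels worth grouping, and columns that are effectively binary.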

Handling timestamps

Find the timestamp columns and collect them in `date_cols`:

for c in date_cols:
    train[c] = pd.to_datetime(train[c],format = '%d%b%y:%H:%M:%S')
    test[c] = pd.to_datetime(test[c],  format = '%d%b%y:%H:%M:%S')
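Once parsed, the timestamps can be expanded into numeric features. A common follow-up (my sketch, not from the post) is extracting calendar components; the column name is illustrative:

```python
import pandas as pd

# Toy column in the '%d%b%y:%H:%M:%S' format used above
train = pd.DataFrame({'VAR_0073': ['12JAN12:00:00:00', '03FEB12:10:30:00']})
train['VAR_0073'] = pd.to_datetime(train['VAR_0073'], format='%d%b%y:%H:%M:%S')

# Calendar components as numeric features a model can consume
train['VAR_0073_month'] = train['VAR_0073'].dt.month
train['VAR_0073_weekday'] = train['VAR_0073'].dt.weekday
```

Differences between pairs of date columns (e.g. `(d2 - d1).dt.days`) are often even more informative than the raw components.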