Exploratory Data Analysis (Example 1)

Latest recommended article published on 2025-06-17 10:17:20

weixin_41376658

License: CC 4.0 BY-SA

Categories: Python basics and the four major scientific libraries; feature engineering; data preprocessing. Tags: python

Article link: https://blog.youkuaiyun.com/weixin_41376658/article/details/95969084

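This post describes data processing in Python: the training and test sets are merged into a single DataFrame for feature engineering. It covers data cleaning, including removing constant and duplicated features, determining feature types, and handling categorical features and timestamps, and it outlines parts of the code.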

Concatenate train and test into one DataFrame and do all feature engineering on it.

traintest = pd.concat([train, test], axis = 0)
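
As a minimal sketch of this step, the toy frames below stand in for train and test. The column names and the is_train marker are assumptions made for illustration, not part of the original data. Marking where each row came from before concatenating makes it easy to split the combined frame again once feature engineering is done.

```python
import pandas as pd

# Toy stand-ins for the real train and test sets (hypothetical data).
train = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"], "target": [0, 1, 0]})
test = pd.DataFrame({"a": [4, 5], "b": ["x", "w"]})

# Hypothetical marker column so the two parts can be separated again later.
train["is_train"] = True
test["is_train"] = False

# Stack the rows; ignore_index avoids duplicate index labels.
traintest = pd.concat([train, test], axis=0, ignore_index=True)

# ... feature engineering on traintest would go here ...

# Split back into the two parts; test rows never had a target.
train_fe = traintest[traintest["is_train"]].drop(columns="is_train")
test_fe = traintest[~traintest["is_train"]].drop(columns=["is_train", "target"])
```

Columns that exist in only one of the frames, such as the target here, are filled with NaN for the rows of the other frame.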

Dataset cleaning

Remove constant features

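If every value of a feature is the same, the feature carries no information.

nunique() returns the number of distinct values in each column. A count of 1 means the column holds a single constant value.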

# `dropna = False` makes nunique treat NaNs as a distinct value
feats_counts = train.nunique(dropna = False)

Print the ten smallest counts:

feats_counts.sort_values()[:10]

VAR_0213    1
VAR_0207    1
VAR_0840    1
VAR_0847    1
VAR_1428    1
VAR_1165    2
VAR_0438    2
VAR_1164    2
VAR_1163    2
VAR_1162    2
dtype: int64

We found 5 constant features. Let’s remove them.

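The two calls involved are .loc[feats_counts==1].index.tolist(), which collects the names of the constant columns, and .drop(constant_features, axis=1, inplace=True), which removes them.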

constant_features = feats_counts.loc[feats_counts == 1].index.tolist()
print(constant_features)

traintest.drop(constant_features, axis=1, inplace=True)
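
To see what dropna=False changes, here is a small sketch on made-up data (the frame and its column names are hypothetical). A column that is entirely NaN counts as constant only when NaN is treated as a value, and a column mixing one value with NaN stops looking constant.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "const" has one value, "all_nan" is entirely missing,
# and "some_nan" mixes a single value with NaN.
df = pd.DataFrame({
    "const": [7, 7, 7, 7],
    "all_nan": [np.nan] * 4,
    "some_nan": [1.0, np.nan, 1.0, np.nan],
    "varied": [1, 2, 3, 4],
})

counts_with_nan = df.nunique(dropna=False)  # NaN counted as a distinct value
counts_without_nan = df.nunique()           # NaN ignored (the default)

# Same selection logic as above, applied to the toy frame.
constant_features = counts_with_nan.loc[counts_with_nan == 1].index.tolist()
df_clean = df.drop(columns=constant_features)
```

With the default setting, "some_nan" would have a count of 1 and be dropped as constant, even though the pattern of missing values may itself be informative.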

Remove duplicated features

Fill NaNs with something we can find later if needed.

traintest.fillna('NaN', inplace=True)

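Part of the code is missing here. Roughly, the steps are: first label-encode every column (giving train_enc), then create an empty dictionary dup_cols = {}. Two nested for loops compare the columns pairwise, and whenever two columns are identical the later one is added to the dictionary.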

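import numpy as np
from tqdm.notebook import tqdm  # replaces the deprecated tqdm_notebook

dup_cols = {}

# map each duplicated column c2 to the earlier column c1 it repeats
for i, c1 in enumerate(tqdm(train_enc.columns)):
    for c2 in train_enc.columns[i + 1:]:
        if c2 not in dup_cols and np.all(train_enc[c1] == train_enc[c2]):
            dup_cols[c2] = c1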

Then drop the duplicated columns:

traintest.drop(list(dup_cols.keys()), axis=1, inplace=True)
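
The encoding step is not shown in the source. One possible way to fill it in, offered as an assumption rather than the original code, is to label-encode each column with pd.factorize. Two columns that carry the same information under different labels then receive identical codes, so the pairwise comparison catches them as well as exact copies. The toy frame and the names df, df_enc and df_dedup are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "b" is an exact copy of "a"; "c" carries the same
# information as "a" under different labels; "d" is unrelated.
df = pd.DataFrame({
    "a": [1, 2, 1, 3],
    "b": [1, 2, 1, 3],
    "c": ["x", "y", "x", "z"],
    "d": [5, 5, 6, 7],
})

# Assumed encoding step: factorize numbers the values of a column
# in order of first appearance (0, 1, 2, ...).
df_enc = pd.DataFrame({c: pd.factorize(df[c])[0] for c in df.columns},
                      index=df.index)

dup_cols = {}
for i, c1 in enumerate(df_enc.columns):
    for c2 in df_enc.columns[i + 1:]:
        if c2 not in dup_cols and np.all(df_enc[c1] == df_enc[c2]):
            dup_cols[c2] = c1  # c2 duplicates the earlier column c1

df_dedup = df.drop(columns=list(dup_cols))
```

The double loop is quadratic in the number of columns, which is why the original wraps the outer loop in a progress bar.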

Determine types

Look first at the features with a large number of distinct values relative to the number of rows:

nunique = train.nunique(dropna=False)
mask = (nunique.astype(float)/train.shape[0] > 0.8)
train.loc[:, mask]

mask = (nunique.astype(float)/train.shape[0] < 0.8) & (nunique.astype(float)/train.shape[0] > 0.4)
train.loc[:25, mask]
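
The idea behind these masks can be sketched on synthetic data. The frame below and its column names are made up for illustration; only the 0.8 threshold comes from the text. The ratio of distinct values to rows separates ID-like or continuous columns (ratio near 1) from count-like and categorical ones (ratio near 0).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Synthetic stand-in for train (hypothetical columns).
df = pd.DataFrame({
    "row_id": np.arange(n),                 # unique per row, ratio 1.0
    "count_like": rng.integers(0, 10, n),   # at most 10 distinct values
    "category": rng.choice(["a", "b"], n),  # at most 2 distinct values
})

# Fraction of distinct values per column, NaN counted as a value.
ratio = df.nunique(dropna=False) / df.shape[0]

high_card = ratio[ratio > 0.8].index.tolist()
low_card = ratio[ratio <= 0.8].index.tolist()
```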

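Printing a sample shows that, although the values are large, they are all integers rather than floats.
There are no floating-point variables. There are some count variables, which we will treat as numeric.
Pick one feature at random and look at its value counts: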

train['VAR_0015'].value_counts()

 0.0      102382
 1.0       28045
 2.0        8981
 3.0        3199
 4.0        1274
 5.0         588
 6.0         275
 7.0         166
 8.0          97

Collect the names of the object-dtype columns into a list:

cat_cols = list(train.select_dtypes(include=['object']).columns)

Go through？？？

Replace the 'NaN' placeholders:

train.replace('NaN', -999, inplace=True)

Three helper functions that will be used:

import numpy as np
import matplotlib.pyplot as plt

def autolabel(arrayA):
    '''Label each colored square with the corresponding data value, in white text.'''
    arrayA = np.array(arrayA)
    for i in range(arrayA.shape[0]):
        for j in range(arrayA.shape[1]):
            plt.text(j, i, "%.2f" % arrayA[i, j], ha='center', va='bottom', color='w')

def hist_it(feat):
    '''Overlay normalized histograms of feat for the two classes of the global target Y.'''
    plt.figure(figsize=(16, 4))
    bins = range(int(feat.min()), int(feat.max() + 2))
    # density=True replaces normed=True, which newer matplotlib versions no longer accept
    feat[Y == 0].hist(bins=bins, density=True, alpha=0.8)
    feat[Y == 1].hist(bins=bins, density=True, alpha=0.5)
    plt.ylim((0, 1))

def gt_matrix(feats, sz=16):
    '''Plot, for each feature pair, the fraction of rows where c1 >= c2
    (on and below the diagonal) or c1 > c2 (above the diagonal).'''
    a = []
    for i, c1 in enumerate(feats):
        b = []
        for j, c2 in enumerate(feats):
            # compare only rows where both features are present
            mask = (~train[c1].isnull()) & (~train[c2].isnull())
            if i >= j:
                b.append((train.loc[mask, c1].values >= train.loc[mask, c2].values).mean())
            else:
                b.append((train.loc[mask, c1].values > train.loc[mask, c2].values).mean())
        a.append(b)

    plt.figure(figsize=(sz, sz))
    plt.imshow(a, interpolation='none')
    _ = plt.xticks(range(len(feats)), feats, rotation=90)
    _ = plt.yticks(range(len(feats)), feats, rotation=0)
    autolabel(a)

The matrix shows, for each pair of features, the fraction of rows in which one feature is greater than the other:

# select first 42 numeric features
feats = num_cols[:42]

# build 'mean(feat1 > feat2)' plot
gt_matrix(feats,16)
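
To make the numbers behind the plot concrete, the following sketch computes the same kind of matrix on a tiny made-up frame, without plotting. The columns f1, f2 and f3 are hypothetical, and for simplicity the sketch uses a strict comparison in every cell, whereas gt_matrix uses >= on and below the diagonal.

```python
import pandas as pd

# Hypothetical toy features.
df = pd.DataFrame({
    "f1": [1, 2, 3, 4],
    "f2": [2, 3, 4, 5],   # always f1 + 1
    "f3": [0, 5, 0, 5],
})

feats = ["f1", "f2", "f3"]

# gt.loc[c1, c2] is the fraction of rows where c1 > c2.
gt = pd.DataFrame(
    [[(df[c1] > df[c2]).mean() for c2 in feats] for c1 in feats],
    index=feats, columns=feats,
)
```

A cell equal to 0 or 1 means one feature is never or always greater than the other, which can hint at a structural relationship between the two, as with f1 and f2 here.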

Categorical features

Timestamp handling

Find the timestamp columns, date_cols:

for c in date_cols:
    train[c] = pd.to_datetime(train[c],format = '%d%b%y:%H:%M:%S')
    test[c] = pd.to_datetime(test[c],  format = '%d%b%y:%H:%M:%S')
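
As a small illustration of the format string, the sample values below are made up to match the '%d%b%y:%H:%M:%S' layout (day, abbreviated month name, two-digit year, then time). The derived columns are one assumed way of turning the parsed timestamps into numeric features, not something the original post shows.

```python
import pandas as pd

# Hypothetical sample strings in the '%d%b%y:%H:%M:%S' layout.
raw = pd.Series(["29JAN14:21:16:00", "01FEB14:00:07:00", "30JAN14:15:11:00"])
parsed = pd.to_datetime(raw, format="%d%b%y:%H:%M:%S")

# Assumed follow-up: expand each timestamp into simple numeric parts.
features = pd.DataFrame({
    "year": parsed.dt.year,
    "month": parsed.dt.month,
    "dayofweek": parsed.dt.dayofweek,  # Monday = 0
    "hour": parsed.dt.hour,
})
```

Passing an explicit format avoids per-value format guessing and makes malformed strings fail loudly instead of being parsed inconsistently.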