Data Mining from Scratch, Task 2: Data Analysis

import warnings
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Used to visualize missing data
import missingno as msno

# Suppress all warnings
warnings.filterwarnings('ignore')

3. Load the data

# Note: the fields are separated by spaces
train_df = pd.read_csv(r'./data/train.csv', sep = ' ')
test_df = pd.read_csv(r'./data/testA.csv', sep = ' ')

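4. Data overview

4.1 Check the data size

Both pandas DataFrames and NumPy arrays expose a shape attribute (an attribute, not a method, so it takes no parentheses) that reports the overall size of the data as (rows, columns).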

print(train_df.shape)

Output:

(150000, 31)
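As a minimal sketch of the same idea, the snippet below reads shape from a small DataFrame and from the NumPy array behind it. The names toy_df and toy_arr and their values are made up for illustration and are not part of the competition data.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration): 3 rows, 2 columns
toy_df = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})
toy_arr = toy_df.to_numpy()

# shape is an attribute, so it is read without parentheses
n_rows, n_cols = toy_df.shape
print(toy_df.shape)   # (rows, columns) of the DataFrame
print(toy_arr.shape)  # same dimensions for the underlying array
```

Unpacking the tuple into n_rows and n_cols is convenient when only one of the two numbers is needed later.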

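4.2 Look at sample rows

A pandas DataFrame provides head() and tail() to show the first and last few rows. Both accept the number of rows to show, which defaults to 5.

Here the first 5 and last 5 rows are combined into one display. Because the data has many features, the table is first split by columns and each slice is displayed separately:

# DataFrame.append() was removed in pandas 2.0, so pd.concat() is used instead
pd.concat([train_df.iloc[:, :11].head(), train_df.iloc[:, :11].tail()])

pd.concat([train_df.iloc[:, 11:22].head(), train_df.iloc[:, 11:22].tail()])

pd.concat([train_df.iloc[:, 22:].head(), train_df.iloc[:, 22:].tail()])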

The first 11 columns are displayed as follows: [figure: first and last five rows of the first 11 columns]

4.3 Summary statistics and data types

For a pandas DataFrame, describe() returns summary statistics and info() reports the column data types and non-null counts.

train_df.describe()

train_df.info()

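5. Detect and handle missing values and outliers

5.1 Missing-value counts

The isnull() method of a pandas DataFrame returns a table of booleans marking which positions hold null values. That form is not convenient for counting, but chaining sum() onto it gives the number of missing values per column.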

train_df.isnull().sum()

Result:

SaleID                  0
name                    0
regDate                 0
model                   1
brand                   0
bodyType             4506
fuelType             8680
gearbox              5981
power                   0
kilometer               0
notRepairedDamage       0
regionCode              0
seller                  0
offerType               0
creatDate               0
price                   0
v_0                     0
v_1                     0
v_2                     0
v_3                     0
v_4                     0
v_5                     0
v_6                     0
v_7                     0
v_8                     0
v_9                     0
v_10                    0
v_11                    0
v_12                    0
v_13                    0
v_14                    0
dtype: int64
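Raw counts are easier to judge next to the share of rows they represent. The sketch below computes both for a toy table; the column values and the names missing_count, missing_ratio and summary are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration)
toy_df = pd.DataFrame({
    'bodyType': [1.0, np.nan, 3.0, np.nan],
    'gearbox': [0.0, 1.0, np.nan, 1.0],
    'power': [60, 75, 90, 110],
})

# sum() of the boolean mask counts the nulls, mean() gives their fraction
missing_count = toy_df.isnull().sum()
missing_ratio = toy_df.isnull().mean()

# Keep only columns that actually have missing values, worst first
summary = pd.DataFrame({'count': missing_count, 'ratio': missing_ratio})
summary = summary[summary['count'] > 0].sort_values('ratio', ascending=False)
print(summary)
```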

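5.2 Visualize the missing values

Visualizing the missing values makes it easy to see which columns contain nulls. If only a few values are missing, filling them in is the usual choice. Tree-based models such as LightGBM (lgb) can handle missing values natively, so the gaps can be left for the trees to deal with. If a column has too many missing values, consider dropping it.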
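The fill-or-drop rule of thumb above could be sketched as follows. The cutoff DROP_THRESHOLD, the toy columns, and the choice of the mode as the fill value are all assumptions for illustration; the source does not prescribe a specific threshold or fill strategy.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration)
toy_df = pd.DataFrame({
    'mostly_missing': [np.nan, np.nan, np.nan, 1.0],
    'few_missing': [2.0, np.nan, 2.0, 3.0],
    'complete': [1, 2, 3, 4],
})

DROP_THRESHOLD = 0.5  # hypothetical cutoff: drop columns with more than 50% nulls

# Drop the columns whose share of nulls exceeds the cutoff
ratio = toy_df.isnull().mean()
to_drop = ratio[ratio > DROP_THRESHOLD].index.tolist()
cleaned = toy_df.drop(columns=to_drop)

# Fill the remaining gaps with each column's most frequent value
for col in cleaned.columns:
    if cleaned[col].isnull().any():
        cleaned[col] = cleaned[col].fillna(cleaned[col].mode().iloc[0])

print(to_drop)
print(cleaned)
```

When a tree-based model is used, the filling loop can simply be skipped.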

5.2.1 Using pandas' built-in plotting

Put simply, this plots the missing-value counts computed above.

missing = train_df.isnull().sum()
missing = missing[missing > 0]
missing.sort_values(ascending = True, inplace = True)
missing.plot.bar()
plt.show()

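5.2.2 Using missingno

missingno is a library dedicated to visualizing missing values, and it offers several ways to depict them. See [1] for details. Here a matrix plot and a bar chart are used to show the missing values.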

msno.matrix(train_df.sample(100))

msno.bar(train_df.sample(100))

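5.3 Find and convert non-standard placeholder values

pandas has a dtype called object. It can hold arbitrary Python objects, but in practice it usually means a column of strings. In this data only one feature has the object dtype: notRepairedDamage. Sometimes missing values in an object column are encoded with a special character, so it is worth checking whether such a placeholder exists and converting it to a proper null value.

For a pandas Series, value_counts() reports which distinct values occur and how many times each one appears.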

train_df['notRepairedDamage'].value_counts()

Result:

0.0    111361
-       24324
1.0     14315
Name: notRepairedDamage, dtype: int64

The Counter class in the collections module offers similar functionality.
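For reference, a minimal Counter sketch is shown below. The list values imitates the kind of strings found in the column but is made up for illustration.

```python
from collections import Counter

# Toy values (made up for illustration, not the real column)
values = ['0.0', '-', '0.0', '1.0', '0.0', '-']

# Counter maps each distinct value to the number of times it occurs
counts = Counter(values)

# most_common() lists (value, count) pairs from most to least frequent
print(counts.most_common())
```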

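Judging from the output, the '-' character stands for a missing value, so it should be converted to a standard null. The replace() method of the Series does this.

# Assign the result back instead of using inplace = True on a column selection,
# which may not modify train_df in recent pandas versions
train_df['notRepairedDamage'] = train_df['notRepairedDamage'].replace('-', np.nan)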
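If the column should also become numeric, one possible alternative is pd.to_numeric with errors='coerce', which turns anything unparseable (including '-') into NaN in a single step. This is a swapped-in variant, not the method used in this post, and the toy Series below assumes the values are stored as strings.

```python
import pandas as pd

# Toy column (made up for illustration); values are assumed to be strings
toy = pd.Series(['0.0', '-', '1.0', '0.0'], name='notRepairedDamage')

# errors='coerce' converts unparseable entries such as '-' to NaN
cleaned = pd.to_numeric(toy, errors='coerce')

print(cleaned.dtype)           # numeric dtype instead of object
print(cleaned.isnull().sum())  # number of values that became NaN
```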

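6. Understand the target distribution and adjust it

6.1 Which distribution does the target follow?

This part requires scipy.stats, which provides a range of probability distributions that are used for the fitted curves in the plots below.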

import scipy.stats as st

y = train_df['price']

plt.figure(1)
plt.title('Johnson SU')
sns.distplot(y, kde = False, fit = st.johnsonsu)

plt.figure(2)
plt.title('Normal')
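# kde: the kernel density estimate computed from the data itself
# fit: a parametric distribution fitted to the data
# The closer the two curves, the more likely the data follows that distribution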
sns.distplot(y, kde = True, fit = st.norm)

plt.figure(3)
plt.title('Log Normal')
sns.distplot(y, kde = False, fit = st.lognorm)

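Looking at the three plots, price clearly does not follow a normal distribution, so it should be transformed to be closer to normal. For why such a transformation is wanted and how to carry it out, see [2]. For seaborn's plotting functions and the meaning of each plot type, see [3].

6.2 Skewness and kurtosis of the target

Skewness: a statistic describing the shape of a distribution; put simply, how asymmetric the data is. A long left tail means negative skew, and a long right tail means positive skew.

Kurtosis: describes how sharp the peak of the distribution is. Larger values mean a sharper peak and heavier tails, smaller values a flatter shape. pandas' kurt() returns excess kurtosis, for which 0 corresponds to a normal distribution.

# Skewness
print("Skewness: %f" % train_df['price'].skew())
# Kurtosis
print("Kurtosis: %f" % train_df['price'].kurt())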

Result:

Skewness: 3.346487
Kurtosis: 18.995183
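To get a feel for what these numbers mean, the sketch below compares a symmetric sample with a right-skewed one. Both samples are synthetic (drawn with a fixed seed) and the names symmetric and right_skewed are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic samples (illustration only): one symmetric, one with a long right tail
rng = np.random.default_rng(0)
symmetric = pd.Series(rng.normal(size=10_000))
right_skewed = pd.Series(rng.lognormal(size=10_000))

# skew() near 0 means symmetric; kurt() is excess kurtosis, near 0 for normal data
for name, s in [('normal', symmetric), ('lognormal', right_skewed)]:
    print(f"{name}: skew={s.skew():.3f}, excess kurtosis={s.kurt():.3f}")
```

The price values above (skewness about 3.3, kurtosis about 19) resemble the second, long-tailed case far more than the first.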

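6.3 Frequency counts of the target and transforming toward a normal distribution

Compared with looking at the fitted distributions, looking at frequency counts is a coarser-grained view.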

plt.hist(train_df['price'], orientation = 'vertical', histtype = 'bar', color = 'red')
plt.show()

Next, apply the natural logarithm (base e) to the target and check the result again.

plt.hist(np.log(train_df['price']), orientation = 'vertical', \
         histtype = 'bar', color = 'red') 
plt.show()

The log-transformed data is now clearly much closer to a normal distribution.
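A model trained on the log of the price predicts log-prices, so the transform has to be inverted afterwards. The sketch below uses np.log1p and np.expm1, a variant of the np.log call above that stays finite when a value is 0. Whether the real price column contains zeros is not stated in this post, so that is an assumption, as are the toy values.

```python
import numpy as np
import pandas as pd

# Toy prices (made up for illustration), including a 0 to show why log1p helps
toy_price = pd.Series([0, 150, 1200, 4500, 99999], dtype='float64')

# log1p(x) = log(1 + x) is finite at 0, unlike log(x)
log_price = np.log1p(toy_price)

# expm1 is the exact inverse, used to map predictions back to prices
restored = np.expm1(log_price)

print(toy_price.skew(), log_price.skew())
print(restored.round(2).tolist())
```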

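7. Understand the feature distributions

7.1 Skewness and kurtosis of each feature

As with the Series in 6.2, a pandas DataFrame also has skew() and kurt() methods, which return the skewness and kurtosis of every column.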

print(train_df.skew())
print(train_df.kurt())

SaleID               6.017846e-17
name                 5.576058e-01
regDate              2.849508e-02
model                1.484388e+00
brand                1.150760e+00
bodyType             9.915299e-01
fuelType             1.595486e+00
gearbox              1.317514e+00
power                6.586318e+01
kilometer           -1.525921e+00
notRepairedDamage    2.430640e+00
regionCode           6.888812e-01
creatDate           -7.901331e+01
price                3.346487e+00
v_0                 -1.316712e+00
v_1                  3.594543e-01
v_2                  4.842556e+00
v_3                  1.062920e-01
v_4                  3.679890e-01
v_5                 -4.737094e+00
v_6                  3.680730e-01
v_7                  5.130233e+00
v_8                  2.046133e-01
v_9                  4.195007e-01
v_10                 2.522046e-02
v_11                 3.029146e+00
v_12                 3.653576e-01
v_13                 2.679152e-01
v_14                -1.186355e+00
dtype: float64
SaleID                 -1.200000
name                   -1.039945
regDate                -0.697308
model                   1.740483
brand                   1.076201
bodyType                0.206937
fuelType                5.880049
gearbox                -0.264161
power                5733.451054
kilometer               1.141934
notRepairedDamage       3.908072
regionCode             -0.340832
creatDate            6881.080328
price                  18.995183
v_0                     3.993841
v_1                    -1.753017
v_2                    23.860591
v_3                    -0.418006
v_4                    -0.197295
v_5                    22.934081
v_6                    -1.742567
v_7                    25.845489
v_8                    -0.636225
v_9                    -0.321491
v_10                   -0.577935
v_11                   12.568731
v_12                    0.268937
v_13                   -0.438274
v_14                    2.393526
dtype: float64

7.2 Categorical feature analysis

7.2.1 nunique of each feature

The nunique() method of a pandas Series returns the number of distinct values in the Series.

categorical_features = ['name', 'model', 'brand', 'bodyType', 'fuelType', \
                        'gearbox', 'notRepairedDamage', 'regionCode',]

for cat_fea in categorical_features:
    print(cat_fea, train_df[cat_fea].nunique())

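A pandas Series also has a unique() method, which returns all the distinct values that occur.

In addition, collections.Counter combines the two: its keys are the distinct values and its values are their counts.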
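The difference between these calls is easiest to see side by side. The toy Series below is made up for illustration.

```python
from collections import Counter

import pandas as pd

# Toy categorical values (made up for illustration)
toy = pd.Series(['a', 'b', 'a', 'c', 'a'])

print(toy.nunique())       # how many distinct values there are
print(toy.unique())        # the distinct values themselves, in order of appearance
print(toy.value_counts())  # how often each distinct value occurs

# Counter holds both pieces of information in one mapping
counter = Counter(toy)
print(len(counter), dict(counter))
```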

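7.2.2 Box plots of the categorical features

A box plot shows the distribution of quantitative data. The box marks the quartiles of the dataset and the whiskers show the rest of the distribution. The plot shows the maximum, minimum, median, and upper and lower quartiles of a group of data, and points that lie beyond a multiple of the interquartile range are identified as outliers [3].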
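The outlier rule a box plot applies can be sketched numerically. The factor 1.5 is the common convention (and seaborn's default for the whiskers); the toy values and variable names are made up for illustration.

```python
import pandas as pd

# Toy values (made up for illustration) with one obvious outlier
toy = pd.Series([10, 12, 13, 14, 15, 16, 18, 95])

# Quartiles and the interquartile range (IQR)
q1, q3 = toy.quantile([0.25, 0.75])
iqr = q3 - q1

# Points more than 1.5 * IQR beyond the quartiles are flagged as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = toy[(toy < lower) | (toy > upper)]

print(q1, q3, iqr)
print(outliers.tolist())
```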

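In this data, name and regionCode take a very large number of sparsely populated values, so there is no need to draw box plots for them, and it is better not to (box plots for such sparse features take a long time to draw and are not informative).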

categorical_features = ['model',
 'brand',
 'bodyType',
 'fuelType',
 'gearbox',
 'notRepairedDamage']

for c in categorical_features:
    train_df[c] = train_df[c].astype('category')
    # If the column has null values, add a new category MISSING to represent them
    if train_df[c].isnull().any():
        train_df[c] = train_df[c].cat.add_categories(['MISSING'])
        train_df[c] = train_df[c].fillna('MISSING')

def boxplot(x, y, **kwargs):
    sns.boxplot(x = x, y = y)
    x = plt.xticks(rotation = 90)

f = pd.melt(train_df, id_vars = ['price'], value_vars = categorical_features)
# seaborn 0.9 renamed the size parameter to height
g = sns.FacetGrid(f, col = "variable", col_wrap = 2, sharex = False, \
                  sharey = False, height = 5)
g = g.map(boxplot, "value", "price")

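7.2.3 Violin plots of the categorical features

A violin plot is similar to a box plot. Unlike a box plot, where every component corresponds to actual data points, a violin plot is based on a kernel density estimate of the underlying distribution, so it shows where the density is high. In the plot, the white dot is the median, the black box spans the lower to upper quartile, and the thin black line represents the whiskers. The outer shape is the kernel density estimate [3].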

catg_list = categorical_features
target = 'price'
for catg in catg_list :
    sns.violinplot(x = catg, y = target, data = train_df)
    plt.show()

7.2.4 Bar plots of the categorical features

def bar_plot(x, y, **kwargs):
    sns.barplot(x = x, y = y)
    x = plt.xticks(rotation = 90)

f = pd.melt(train_df, id_vars=['price'], value_vars = categorical_features)
g = sns.FacetGrid(f, col = "variable",  col_wrap = 2, \
                  sharex = False, sharey = False, height = 5)
g = g.map(bar_plot, "value", "price")

7.2.5 Count plots of the categorical features

def count_plot(x,  **kwargs):
    sns.countplot(x = x)
    x = plt.xticks(rotation = 90)

f = pd.melt(train_df,  value_vars = categorical_features)
g = sns.FacetGrid(f, col = "variable",  col_wrap = 2, sharex=False, \
                  sharey = False, height = 5)
g = g.map(count_plot, "value")

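7.3 Numeric feature analysis

7.3.1 Correlation analysis

Unlike categorical features, the values of numeric features usually carry real meaning, and their magnitudes affect model training. It is therefore worth analyzing the correlations among the numeric features, which helps with later feature engineering.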

numeric_features = ['power', 'kilometer', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 
                    'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 
                    'v_14' ]

numeric_features.append('price')
price_numeric = train_df[numeric_features]
correlation = price_numeric.corr()
print(correlation['price'].sort_values(ascending = False), '\n')

Result:

price        1.000000
v_12         0.692823
v_8          0.685798
v_0          0.628397
power        0.219834
v_5          0.164317
v_2          0.085322
v_6          0.068970
v_1          0.060914
v_14         0.035911
v_13        -0.013993
v_7         -0.053024
v_4         -0.147085
v_9         -0.206205
v_10        -0.246175
v_11        -0.275320
kilometer   -0.440519
v_3         -0.730946
Name: price, dtype: float64
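By default corr() computes the Pearson correlation coefficient. The sketch below spells out that formula on toy data and checks it against pandas. The two columns and their values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration): price falls as kilometer rises
toy_df = pd.DataFrame({
    'kilometer': [1.0, 5.0, 9.0, 12.0, 15.0],
    'price': [9000.0, 7000.0, 4000.0, 3500.0, 1500.0],
})
x, y = toy_df['kilometer'], toy_df['price']

# Pearson r: sum of products of deviations over the root of the
# product of the summed squared deviations
dx, dy = x - x.mean(), y - y.mean()
manual_r = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same quantity as computed by pandas
pandas_r = x.corr(y)

print(manual_r, pandas_r)
```

A value near -1, as here, matches the intuition behind the negative kilometer entry in the table above: the more a car has been driven, the lower its price tends to be.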

The correlations between features can also be inspected with a heat map.

f ,ax = plt.subplots(figsize = (7, 7))

plt.title('Correlation of Numeric Features with Price', y = 1, size = 16)

sns.heatmap(correlation, square = True, vmax = 0.8)

7.3.2 Distributions of the numeric features

f = pd.melt(train_df, value_vars = numeric_features)
g = sns.FacetGrid(f, col = "variable",  col_wrap = 2, sharex = False, sharey = False)
g = g.map(sns.distplot, "value")

7.3.3 Pairwise relationships between numeric features

sns.set()
columns = ['price', 'v_12', 'v_8' , 'v_0', 'power', 'v_5',  'v_2', 'v_6', \
           'v_1', 'v_14']
# seaborn 0.9 renamed the size parameter to height
sns.pairplot(train_df[columns], height = 2, kind = 'scatter', diag_kind = 'kde')
plt.show()

7.3.4 Regression relationships between several variables and price

Y_train = train_df['price']

# Features to plot against price, one regression panel each
reg_features = ['v_12', 'v_8', 'v_0', 'power', 'v_5',
                'v_2', 'v_6', 'v_1', 'v_14', 'v_13']

fig, axes = plt.subplots(nrows = 5, ncols = 2, figsize = (24, 20))

# Fill the 5 x 2 grid row by row, in the same order as the feature list
for feature, ax in zip(reg_features, axes.ravel()):
    scatter_df = pd.concat([Y_train, train_df[feature]], axis = 1)
    sns.regplot(x = feature, y = 'price', data = scatter_df,
                scatter = True, fit_reg = True, ax = ax)

8. Generate a data report with pandas_profiling

import pandas_profiling
pfr = pandas_profiling.ProfileReport(train_df)
pfr.to_file(r'./report.html')

Like others, I ran into a problem here: the program hangs at 45%. The cause remains to be investigated.

9. References

1. https://blog.youkuaiyun.com/Andy_shenzl/article/details/81633356

2. https://blog.youkuaiyun.com/m0_37228052/article/details/89639426

3. https://blog.youkuaiyun.com/qq_40195360/article/details/86605860