[Tianchi Used-Car Price Prediction] Exploratory Data Analysis

Latest recommended article published on 2025-03-28 20:00:54

Johnny_sc

Article tags: Big Data

Article link: https://blog.youkuaiyun.com/Johnny_sc/article/details/105082346

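This post presents an exploratory data analysis of the Tianchi used-car price prediction dataset. The author observes outliers in price, analyzes the date, categorical, and numerical features, and finds that the features most correlated with price include the car's registration year, the kilometers driven, and several anonymous features. The post also notes redundancy among some features and proposes initial ideas for feature engineering and model training.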

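I went through the official EDA walkthrough, but it involves quite a lot of steps and I could not digest it all at once.
So I studied the EDA by the expert 天才儿童 instead.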

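First, read in the training set and take a quick look at each column, focusing on the prediction target price. Its mean is about 5,900 and its standard deviation about 7,500, yet the maximum is 99,999. Something is clearly off here: regression problems are especially vulnerable to outliers.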

import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)


train_df = pd.read_csv('G:/tianchi/dataMining_cars/used_car_train_20200313.csv', sep=' ')
print(train_df.shape)
train_df.describe()
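
To put a rough number on how many extreme prices there are, one common option is the interquartile-range (IQR) rule. The sketch below is not from the original analysis: it runs on synthetic prices standing in for train_df['price'], and the helper name iqr_outlier_mask and the multiplier k=1.5 are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed stand-in for train_df['price'] (not the real data),
# with one extreme value appended to mimic the 99999 maximum.
rng = np.random.default_rng(0)
price = pd.Series(np.append(rng.lognormal(mean=8.0, sigma=1.0, size=1000), 99999.0))


def iqr_outlier_mask(s, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)


mask = iqr_outlier_mask(price)
print(price.describe())
print('share flagged as outliers:', mask.mean())
```

On the real data, the same mask could be applied to train_df['price'] to see which rows fall outside the fences before deciding whether to drop them.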

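Plotting price directly shows that its distribution is quite far from normal, and there are quite a few outliers in the far tail. Training error will probably be large because these outliers cannot be predicted accurately. Dropping them during training is an option, but if the test set contains similar points there is little that can be done: in a regression setting, the error from a single outlier can drag down the metric over the whole dataset.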

import matplotlib.pyplot as plt
import seaborn as sns


plt.figure()
sns.histplot(train_df['price'], kde=True, stat='density')  # distplot is deprecated in recent seaborn releases
plt.figure()
train_df['price'].plot.box()
plt.show()
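
A commonly used way to tame a long right tail like this is to model log1p(price) rather than price itself. The post does not say that this was done, so treat the following as an assumption-laden sketch on synthetic data: it compares skewness before and after the transform and shows that expm1 maps values back to the original scale.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed stand-in for train_df['price'] (not the real data).
rng = np.random.default_rng(42)
price = pd.Series(rng.lognormal(mean=8.0, sigma=1.0, size=5000))

# log(1 + x) compresses the long right tail and is defined for zero prices.
log_price = np.log1p(price)

print('skewness of raw price  :', price.skew())
print('skewness of log1p price:', log_price.skew())

# Values predicted in log space are mapped back to prices with expm1.
restored = np.expm1(log_price)
```

Whether the transform helps depends on the evaluation metric, so it is worth checking against a validation split rather than assuming an improvement.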

Read in the test set and look at the full dataset.

import gc


test_df = pd.read_csv('datalab/231784/used_car_testA_20200313.csv', sep=' ')
print(test_df.shape)
df = pd.concat([train_df, test_df], axis=0, ignore_index=True)
del train_df, test_df
gc.collect()
df.head()
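
After the concatenation, rows that came from the test set have no price, so they show up as NaN in that column. The sketch below uses small made-up frames to show how the two parts can be separated again later; it assumes the training rows themselves never have a missing price, and the values are illustrative only.

```python
import pandas as pd

# Toy frames with made-up values; the test frame has no 'price' column.
train_df = pd.DataFrame({'SaleID': [0, 1, 2],
                         'power': [60, 0, 163],
                         'price': [1850, 3600, 6222]})
test_df = pd.DataFrame({'SaleID': [3, 4], 'power': [90, 110]})

df = pd.concat([train_df, test_df], axis=0, ignore_index=True)

# Test rows are exactly the rows whose 'price' is NaN after the concat
# (assumption: no training row has a missing price).
is_test = df['price'].isna()
train_part = df[~is_test]
test_part = df[is_test].drop(columns='price')

print(train_part.shape, test_part.shape)
```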

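Split the features into three groups: date features, categorical features, and numerical features. Then look at each feature's missing rate, number of unique values (nunique), and related statistics. The seller and offerType features can be dropped: every sample has the same value, so they carry no information. The table also suggests that the anonymous features v_0 to v_4 and v_10 to v_14 look somewhat alike and seem to share many similarities.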

date_cols = ['regDate', 'creatDate']
cate_cols = ['name', 'model', 'brand', 'bodyType', 'fuelType', 'gearbox', 'notRepairedDamage', 'regionCode', 'seller', 'offerType']
num_cols = ['power', 'kilometer'] + ['v_{}'.format(i) for i in range(15)]
cols = date_cols + cate_cols + num_cols

tmp = pd.DataFrame()
tmp['count'] = df[cols].count().values
tmp['missing_rate'] = (df.shape[0] - tmp['count']) / df.shape[0]
tmp['nunique'] = df[cols].nunique().values
tmp['max_value_counts'] = [df[f].value_counts().values[0] for f in cols]
tmp['max_value_counts_prop'] = tmp['max_value_counts'] / df.shape[0]
tmp['max_value_counts_value'] = [df[f].value_counts().index[0] for f in cols]
tmp.index = cols
tmp
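
The max_value_counts_prop column above is what exposes features such as seller and offerType. A minimal sketch of turning that observation into an automatic filter follows; the helper name near_constant_columns and the 0.99 threshold are assumptions for illustration, and the toy frame stands in for the real data.

```python
import pandas as pd

# Toy frame with made-up values; two columns are constant.
df = pd.DataFrame({
    'seller': [0, 0, 0, 0, 0],
    'offerType': [0, 0, 0, 0, 0],
    'gearbox': [0, 1, 0, None, 1],
    'kilometer': [12.5, 15.0, 12.5, 6.0, 15.0],
})


def near_constant_columns(frame, threshold=0.99):
    """List columns whose most frequent value covers at least `threshold` of all rows."""
    flagged = []
    for col in frame.columns:
        counts = frame[col].value_counts()  # NaN is excluded by default
        top_share = counts.iloc[0] / len(frame) if len(counts) else 0.0
        if top_share >= threshold:
            flagged.append(col)
    return flagged


to_drop = near_constant_columns(df)
df = df.drop(columns=to_drop)
print(to_drop)
```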

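Process the date columns and extract the year, month, day, and day of week. Some samples have invalid dates in which the month is 0, so a helper function is needed to handle them separately.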

from tqdm import tqdm


def date_proc(x):
    m = int(x[4:6])
    if m == 0:
        m = 1
    return x[:4] + '-' + str(m) + '-' + x[6:]


for f in tqdm(date_cols):
    df[f] = pd.to_datetime(df[f].astype('str').apply(date_proc))
    df[f + '_year'] = df[f].dt.year
    df[f + '_month'] = df[f].dt.month
    df[f + '_day'] = df[f].dt.day
    df[f + '_dayofweek'] = df[f].dt.dayofweek
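
To see the month-0 handling in isolation, here is a small variant on made-up values. It is a sketch rather than the author's function: the helper fix_month is hypothetical, and it keeps the string zero-padded so that an explicit format can be passed to pd.to_datetime.

```python
import pandas as pd

# Made-up YYYYMMDD integers; the second one has the invalid month 00.
raw = pd.Series([20040402, 20070009, 20160309])


def fix_month(x):
    """Replace an invalid month '00' with '01' in a YYYYMMDD string."""
    return x[:4] + '01' + x[6:] if x[4:6] == '00' else x


dates = pd.to_datetime(raw.astype(str).map(fix_month), format='%Y%m%d')

parts = pd.DataFrame({
    'year': dates.dt.year,
    'month': dates.dt.month,
    'day': dates.dt.day,
    'dayofweek': dates.dt.dayofweek,  # Monday=0 ... Sunday=6
})
print(parts)
```

Mapping month 0 to January is a convention, not a recovered fact, so it may be worth keeping a flag column that records which rows were patched.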

Then look at the distributions of these date-derived features.

plt.figure(figsize=(16, 6))
i = 1
for f in date_cols:
    for col in ['year', 'month', 'day
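
The listing above is cut off in the source. As a separate, minimal sketch (not a reconstruction of the author's plotting code), the distribution of each derived date part can also be inspected numerically with value_counts. The dates below are made up, and the regDate_* column names simply follow the naming used earlier in the post.

```python
import pandas as pd

# Made-up registration dates standing in for df['regDate'].
dates = pd.to_datetime(pd.Series(['2004-04-02', '2007-01-09', '2016-03-09', '2016-03-10']))
df = pd.DataFrame({'regDate': dates})

date_parts = ['year', 'month', 'day', 'dayofweek']
for part in date_parts:
    df['regDate_' + part] = getattr(df['regDate'].dt, part)

# Relative frequency of each value, ordered by the value itself.
dist = {
    part: df['regDate_' + part].value_counts(normalize=True).sort_index()
    for part in date_parts
}
print(dist['month'])
```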
