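Used Car Transaction Price Prediction

This post walks through data processing and modeling. It starts with a preliminary check of the data, converting the type of the 'notRepairedDamage' column and dropping uninformative features. Exploratory data analysis follows; because the dataset is large, it relies on summary functions and the competition's data description rather than plots. Feature engineering then corrects out-of-range power values and fills missing values. Three models (random forest, XGBoost, GBDT) are compared by cross-validation, XGBoost is selected and tuned, and the predictions are submitted.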


# Import libraries and configure the environment
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from sklearn.model_selection import cross_val_score, GridSearchCV  # cross-validation, grid search
pd.options.display.max_columns = None  # show all columns
warnings.filterwarnings('ignore')  # suppress warnings to keep the output clean
%matplotlib inline
# Load the data and do some initial processing
df_train = pd.read_csv('datalab/231784/used_car_train_20200313.csv', sep=' ')
df_test = pd.read_csv('datalab/231784/used_car_testA_20200313.csv', sep=' ')
train = df_train.drop(['SaleID'], axis=1)
test = df_test.drop(['SaleID'], axis=1)

1. A First Look at the Data

train.head()
   name  regDate  model  brand  bodyType  fuelType  gearbox  power  kilometer  notRepairedDamage  regionCode  seller  offerType  creatDate  price  v_0  v_1  v_2  v_3  v_4  v_5  v_6  v_7  v_8  v_9  v_10  v_11  v_12  v_13  v_14
0  736  20040402  30.0  6  1.0  0.0  0.0  60  12.5  0.0  1046  0  0  20160404  1850  43.357796  3.966344  0.050257  2.159744  1.143786  0.235676  0.101988  0.129549  0.022816  0.097462  -2.881803  2.804097  -2.420821  0.795292  0.914762
1  2262  20030301  40.0  1  2.0  0.0  0.0  0  15.0  -  4366  0  0  20160309  3600  45.305273  5.236112  0.137925  1.380657  -1.422165  0.264777  0.121004  0.135731  0.026597  0.020582  -4.900482  2.096338  -1.030483  -1.722674  0.245522
2  14874  20040403  115.0  15  1.0  0.0  0.0  163  12.5  0.0  2806  0  0  20160402  6222  45.978359  4.823792  1.319524  -0.998467  -0.996911  0.251410  0.114912  0.165147  0.062173  0.027075  -4.846749  1.803559  1.565330  -0.832687  -0.229963
3  71865  19960908  109.0  10  0.0  0.0  1.0  193  15.0  0.0  434  0  0  20160312  2400  45.687478  4.492574  -0.050616  0.883600  -2.228079  0.274293  0.110300  0.121964  0.033395  0.000000  -4.509599  1.285940  -0.501868  -2.438353  -0.478699
4  111080  20120103  110.0  5  1.0  0.0  0.0  68  5.0  0.0  6977  0  0  20160313  5200  44.383511  2.031433  0.572169  -1.571239  2.246088  0.228036  0.073205  0.091880  0.078819  0.121534  -1.896240  0.910783  0.931110  2.834518  1.923482
test.head()
   name  regDate  model  brand  bodyType  fuelType  gearbox  power  kilometer  notRepairedDamage  regionCode  seller  offerType  creatDate  v_0  v_1  v_2  v_3  v_4  v_5  v_6  v_7  v_8  v_9  v_10  v_11  v_12  v_13  v_14
0  66932  20111212  222.0  4  5.0  1.0  1.0  313  15.0  0.0  1440  0  0  20160329  49.593127  5.246568  1.001130  -4.122264  0.737532  0.264405  0.121800  0.070899  0.106558  0.078867  -7.050969  -0.854626  4.800151  0.620011  -3.664654
1  174960  19990211  19.0  21  0.0  0.0  0.0  75  12.5  1.0  5419  0  0  20160404  42.395926  -3.253950  -1.753754  3.646605  -0.725597  0.261745  0.000000  0.096733  0.013705  0.052383  3.679418  -0.729039  -3.796107  -1.541230  -0.757055
2  5356  20090304  82.0  21  0.0  0.0  0.0  109  7.0  0.0  5045  0  0  20160308  45.841370  4.704178  0.155391  -1.118443  -0.229160  0.260216  0.112081  0.078082  0.062078  0.050540  -4.926690  1.001106  0.826562  0.138226  0.754033
3  50688  20100405  0.0  0  0.0  0.0  1.0  160  7.0  0.0  4023  0  0  20160325  46.440649  4.319155  0.428897  -2.037916  -0.234757  0.260466  0.106727  0.081146  0.075971  0.048268  -4.864637  0.505493  1.870379  0.366038  1.312775
4  161428  19970703  26.0  14  2.0  0.0  0.0  75  15.0  0.0  3103  0  0  20160309  42.184604  -3.166234  -1.572058  2.604143  0.387498  0.250999  0.000000  0.077806  0.028600  0.081709  3.616475  -0.673236  -3.197685  -0.025678  -0.101290
# Overview of the training set
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 30 columns):
name                 150000 non-null int64
regDate              150000 non-null int64
model                149999 non-null float64
brand                150000 non-null int64
bodyType             145494 non-null float64
fuelType             141320 non-null float64
gearbox              144019 non-null float64
power                150000 non-null int64
kilometer            150000 non-null float64
notRepairedDamage    150000 non-null object
regionCode           150000 non-null int64
seller               150000 non-null int64
offerType            150000 non-null int64
creatDate            150000 non-null int64
price                150000 non-null int64
v_0                  150000 non-null float64
v_1                  150000 non-null float64
v_2                  150000 non-null float64
v_3                  150000 non-null float64
v_4                  150000 non-null float64
v_5                  150000 non-null float64
v_6                  150000 non-null float64
v_7                  150000 non-null float64
v_8                  150000 non-null float64
v_9                  150000 non-null float64
v_10                 150000 non-null float64
v_11                 150000 non-null float64
v_12                 150000 non-null float64
v_13                 150000 non-null float64
v_14                 150000 non-null float64
dtypes: float64(20), int64(9), object(1)
memory usage: 34.3+ MB
# Overview of the test set
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 29 columns):
name                 50000 non-null int64
regDate              50000 non-null int64
model                50000 non-null float64
brand                50000 non-null int64
bodyType             48587 non-null float64
fuelType             47107 non-null float64
gearbox              48090 non-null float64
power                50000 non-null int64
kilometer            50000 non-null float64
notRepairedDamage    50000 non-null object
regionCode           50000 non-null int64
seller               50000 non-null int64
offerType            50000 non-null int64
creatDate            50000 non-null int64
v_0                  50000 non-null float64
v_1                  50000 non-null float64
v_2                  50000 non-null float64
v_3                  50000 non-null float64
v_4                  50000 non-null float64
v_5                  50000 non-null float64
v_6                  50000 non-null float64
v_7                  50000 non-null float64
v_8                  50000 non-null float64
v_9                  50000 non-null float64
v_10                 50000 non-null float64
v_11                 50000 non-null float64
v_12                 50000 non-null float64
v_13                 50000 non-null float64
v_14                 50000 non-null float64
dtypes: float64(20), int64(8), object(1)
memory usage: 11.1+ MB
1.1 'notRepairedDamage' is the only non-numeric feature. It takes only the values 0, 1, or '-', so convert its data type and turn '-' into a missing value
# Replace '-' with NaN
train['notRepairedDamage'] = train['notRepairedDamage'].replace('-', np.nan)
test['notRepairedDamage'] = test['notRepairedDamage'].replace('-', np.nan)

# Convert the data type
train['notRepairedDamage'] = train['notRepairedDamage'].astype('float64')
test['notRepairedDamage'] = test['notRepairedDamage'].astype('float64')

# Check that the conversion worked
train['notRepairedDamage'].unique(), test['notRepairedDamage'].unique()
(array([  0.,  nan,   1.]), array([  0.,   1.,  nan]))
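The same conversion can also be requested while the file is being parsed. The following is a minimal sketch, not the approach used in this post: it assumes a tiny made-up file in place of the competition CSV and relies on the `na_values` parameter of `pd.read_csv`, which marks the given token as missing in every column.

```python
import io

import pandas as pd

# Tiny stand-in for the space-separated competition file (made-up rows).
raw = io.StringIO(
    "SaleID notRepairedDamage power\n"
    "0 0.0 60\n"
    "1 - 0\n"
    "2 1.0 163\n"
)

# na_values turns '-' into NaN during parsing, so the column is read as float64
# and no separate replace()/astype() step is needed.
demo = pd.read_csv(raw, sep=' ', na_values='-')

print(demo['notRepairedDamage'].dtype)
print(demo['notRepairedDamage'].isna().sum())
```

Because `na_values` applies to the whole file, this shortcut is only safe when '-' never appears as a legitimate value in another column.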
# Summary statistics of the test set
test.describe()
       name  regDate  model  brand  bodyType  fuelType  gearbox  power  kilometer  notRepairedDamage  regionCode  seller  offerType  creatDate  v_0  v_1  v_2  v_3  v_4  v_5  v_6  v_7  v_8  v_9  v_10  v_11  v_12  v_13  v_14
count  50000.000000  5.000000e+04  50000.000000  50000.000000  48587.000000  47107.000000  48090.000000  50000.000000  50000.000000  41969.000000  50000.000000  50000.0  50000.0  5.000000e+04  50000.000000  50000.000000  50000.000000  50000.000000  50000.000000  50000.000000  50000.000000  50000.000000  50000.000000  50000.000000  50000.000000  50000.000000  50000.000000  50000.000000  50000.000000
mean  68542.223280  2.003393e+07  46.844520  8.056240  1.782185  0.373405  0.224350  119.883620  12.595580  0.112464  2590.604820  0.0  0.0  2.016033e+07  44.418233  -0.037238  0.050534  0.084640  0.015001  0.248669  0.045021  0.122744  0.057997  0.062000  -0.017855  -0.013742  -0.013554  -0.003147  0.001516
std  61052.808133  5.368870e+04  49.469548  7.819477  1.760736  0.546442  0.417158  185.097387  3.908979  0.315940  1876.970263  0.0  0.0  7.951521e+01  2.429950  3.642562  2.856341  2.026510  1.193026  0.044601  0.051766  0.195972  0.029211  0.035653  3.747985  3.231258  2.515962  1.286597  1.027360
min  0.000000  1.991000e+07  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.500000  0.000000  0.000000  0.0  0.0  2.015061e+07  28.987024  -4.137733  -4.205728  -5.638184  -4.287718  0.000000  0.000000  0.000000  0.000000  0.000000  -9.160049  -5.411964  -8.916949  -4.123333  -6.112667
25%  11203.500000  1.999091e+07  10.000000  1.000000  0.000000  0.000000  0.000000  75.000000  12.500000  0.000000  1030.000000  0.0  0.0  2.016031e+07  43.139621  -3.191909  -0.971266  -1.453453  -0.928089  0.243762  0.000044  0.062644  0.035084  0.033714  -3.700121  -1.971325  -1.876703  -1.060428  -0.437920
50%  52248.500000  2.003091e+07  29.000000  6.000000  1.000000  0.000000  0.000000  109.000000  15.000000  0.000000  2219.000000  0.0  0.0  2.016032e+07  44.611084  -3.050756  -0.388117  0.097881  -0.070225  0.257877  0.000815  0.095828  0.057084  0.058764  1.613212  -0.355843  -0.142779  -0.035956  0.138799
75%  118856.500000  2.007110e+07  65.000000  13.000000  3.000000  1.000000  0.000000  150.000000  15.000000  0.000000  3857.000000  0.0  0.0  2.016033e+07  45.992639  3.997323  0.240548  1.562700  0.863731  0.265328  0.102025  0.125438  0.079077  0.087489  2.832708  1.262914  1.764335  0.941469  0.681163
max  196805.000000  2.015121e+07  246.000000  39.000000  7.000000  6.000000  1.000000  20000.000000  15.000000  1.000000  8121.000000  0.0  0.0  2.016041e+07  51.751684  7.553517  18.394570  9.381599  5.270150  0.291618  0.153265  1.358813  0.156355  0.214775  12.338872  18.856218  12.950498  5.913273  2.624622
# Summary statistics of the training set
train.describe()
       name  regDate  model  brand  bodyType  fuelType  gearbox  power  kilometer  notRepairedDamage  regionCode  seller  offerType  creatDate  price  v_0  v_1  v_2  v_3  v_4  v_5  v_6  v_7  v_8  v_9  v_10  v_11  v_12  v_13  v_14
count  150000.000000  1.500000e+05  149999.000000  150000.000000  145494.000000  141320.000000  144019.000000  150000.000000  150000.000000  125676.000000  150000.000000  150000.000000  150000.0  1.500000e+05  150000.000000  150000.000000  150000.000000  150000.000000  150000.000000  150000.000000  150000.000000  150000.000000  150000.000000  150000.000000  150000.000000  150000.000000  150000.000000  150000.000000  150000.000000  150000.000000
mean  68349.172873  2.003417e+07  47.129021  8.052733  1.792369  0.375842  0.224943  119.316547  12.597160  0.113904  2583.077267  0.000007  0.0  2.016033e+07  5923.327333  44.406268  -0.044809  0.080765  0.078833  0.017875  0.248204  0.044923  0.124692  0.058144  0.061996  -0.001000  0.009035  0.004813  0.000313  -0.000688
std  61103.875095  5.364988e+04  49.536040  7.864956  1.760640  0.548677  0.417546  177.168419  3.919576  0.317696  1885.363218  0.002582  0.0  1.067328e+02  7501.998477  2.457548  3.641893  2.929618  2.026514  1.193661  0.045804  0.051743  0.201410  0.029186  0.035692  3.772386  3.286071  2.517478  1.288988  1.038685
min  0.000000  1.991000e+07  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.500000  0.000000  0.000000  0.000000  0.0  2.015062e+07  11.000000  30.451976  -4.295589  -4.470671  -7.275037  -4.364565  0.000000  0.000000  0.000000  0.000000  0.000000  -9.168192  -5.558207  -9.639552  -4.153899  -6.546556
25%  11156.000000  1.999091e+07  10.000000  1.000000  0.000000  0.000000  0.000000  75.000000  12.500000  0.000000  1018.000000  0.000000  0.0  2.016031e+07  1300.000000  43.135799  -3.192349  -0.970671  -1.462580  -0.921191  0.243615  0.000038  0.062474  0.035334  0.033930  -3.722303  -1.951543  -1.871846  -1.057789  -0.437034
50%  51638.000000  2.003091e+07  30.000000  6.000000  1.000000  0.000000  0.000000  110.000000  15.000000  0.000000  2196.000000  0.000000  0.0  2.016032e+07  3250.000000  44.610266  -3.052671  -0.382947  0.099722  -0.075910  0.257798  0.000812  0.095866  0.057014  0.058484  1.624076  -0.358053  -0.130753  -0.036245  0.141246
75%  118841.250000  2.007111e+07  66.000000  13.000000  3.000000  1.000000  0.000000  150.000000  15.000000  0.000000  3843.000000  0.000000  0.0  2.016033e+07  7700.000000  46.004721  4.000670  0.241335  1.565838  0.868758  0.265297  0.102009  0.125243  0.079382  0.087491  2.844357  1.255022  1.776933  0.942813  0.680378
max  196812.000000  2.015121e+07  247.000000  39.000000  7.000000  6.000000  1.000000  19312.000000  15.000000  1.000000  8120.000000  1.000000  0.0  2.016041e+07  99999.000000  52.304178  7.320308  19.035496  9.854702  6.829352  0.291838  0.151420  1.404936  0.160791  0.222787  12.357011  18.819042  13.847792  11.147669  8.658418
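1.2 The seller feature is extremely skewed in both sets (training mean 0.000007, constant 0 in the test set), so it does not help prediction; drop it
train.drop(['seller'], axis=1, inplace=True)
test.drop(['seller'], axis=1, inplace=True)
1.3 As a side finding, the offerType column is 0 everywhere in both datasets; drop it as well.
train = train.drop(['offerType'], axis=1)
test = test.drop(['offerType'], axis=1)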
train.shape, test.shape
((150000, 28), (50000, 27))
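Columns like seller and offerType were spotted here by reading the `describe()` output. As a hedged sketch of how the same check could be automated, the helper below (a hypothetical name, not part of the original notebook) reports the share of rows taken by the most frequent value of each column; the 0.95 cut-off is an arbitrary assumption for the example, and the frame is made up.

```python
import pandas as pd


def dominant_share(df):
    """Share of rows taken by the most frequent value in each column."""
    # value_counts() sorts by frequency, so the first entry is the dominant value.
    return df.apply(lambda col: col.value_counts(normalize=True, dropna=False).iloc[0])


# Made-up frame that mimics the situation: one almost-constant column,
# one constant column, and one column with real variation.
demo = pd.DataFrame({
    'seller': [0] * 99 + [1],
    'offerType': [0] * 100,
    'gearbox': [0] * 60 + [1] * 40,
})

shares = dominant_share(demo)
# The 0.95 threshold is an assumption of this sketch, not a value from the post.
to_drop = shares[shares >= 0.95].index.tolist()
print(to_drop)
```

Running the check on the training set and applying the resulting drop list to both sets keeps the two frames aligned.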

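2. Exploratory Data Analysis

2.1 Plot each feature against the sale price (in practice this plot turned out to be very slow to draw)
# fig = plt.figure(figsize=(10, 50))

# for i in range(len(train.columns)-1):  # minus one for the price column
#     fig.add_subplot(14, 2, i+1)  # 27 features need 14 rows of 2 panels
#     sns.regplot(x=train.drop(['price'], axis=1).iloc[:, i], y=train['price'])

# plt.tight_layout()
# plt.show()
2.2 The dataset is large, and performance limits make it hard to visualize the feature distributions. Instead, the exploratory analysis here relies on summary functions and the competition's data description

The competition's data description states that power lies in [0, 600]; however,


# 143 values are out of range and need to be replaced with something else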
train[train['power'] > 600]['power'].count()
143
test[test['power'] > 600]['power'].count()
70
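2.3 At this point, feature engineering is limited to filling missing values and dropping some features. Before starting, look at the linear correlation coefficients
# Linear correlation coefficient between each feature and the sale price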
train.corr().unstack()['price'].sort_values(ascending=False)
price                1.000000
v_12                 0.692823
v_8                  0.685798
v_0                  0.628397
regDate              0.611959
gearbox              0.329075
bodyType             0.241303
power                0.219834
fuelType             0.200536
v_5                  0.164317
model                0.136983
v_2                  0.085322
v_6                  0.068970
v_1                  0.060914
v_14                 0.035911
regionCode           0.014036
creatDate            0.002955
name                 0.002030
v_13                -0.013993
brand               -0.043799
v_7                 -0.053024
v_4                 -0.147085
notRepairedDamage   -0.190623
v_9                 -0.206205
v_10                -0.246175
v_11                -0.275320
kilometer           -0.440519
v_3                 -0.730946
dtype: float64
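# When choosing features to drop, start with those that have a low linear correlation with price. Step 1: shortlist the features whose coefficient has an absolute value below 0.1. Step 2: set the coefficient aside and consider, from a real-world perspective, how each feature affects the sale price

# Features v_2, v_6, v_1, v_14, v_13, v_7: these are continuous variables, so their values are numerically meaningful. Since their linear correlation with price is very low, consider dropping them to reduce noise and the risk of overfitting;

# Features regionCode, brand: these are category codes rather than continuous variables, so their values cannot be compared numerically. A low linear correlation with price therefore does not show that their values have little effect on price; keep them.

# Feature name: the listing name of the car. The training set has 99662 distinct values, and the value has no bearing on price; drop it.

# Feature creatDate: the date the (used) car was put up for sale, ranging over [20150618, 20160407]. The interval is short, and its linear correlation with regDate (the car's registration date) is only -0.001293, so its value evidently has little effect on price; drop it.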
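Step 1 of the procedure above, shortlisting by absolute correlation, can be sketched as follows. This is an illustration on a small made-up frame (the column names `v_strong` and `v_weak` are hypothetical), using `DataFrame.corrwith` to correlate every feature with price in one call.

```python
import pandas as pd

# Made-up data: v_strong is perfectly linear in price, while v_weak follows a
# pattern that is uncorrelated with it.
demo = pd.DataFrame({
    'v_strong': [1, 2, 3, 4, 5, 6, 7, 8],
    'v_weak': [1, -1, -1, 1, 1, -1, -1, 1],
})
demo['price'] = 1000 * demo['v_strong']

# Pearson correlation of every feature column with the target.
corr_with_price = demo.drop(columns='price').corrwith(demo['price'])

# Shortlist the features whose absolute correlation is below the 0.1 threshold.
weak = corr_with_price[corr_with_price.abs() < 0.1].index.tolist()
print(weak)
```

The shortlist is only a starting point: as the notes above say, category codes such as regionCode and brand should be judged separately, because a Pearson coefficient computed on arbitrary codes says little about their effect.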
2.4 Drop these features, and drop the same features from the test set
train.drop(['v_2', 'v_6', 'v_1', 'v_14', 'v_13', 'v_7', 'name', 'creatDate'], axis=1, inplace=True)
test.drop(['v_2', 'v_6', 'v_1', 'v_14', 'v_13', 'v_7', 'name', 'creatDate'], axis=1, inplace=True)
train.shape, test.shape
((150000, 20), (50000, 19))
# Check the linear correlation between each feature and the sale price again
train.corr().unstack()['price'].sort_values(ascending=False)
price                1.000000
v_12                 0.692823
v_8                  0.685798
v_0                  0.628397
regDate              0.611959
gearbox              0.329075
bodyType             0.241303
power                0.219834
fuelType             0.200536
v_5                  0.164317
model                0.136983
regionCode           0.014036
brand               -0.043799
v_4                 -0.147085
notRepairedDamage   -0.190623
v_9                 -0.206205
v_10                -0.246175
v_11                -0.275320
kilometer           -0.440519
v_3                 -0.730946
dtype: float64

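3. Feature Engineering

3.1 Correct power values greater than 600
# Use map to replace out-of-range power values with the column median
# (each median is computed once; calling .median() inside the lambda would recompute it for every row)
train_power_median = train['power'].median()
test_power_median = test['power'].median()
train['power'] = train['power'].map(lambda x: train_power_median if x > 600 else x)
test['power'] = test['power'].map(lambda x: test_power_median if x > 600 else x)
# Check that the replacement worked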
train['power'].plot.hist()

[Figure: histogram of power in the training set after the replacement]

test['power'].plot.hist()

[Figure: histogram of power in the test set after the replacement]
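The replacement in 3.1 can also be written without a per-row lambda. The sketch below uses `Series.mask` on made-up values; the helper name `cap_power` is hypothetical, and reusing the training median for the test set is a choice made in this sketch, whereas the code above uses each set's own median.

```python
import pandas as pd


def cap_power(power, upper=600, fill_value=None):
    """Replace values above `upper` with `fill_value` (default: the column median)."""
    if fill_value is None:
        fill_value = power.median()
    # mask() swaps in fill_value wherever the condition is True.
    return power.mask(power > upper, fill_value)


# Made-up power values with one out-of-range entry in each set.
train_power = pd.Series([60, 0, 163, 193, 68, 19312])
test_power = pd.Series([313, 75, 20000])

train_median = train_power.median()
train_fixed = cap_power(train_power)
# Assumption of this sketch: the test set reuses the statistic learned on the training set.
test_fixed = cap_power(test_power, fill_value=train_median)

print(train_fixed.tolist())
print(test_fixed.tolist())
```

Either way the column becomes float64, because the median is a float.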

3.2 Fill missing values
# Missing values in the training set
train.isnull().sum()[train.isnull().sum() > 0]
model                    1
bodyType              4506
fuelType              8680
gearbox               5981
notRepairedDamage    24324
dtype: int64
# Missing values in the test set
test.isnull().sum()[test.isnull().sum() > 0]
bodyType             1413
fuelType             2893
gearbox              1910
notRepairedDamage    8031
dtype: int64
3.2.1 Handle the single missing model value in the training set
train[train['model'].isnull()]
       regDate  model  brand  bodyType  fuelType  gearbox  power  kilometer  notRepairedDamage  regionCode  price  v_0  v_3  v_4  v_5  v_8  v_9  v_10  v_11  v_12
38424  20150809  NaN  37  6.0  1.0  1.0  190.0  2.0  0.0  1425  47950  41.139365  -7.275037  6.829352  0.181562  0.148487  0.222787  1.6757  -3.25056  0.876001
# model (the car model code) is generally related to brand, bodyType, gearbox and power. Select the cars that match this one on those 4 features and take their most frequent model value
train[(train['brand'] == 37) & 
      (train['bodyType'] == 6.0) & 
      (train['gearbox'] == 1.0) & 
      (train['power'] == 190)]['model'].value_counts()
157.0    17
199.0    16
202.0     8
200.0     1
Name: model, dtype: int64
# Fill the missing value with 157.0
train.loc[38424, 'model'] = 157.0
train.loc[38424, :]
regDate              2.015081e+07
model                1.570000e+02
brand                3.700000e+01
bodyType             6.000000e+00
fuelType             1.000000e+00
gearbox              1.000000e+00
power                1.900000e+02
kilometer            2.000000e+00
notRepairedDamage    0.000000e+00
regionCode           1.425000e+03
price                4.795000e+04
v_0                  4.113937e+01
v_3                 -7.275037e+00
v_4                  6.829352e+00
v_5                  1.815618e-01
v_8                  1.484868e-01
v_9                  2.227875e-01
v_10                 1.675700e+00
v_11                -3.250560e+00
v_12                 8.760013e-01
Name: 38424, dtype: float64
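The lookup used for this one row generalizes to a small helper. The following is a sketch under assumptions: the function name `mode_among_matches` is hypothetical, the frame is made up, and only two matching columns are used to keep the example short.

```python
import pandas as pd


def mode_among_matches(df, row, match_cols, target_col):
    """Most frequent `target_col` among rows that equal `row` on all `match_cols`."""
    mask = pd.Series(True, index=df.index)
    for col in match_cols:
        mask &= df[col] == row[col]
    candidates = df.loc[mask, target_col].dropna()
    # Return None when no comparable row has a known value.
    return candidates.mode().iloc[0] if not candidates.empty else None


# Made-up rows: index 3 has a missing model but matches three other cars.
demo = pd.DataFrame({
    'brand': [37, 37, 37, 37, 5],
    'bodyType': [6.0, 6.0, 6.0, 6.0, 1.0],
    'model': [157.0, 157.0, 199.0, None, 110.0],
})

fill = mode_among_matches(demo, demo.loc[3], ['brand', 'bodyType'], 'model')
demo.loc[3, 'model'] = fill
print(fill)
```

With many missing rows, a `groupby` over the matching columns would avoid scanning the frame once per row.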
# Check the result of the fill
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 20 columns):
regDate              150000 non-null int64
model                150000 non-null float64
brand                150000 non-null int64
bodyType             145494 non-null float64
fuelType             141320 non-null float64
gearbox              144019 non-null float64
power                150000 non-null float64
kilometer            150000 non-null float64
notRepairedDamage    125676 non-null float64
regionCode           150000 non-null int64
price                150000 non-null int64
v_0                  150000 non-null float64
v_3                  150000 non-null float64
v_4                  150000 non-null float64
v_5                  150000 non-null float64
v_8                  150000 non-null float64
v_9                  150000 non-null float64
v_10                 150000 non-null float64
v_11                 150000 non-null float64
v_12                 150000 non-null float64
dtypes: float64(16), int64(4)
memory usage: 22.9 MB
3.2.2 Handle missing bodyType values
# Count the missing values
print(train['bodyType'].isnull().value_counts())
print('\n')
print(test['bodyType'].isnull().value_counts())
False    145494
True       4506
Name: bodyType, dtype: int64


False    48587
True      1413
Name: bodyType, dtype: int64
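# bodyType has a relatively small share of missing values. Look at how its values relate to price before deciding whether to drop it
# Draw a regression plot of the feature against price (similar to a scatter plot)
sns.regplot(x=train['bodyType'], y=train['price'])  # keyword arguments: recent seaborn versions require x and y as keywords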

[Figure: regression plot of bodyType against price]

# Prices differ quite a lot between body types, so keep this feature and fill its missing values
# Look at the distribution of body types
print(train['bodyType'].value_counts())
print('\n')
print(test['bodyType'].value_counts())
0.0    41420
1.0    35272
2.0    30324
3.0    13491
4.0     9609
5.0     7607
6.0     6482
7.0     1289
Name: bodyType, dtype: int64


0.0    13985
1.0    11882
2.0     9900
3.0     4433
4.0     3303
5.0     2537
6.0     2116
7.0      431
Name: bodyType, dtype: int64
# In both datasets, body type 0.0 (luxury sedan) is the most common, so use 0.0 to fill the missing values
train['bodyType'] = train['bodyType'].fillna(0.0)
test['bodyType'] = test['bodyType'].fillna(0.0)
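Here the fill value 0.0 was read off the printed counts. A minimal sketch of deriving it in code instead, on made-up frames, is shown below; taking the mode from the training set only and reusing it for the test set is an assumption of the sketch, consistent with the observation that both sets share the same most frequent value.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for the two datasets.
train_demo = pd.DataFrame({'bodyType': [0.0, 0.0, 1.0, np.nan, 2.0]})
test_demo = pd.DataFrame({'bodyType': [1.0, np.nan, 1.0]})

# Learn the fill value from the training set, then apply it to both sets.
fill_value = train_demo['bodyType'].mode()[0]
train_demo['bodyType'] = train_demo['bodyType'].fillna(fill_value)
test_demo['bodyType'] = test_demo['bodyType'].fillna(fill_value)

print(fill_value)
print(test_demo['bodyType'].tolist())
```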
3.2.3 Handle missing fuelType values
# Count the missing values
print(train['fuelType'].isnull().value_counts())
print('\n')
print(test['fuelType'].isnull().value_counts())
False    141320
True       8680
Name: fuelType, dtype: int64


False    47107
True      2893
Name: fuelType, dtype: int64
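# fuelType has a relatively small share of missing values. Look at how its values relate to price before deciding whether to drop it
# Draw a regression plot of the feature against price (similar to a scatter plot)
sns.regplot(x=train['fuelType'], y=train['price'])  # keyword arguments: recent seaborn versions require x and y as keywords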

[Figure: regression plot of fuelType against price]

# Hypothesis: fuel type is related to body type; for example, luxury sedans are more likely to be petrol or electric, whereas mixer trucks are mostly diesel
# Build dictionaries holding the most frequent fuelType for each bodyType, and use them to fill the missing fuelType values
dict_enu_train, dict_enu_test = {}, {}
for i in [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]:
    dict_enu_train[i] = train[train['bodyType'] == i]['fuelType'].mode()[0]
    dict_enu_test[i] = test[test['bodyType'] == i]['fuelType'].mode()[0]

# dict_enu_train and dict_enu_test turn out to have identical contents
# Start filling the missing fuelType values
# Among the rows with a missing fuelType, collect the index labels for each bodyType into a dictionary
dict_index_train, dict_index_test = {}, {}

for bodytype in [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]:
    dict_index_train[bodytype] = train[(train['bodyType'] == bodytype) & (train['fuelType'].isnull())].index.tolist()
    dict_index_test[bodytype] = test[(test['bodyType'] == bodytype) & (test['fuelType'].isnull())].index.tolist()
# For each bodyType, fill the fuelType column at the collected index labels
# (the two mode dictionaries are identical, so the training-set modes are used for both sets)
for bt, ft in dict_enu_train.items():
#     train.loc[tuple(dict_index[bt]), :]['fuelType'] = ft  # Caution: chained indexing like this is likely to make the assignment fail!
    train.loc[dict_index_train[bt], 'fuelType'] = ft  # pandas recommends this form of indexing/assignment
    test.loc[dict_index_test[bt], 'fuelType'] = ft
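The dictionary-and-index bookkeeping above can be condensed with `groupby` and `map`. The following sketch reproduces the same idea on made-up frames; it is an alternative formulation, not the code that produced the results in this post.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins: each bodyType has one dominant fuelType and one missing entry.
train_demo = pd.DataFrame({
    'bodyType': [0.0, 0.0, 0.0, 1.0, 1.0, 1.0],
    'fuelType': [0.0, 0.0, np.nan, 1.0, 1.0, np.nan],
})
test_demo = pd.DataFrame({
    'bodyType': [0.0, 1.0],
    'fuelType': [np.nan, np.nan],
})

# Most frequent fuelType per bodyType, learned from the training rows (mode() skips NaN).
mode_by_body = train_demo.groupby('bodyType')['fuelType'].agg(lambda s: s.mode().iloc[0])

for df in (train_demo, test_demo):
    # map() looks up the group mode for each row's bodyType;
    # fillna() then only touches the entries that were missing.
    df['fuelType'] = df['fuelType'].fillna(df['bodyType'].map(mode_by_body))

print(train_demo['fuelType'].tolist())
print(test_demo['fuelType'].tolist())
```

This version needs bodyType to be filled first, as is the case at this point in the notebook; a bodyType that never occurs in the training rows would leave its fuelType missing.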
3.2.4 Fill missing gearbox values
# Count the missing values
print(train['gearbox'].isnull().value_counts())
print('\n')
print(test['gearbox'].isnull().value_counts())
False    144019
True       5981
Name: gearbox, dtype: int64


False    48090
True      1910
Name: gearbox, dtype: int64
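# gearbox has a relatively small share of missing values. Look at how its values relate to price before deciding whether to drop it
# Draw a regression plot of the feature against price (similar to a scatter plot)
sns.regplot(x=train['gearbox'], y=train['price'])  # keyword arguments: recent seaborn versions require x and y as keywords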

[Figure: regression plot of gearbox against price]

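# The plot suggests that gearbox type does not affect price dramatically. Dropping the training rows with missing values might be workable, but to avoid the overfitting that a smaller sample can bring, keep the feature and fill its missing values
# Look at the distribution of gearbox types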
print(train['gearbox'].value_counts())
print('\n')
print(test['gearbox'].value_counts())
0.0    111623
1.0     32396
Name: gearbox, dtype: int64


0.0    37301
1.0    10789
Name: gearbox, dtype: int64
# Training set
train['gearbox'] = train['gearbox'].fillna(0.0)

# For the test set, no row can be dropped, so that the predictions stay complete. The test set has only 1910 missing gearbox values; fill them with 0.0 (manual), which is by far the most common value
test['gearbox'] = test['gearbox'].fillna(0.0)
# Check that the fill worked
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 20 columns):
regDate              150000 non-null int64
model                150000 non-null float64
brand                150000 non-null int64
bodyType             150000 non-null float64
fuelType             150000 non-null float64
gearbox              150000 non-null float64
power                150000 non-null float64
kilometer            150000 non-null float64
notRepairedDamage    125676 non-null float64
regionCode           150000 non-null int64
price                150000 non-null int64
v_0                  150000 non-null float64
v_3                  150000 non-null float64
v_4                  150000 non-null float64
v_5                  150000 non-null float64
v_8                  150000 non-null float64
v_9                  150000 non-null float64
v_10                 150000 non-null float64
v_11                 150000 non-null float64
v_12                 150000 non-null float64
dtypes: float64(16), int64(4)
memory usage: 22.9 MB
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 19 columns):
regDate              50000 non-null int64
model                50000 non-null float64
brand                50000 non-null int64
bodyType             50000 non-null float64
fuelType             50000 non-null float64
gearbox              50000 non-null float64
power                50000 non-null float64
kilometer            50000 non-null float64
notRepairedDamage    41969 non-null float64
regionCode           50000 non-null int64
v_0                  50000 non-null float64
v_3                  50000 non-null float64
v_4                  50000 non-null float64
v_5                  50000 non-null float64
v_8                  50000 non-null float64
v_9                  50000 non-null float64
v_10                 50000 non-null float64
v_11                 50000 non-null float64
v_12                 50000 non-null float64
dtypes: float64(16), int64(3)
memory usage: 7.2 MB
3.2.5 Finally, handle missing notRepairedDamage values
# Count the missing values
# The share of missing values is not small in either dataset
print(train['notRepairedDamage'].isnull().value_counts())
print('\n')
print(test['notRepairedDamage'].isnull().value_counts())
False    125676
True      24324
Name: notRepairedDamage, dtype: int64


False    41969
True      8031
Name: notRepairedDamage, dtype: int64
# Look at the value distribution
print(train['notRepairedDamage'].value_counts())
print('\n')
print(test['notRepairedDamage'].value_counts())
0.0    111361
1.0     14315
Name: notRepairedDamage, dtype: int64


0.0    37249
1.0     4720
Name: notRepairedDamage, dtype: int64
# Look at the linear correlation coefficient
train[['notRepairedDamage', 'price']].corr()['price']
notRepairedDamage   -0.190623
price                1.000000
Name: price, dtype: float64
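# Draw a regression plot of the feature against price (similar to a scatter plot)
sns.regplot(x=train['notRepairedDamage'], y=train['price'])  # keyword arguments: recent seaborn versions require x and y as keywords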

[Figure: regression plot of notRepairedDamage against price]

# Oddly, across the whole training set, cars with unrepaired damage sell for more than cars whose damage has been repaired. With nearly 20 other features in play, this is probably a coincidence
# To keep things simple, still fill all missing values with 0.0, the most common value
train['notRepairedDamage'] = train['notRepairedDamage'].fillna(0.0)
test['notRepairedDamage'] = test['notRepairedDamage'].fillna(0.0)
# Finally, check the result of the fill
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 20 columns):
regDate              150000 non-null int64
model                150000 non-null float64
brand                150000 non-null int64
bodyType             150000 non-null float64
fuelType             150000 non-null float64
gearbox              150000 non-null float64
power                150000 non-null float64
kilometer            150000 non-null float64
notRepairedDamage    150000 non-null float64
regionCode           150000 non-null int64
price                150000 non-null int64
v_0                  150000 non-null float64
v_3                  150000 non-null float64
v_4                  150000 non-null float64
v_5                  150000 non-null float64
v_8                  150000 non-null float64
v_9                  150000 non-null float64
v_10                 150000 non-null float64
v_11                 150000 non-null float64
v_12                 150000 non-null float64
dtypes: float64(16), int64(4)
memory usage: 22.9 MB
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 19 columns):
regDate              50000 non-null int64
model                50000 non-null float64
brand                50000 non-null int64
bodyType             50000 non-null float64
fuelType             50000 non-null float64
gearbox              50000 non-null float64
power                50000 non-null float64
kilometer            50000 non-null float64
notRepairedDamage    50000 non-null float64
regionCode           50000 non-null int64
v_0                  50000 non-null float64
v_3                  50000 non-null float64
v_4                  50000 non-null float64
v_5                  50000 non-null float64
v_8                  50000 non-null float64
v_9                  50000 non-null float64
v_10                 50000 non-null float64
v_11                 50000 non-null float64
v_12                 50000 non-null float64
dtypes: float64(16), int64(3)
memory usage: 7.2 MB
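Since roughly one row in six lacked notRepairedDamage, filling with the most common value erases the fact that the value was unknown. One option, which this post does not use, is to record a missing-value indicator before filling. The sketch below shows the idea on made-up data; the column name `notRepairedDamage_missing` is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up column with two missing entries.
demo = pd.DataFrame({'notRepairedDamage': [0.0, np.nan, 1.0, np.nan, 0.0]})

# Record which rows were missing before the fill overwrites that information.
demo['notRepairedDamage_missing'] = demo['notRepairedDamage'].isna().astype(int)
demo['notRepairedDamage'] = demo['notRepairedDamage'].fillna(0.0)

print(demo)
```

Whether the extra column helps would have to be checked with the same cross-validation used for model selection.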

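4. Modeling and Hyperparameter Tuning

4.1 Choose three ensemble models: random forest, XGBoost, and gradient-boosted decision trees (GBDT)
rf = RandomForestRegressor(n_estimators=100, max_depth=8, random_state=1)
# Note: the original run misspelled n_estimators as n_stimators, so the outputs recorded below
# still show n_stimators, and the XGBoost score in 4.2 was obtained with the default n_estimators
xgb = XGBRegressor(n_estimators=150, max_depth=8, learning_rate=0.1, random_state=1)
gbdt = GradientBoostingRegressor(subsample=0.8, random_state=1)  # subsample < 1 can reduce variance but increases bias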

X = train.drop(['price'], axis=1)
y = train['price']
4.2 Cross-validate and compare the models
# Random forest
score_rf = -1 * cross_val_score(rf,
                           X,
                           y,
                           scoring='neg_mean_absolute_error',
                           cv=5).mean()  # mean of the fold scores

print('Mean MAE of the random forest model:', score_rf)

# XGBoost
score_xgb = -1 * cross_val_score(xgb,
                                X,
                                y,
                                scoring='neg_mean_absolute_error',
                                cv=5).mean()  # mean of the fold scores

print('Mean MAE of the XGBoost model:', score_xgb)

# Gradient-boosted trees (GBDT)
score_gbdt = -1 * cross_val_score(gbdt,
                                X,
                                y,
                                scoring='neg_mean_absolute_error',
                                cv=5).mean()  # mean of the fold scores

print('Mean MAE of the GBDT model:', score_gbdt)
Mean MAE of the random forest model: 924.43649869
Mean MAE of the XGBoost model: 616.449663619
Mean MAE of the GBDT model: 893.439059092
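The three near-identical calls above can be folded into one loop. The sketch below does this on synthetic data from `make_regression` so that it runs on its own; the data, the reduced `n_estimators`, and the omission of XGBoost (left out only to keep the example dependent on scikit-learn alone) are all assumptions of the sketch.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for X and y; sizes are kept small so the sketch runs quickly.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=1)

candidates = {
    'random_forest': RandomForestRegressor(n_estimators=20, max_depth=8, random_state=1),
    'gbdt': GradientBoostingRegressor(n_estimators=20, subsample=0.8, random_state=1),
}

mae = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo,
                             scoring='neg_mean_absolute_error', cv=5)
    # scikit-learn treats larger scores as better, so MAE comes back negated; flip the sign.
    mae[name] = -scores.mean()

# The model with the lowest mean MAE wins the comparison.
best = min(mae, key=mae.get)
print(mae, best)
```

Any estimator with the scikit-learn interface, including `XGBRegressor`, can be added to the `candidates` dictionary.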
4.3 Select the XGBoost model and tune it (grid search)
params = {'n_estimators': [150, 200, 250],
          'learning_rate': [0.1],
          'subsample': [0.5, 0.8]}

model = GridSearchCV(estimator=xgb,
                    param_grid=params,
                    scoring='neg_mean_absolute_error',
                    cv=3)
model.fit(X, y)

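# Print the best parameters (best_score_ is the negated MAE, so values closer to 0 are better)
print('Best parameters:\n', model.best_params_)
print('Best score:\n', model.best_score_)
print('Best model:\n', model.best_estimator_)
Best parameters:
 {'learning_rate': 0.1, 'n_estimators': 250, 'subsample': 0.8}
Best score:
 -587.043780247
Best model: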
 XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=8, min_child_weight=1, missing=None, n_estimators=250,
       n_jobs=1, n_stimators=150, nthread=None, objective='reg:linear',
       random_state=1, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
       seed=None, silent=True, subsample=0.8)
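To make the grid-search mechanics concrete, here is a small self-contained sketch on synthetic data. `GradientBoostingRegressor` is swapped in for `XGBRegressor` purely so the example needs only scikit-learn, and the grid values are arbitrary; none of this reproduces the numbers above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X and y.
X_demo, y_demo = make_regression(n_samples=150, n_features=4, noise=5.0, random_state=1)

search = GridSearchCV(
    estimator=GradientBoostingRegressor(random_state=1),
    param_grid={'n_estimators': [10, 20], 'subsample': [0.5, 0.8]},
    scoring='neg_mean_absolute_error',
    cv=3,
)
search.fit(X_demo, y_demo)

# best_score_ is the mean negated MAE of the best combination; flip the sign to read it as MAE.
best_mae = -search.best_score_
# Every combination in the grid is evaluated: 2 x 2 = 4 candidates.
n_candidates = len(search.cv_results_['params'])
# With the default refit=True, predict() uses best_estimator_, refitted on all the data.
preds = search.predict(X_demo[:5])

print(search.best_params_, best_mae)
```

Note that the grid in 4.3 is scored with cv=3 while 4.2 used cv=5, so the two MAE figures are not measured on the same splits.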

5. Submit the Results

predictions = model.predict(test)
result = pd.DataFrame({'SaleID': df_test['SaleID'], 'price': predictions})
result.to_csv('/home/myspace/My_submission.csv', index=False)
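Prediction only works if the test frame has the same columns, in the same order, as the frame the model was fitted on. A hedged sketch of a guard for this is shown below; the helper `build_submission`, the stand-in predictor, the made-up rows, and the temporary output path are all hypothetical.

```python
import os
import tempfile

import pandas as pd


def build_submission(sale_ids, features, train_columns, predict):
    """Reorder `features` to the training column order, predict, and return the result table."""
    missing = [c for c in train_columns if c not in features.columns]
    if missing:
        raise ValueError(f'test set lacks columns: {missing}')
    ordered = features[list(train_columns)]
    return pd.DataFrame({'SaleID': list(sale_ids), 'price': list(predict(ordered))})


# Made-up rows and a stand-in predictor; any fitted model's .predict could be passed instead.
features_demo = pd.DataFrame({'kilometer': [12.5, 15.0], 'power': [60, 163]})
fake_predict = lambda frame: frame['power'] * 10.0

result_demo = build_submission([0, 1], features_demo, ['power', 'kilometer'], fake_predict)

# Write to a temporary directory and read the file back to confirm its layout.
out_path = os.path.join(tempfile.mkdtemp(), 'submission_demo.csv')
result_demo.to_csv(out_path, index=False)
round_trip = pd.read_csv(out_path)
print(round_trip)
```

In the notebook, `train_columns` would be `X.columns` and `predict` would be `model.predict`.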