文章目录
- 使用机器学习预测推土机的售价
- 零、导入模块
- 一、EDA
-
- 1.1 查看基本信息
- 1.2 特征类型转换
- 1.3 联表+特征初筛
- 1.4 逐个查看特征
-
- 1.4.1 datasource
- 1.4.2 auctioneerID
- 1.4.3 YearMade
- 1.4.4 MachineHoursCurrentMeter
- 1.4.5 UsageBand
- 1.4.6 fiBaseModel
- 1.4.7 fiSecondaryDesc
- 1.4.8 fiModelSeries
- 1.4.9 fiModelDescriptor
- 1.4.10 ProductSize
- 1.4.11 state
- 1.4.12 ProductGroupDesc
- 1.4.13 Drive_System
- 1.4.14 Enclosure
- 1.4.15 Forks
- 1.4.16 Pad_Type
- 1.4.17 Ride_Control
- 1.4.18 Stick
- 1.4.19 Transmission
- 1.4.20 Turbocharged
- 1.4.21 Blade_Extension
- 1.4.22 Blade_Width
- 1.4.23 Enclosure_Type
- 1.4.24 Engine_Horsepower
- 1.4.24 Hydraulics
- 1.4.25 Pushblock
- 1.4.26 Ripper
- 1.4.27 Scarifier
- 1.4.28 Tip_Control
- 1.4.29 Tire_Size
- 1.4.30 Coupler
- 1.4.31 Coupler_System
- 1.4.32 Grouser_Tracks
- 1.4.33 Hydraulics_Flow
- 1.4.34 Track_Type
- 1.4.35 Undercarriage_Pad_Width
- 1.4.36 Stick_Length
- 1.4.37 Thumb
- 1.4.38 Pattern_Changer
- 1.4.39 Grouser_Type
- 1.4.40 Backhoe_Mounting
- 1.4.41 Blade_Type
- 1.4.42 Travel_Controls
- 1.4.43 Differential_Type
- 1.4.44 Steering_Controls
- 1.4.45 fiManufacturerDesc
- 1.4.46 PrimarySizeBasis
- 1.4.47 PrimaryLower
- 2. 数据清洗+数据预处理
-
- 2.1 创建衍生变量
-
- 2.1.1 SaleDate
- 2.1.2 Stick、Turbocharged
- 2.1.3 Blade_Extension、Blade_Width、Enclosure_Type、Engine_Horsepower
- 2.1.4 Pushblock、Scarifier、Tip_Control
- 2.1.5 Coupler_System、Grouser_Tracks、Hydraulics_Flow
- 2.1.6 Track_Type Undercarriage_Pad_Width Stick_Length Thumb Pattern_Changer Grouser_Type
- 2.1.7 Backhoe_Mounting Blade_Type Travel_Controls
- 2.1.8 Differential_Type Steering_Controls
- 2.1.9 PrimarySizeBasis PrimaryLower
- 2.1.10 fiBaseModel
- 2.1.11 ProductGroupDesc
- 2.2 独热编码
- 2.3 标签编码
- 2.4 空值和异常值
- 2.5 删除重复列
- 3. 建模
- 4. 特征筛选
- 5. 最终成果
使用机器学习预测推土机的售价
1. 定义问题
考虑到推土机的特性,利用过去的数据,我们能多大程度上预测它未来的价格?
2. 数据来源
kaggle:https://www.kaggle.com/competitions/bluebook-for-bulldozers/overview
The data for this competition is split into three parts:
- Train.csv is the training set, which contains data through the end of 2011.
- Valid.csv is the validation set, which contains data from January 1, 2012 - April 30, 2012 You make predictions on this set throughout the majority of the competition. Your score on this set is used to create the public leaderboard.
- Test.csv is the test set, which won’t be released until the last week of the competition. It contains data from May 1, 2012 - November 2012. Your score on the test set determines your final rank for the competition.
3. 评价标准
The evaluation metric for this competition is the RMSLE (root mean squared log error) between the actual and predicted auction prices.
更多信息:https://www.kaggle.com/competitions/bluebook-for-bulldozers/overview/evaluation
注意:多数回归模型的评价标准都是减小误差。比如这次的目标就是最小化RMSLE。
4. 使用的特征
特征过多,请自行进入kaggle项目主页查看。或点击如下谷歌表格链接:https://docs.google.com/spreadsheets/d/1EIbdGa4S_46USXgg0OHX5jgTc8ld9fTHwPyi_VOV1as/edit#gid=0
零、导入模块
# EDA
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
sns.set()
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
%config InlineBackend.figure_config = 'svg'
warnings.filterwarnings("ignore")
# 数据预处理
from sklearn.preprocessing import LabelEncoder
# sklearn模型
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
# 模型评估
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_log_error, mean_absolute_error, mean_squared_error, r2_score
一、EDA
bulldozer_df = pd.read_csv('bluebook-for-bulldozers/TrainAndValid.csv',
low_memory = False)
appendix_df = pd.read_csv('bluebook-for-bulldozers/Machine_Appendix.csv',
low_memory = False)
1.1 查看基本信息
# 查看各字段类型
bulldozer_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 412698 entries, 0 to 412697
Data columns (total 53 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SalesID 412698 non-null int64
1 SalePrice 412698 non-null float64
2 MachineID 412698 non-null int64
3 ModelID 412698 non-null int64
4 datasource 412698 non-null int64
5 auctioneerID 392562 non-null float64
6 YearMade 412698 non-null int64
7 MachineHoursCurrentMeter 147504 non-null float64
8 UsageBand 73670 non-null object
9 saledate 412698 non-null object
10 fiModelDesc 412698 non-null object
11 fiBaseModel 412698 non-null object
12 fiSecondaryDesc 271971 non-null object
13 fiModelSeries 58667 non-null object
14 fiModelDescriptor 74816 non-null object
15 ProductSize 196093 non-null object
16 fiProductClassDesc 412698 non-null object
17 state 412698 non-null object
18 ProductGroup 412698 non-null object
19 ProductGroupDesc 412698 non-null object
20 Drive_System 107087 non-null object
21 Enclosure 412364 non-null object
22 Forks 197715 non-null object
23 Pad_Type 81096 non-null object
24 Ride_Control 152728 non-null object
25 Stick 81096 non-null object
26 Transmission 188007 non-null object
27 Turbocharged 81096 non-null object
28 Blade_Extension 25983 non-null object
29 Blade_Width 25983 non-null object
30 Enclosure_Type 25983 non-null object
31 Engine_Horsepower 25983 non-null object
32 Hydraulics 330133 non-null object
33 Pushblock 25983 non-null object
34 Ripper 106945 non-null object
35 Scarifier 25994 non-null object
36 Tip_Control 25983 non-null object
37 Tire_Size 97638 non-null object
38 Coupler 220679 non-null object
39 Coupler_System 44974 non-null object
40 Grouser_Tracks 44875 non-null object
41 Hydraulics_Flow 44875 non-null object
42 Track_Type 102193 non-null object
43 Undercarriage_Pad_Width 102916 non-null object
44 Stick_Length 102261 non-null object
45 Thumb 102332 non-null object
46 Pattern_Changer 102261 non-null object
47 Grouser_Type 102193 non-null object
48 Backhoe_Mounting 80712 non-null object
49 Blade_Type 81875 non-null object
50 Travel_Controls 81877 non-null object
51 Differential_Type 71564 non-null object
52 Steering_Controls 71522 non-null object
dtypes: float64(3), int64(5), object(45)
memory usage: 166.9+ MB
# 查看缺失值
bulldozer_df.isna().sum()
SalesID 0
SalePrice 0
MachineID 0
ModelID 0
datasource 0
auctioneerID 20136
YearMade 0
MachineHoursCurrentMeter 265194
UsageBand 339028
saledate 0
fiModelDesc 0
fiBaseModel 0
fiSecondaryDesc 140727
fiModelSeries 354031
fiModelDescriptor 337882
ProductSize 216605
fiProductClassDesc 0
state 0
ProductGroup 0
ProductGroupDesc 0
Drive_System 305611
Enclosure 334
Forks 214983
Pad_Type 331602
Ride_Control 259970
Stick 331602
Transmission 224691
Turbocharged 331602
Blade_Extension 386715
Blade_Width 386715
Enclosure_Type 386715
Engine_Horsepower 386715
Hydraulics 82565
Pushblock 386715
Ripper 305753
Scarifier 386704
Tip_Control 386715
Tire_Size 315060
Coupler 192019
Coupler_System 367724
Grouser_Tracks 367823
Hydraulics_Flow 367823
Track_Type 310505
Undercarriage_Pad_Width 309782
Stick_Length 310437
Thumb 310366
Pattern_Changer 310437
Grouser_Type 310505
Backhoe_Mounting 331986
Blade_Type 330823
Travel_Controls 330821
Differential_Type 341134
Steering_Controls 341176
dtype: int64
# 查看标签分布
bulldozer_df['SalePrice'].hist()
<AxesSubplot:>
1.2 特征类型转换
# 改为帕斯卡命名
bulldozer_df.rename(columns={
'saledate': 'SaleDate'}, inplace=True)
bulldozer_df['SaleDate'] = pd.to_datetime(bulldozer_df['SaleDate'])
# 按时间查看售价,只看最近几年的,没有明显的上升趋势,只有上下波动
plt.figure(figsize=(20,15))
pd.pivot_table(bulldozer_df[200000::1000], index='SaleDate', values='SalePrice').plot()
plt.show()
2005年前后半年和2008年整个一年销量都比较差,除此之外看不出太多信息
# 将数据集按时间排序
bulldozer_df.sort_values(by='SaleDate', inplace=True)
bulldozer_df['SaleDate'].head(20)
205615 1989-01-17
274835 1989-01-31
141296 1989-01-31
212552 1989-01-31
62755 1989-01-31
54653 1989-01-31
81383 1989-01-31
204924 1989-01-31
135376 1989-01-31
113390 1989-01-31
113394 1989-01-31
116419 1989-01-31
32138 1989-01-31
127610 1989-01-31
76171 1989-01-31
127000 1989-01-31
128130 1989-01-31
127626 1989-01-31
55455 1989-01-31
55454 1989-01-31
Name: SaleDate, dtype: datetime64[ns]
1.3 联表+特征初筛
这个数据集比较特殊,还有个appendix表,里面是一些推土机的配件信息,而且这个信息和train表有重复特征,重复特征里面还有匹配不上的情况,现在先联表上来看看
# 制作数据集副本,这是为了方便对数据集做了什么操作后,仍然可以获取原始数据,而不用从头读数据
bd_df = bulldozer_df.copy()
app_df = appendix_df.copy()
# SalesID列丢弃
bd_df.drop('SalesID', axis=1, inplace=True)
# 定义一个查看出入的函数
def check_difference(df1, df2, target_col, on_col):
temp_df = pd.merge(df1[[on_col, target_col]], df2[[on_col, target_col]], how='left', on=on_col)
return temp_df[(temp_df[target_col+'_x'] != temp_df[target_col+'_y'])]
# 定义一个合并时互补的函数,冲突时保留df1的数据
def combine(df1, df2, target_col, on_col):
temp_df0 = pd.merge(df1[[on_col, target_col]], df2[[on_col, target_col]], how='left', on=on_col)
temp_df0.fillna('', inplace=True)
temp_df1=temp_df0[(temp_df0[target_col+'_x']== '') & (temp_df0[target_col+'_y']!='')]
temp_df0[target_col+'_x'].loc[temp_df1.index] = temp_df1[target_col+'_y']
df1[target_col] = temp_df0[target_col+'_x']
df2.drop(columns=[target_col], inplace=True)
return df1
1.3.1 删除包含重复信息的特征
ProductGroup是ProductGroupDesc的首字母缩写版,选择保留后者
fiManufacturerID和fiManufacturerDesc包含的信息一样,选择保留前者
bd_df.drop(columns=['ProductGroup', 'fiProductClassDesc'], inplace=True)
app_df.drop(columns=['ModelID', 'fiModelDesc', 'ProductGroup', 'fiManufacturerDesc'], inplace=True)
1.3.2 fiBaseModel
# 查看枚举值,后续需要做分箱合并处理
bd_df['fiBaseModel'].value_counts()
580 20179
310 17886
D6 13527
416 12900
D5 9636
...
56 1
B4230 1
IS30 1
MM555 1
WLK15 1
Name: fiBaseModel, Length: 1961, dtype: int64
# 查看出入部分有无空值
check_difference(bd_df, app_df, 'fiBaseModel', 'MachineID').isna().sum()
MachineID 0
fiBaseModel_x 0
fiBaseModel_y 0
dtype: int64
# 查看出入部分
check_difference(bd_df, app_df, 'fiBaseModel', 'MachineID')
MachineID | fiBaseModel_x | fiBaseModel_y | |
---|---|---|---|
71 | 1523610 | WA150 | HD465 |
98 | 1059447 | WA300 | PC100 |
123 | 1303779 | 415 | 862 |
128 | 1340389 | MS240 | MS120 |
179 | 1208516 | D31 | PC100 |
... | ... | ... | ... |
412474 | 2308891 | 1845 | 75 |
412484 | 2287735 | T135 | T133 |
412509 | 2292146 | 450 | 465 |
412655 | 1846321 | TB135 | TB125 |
412695 | 1918416 | 337 | 530 |
12452 rows × 3 columns
# 选择保留bd_df表的数据
app_df.drop(columns=['fiBaseModel'], inplace=True)
1.3.3 fiSecondaryDesc
# 查看枚举值,后续需要做分箱合并处理
bd_df['fiSecondaryDesc'].value_counts()
C 44431
B 40165
G 37915
H 24729
E 21532
...
BLGPPS 1
MSR 1
LC7A 1
CL 1
BH 1
Name: fiSecondaryDesc, Length: 177, dtype: int64
# 查看不一致部分
check_difference(bd_df, app_df, 'fiSecondaryDesc', 'MachineID')
MachineID | fiSecondaryDesc_x | fiSecondaryDesc_y | |
---|---|---|---|
0 | 1126363 | NaN | NaN |
1 | 1194089 | NaN | NaN |
3 | 1327630 | NaN | NaN |
6 | 1082797 | NaN | NaN |
7 | 1527216 | NaN | NaN |
... | ... | ... | ... |
412690 | 1823846 | NaN | NaN |
412691 | 1278794 | NaN | NaN |
412692 | 1792049 | NaN | NaN |
412694 | 1919104 | NaN | NaN |
412695 | 1918416 | G | NaN |
147009 rows × 3 columns
# 合并互补
bd_df = combine(bd_df, app_df, 'fiSecondaryDesc', 'MachineID')
1.3.4 fiModelSeries
# 查看枚举值,后续需要做分箱合并处理
bd_df['fiModelSeries'].value_counts()
II 13770
LC 9175
III 5351
-1 4646
-2 4033
...
LL 1
6F 1
-2LC 1
-5A 1
VII 1
Name: fiModelSeries, Length: 123, dtype: int64
# 查看不一致部分
check_difference(bd_df, app_df, 'fiModelSeries', 'MachineID')
MachineID | fiModelSeries_x | fiModelSeries_y | |
---|---|---|---|
0 | 1126363 | NaN | NaN |
1 | 1194089 | NaN | NaN |
2 | 1473654 | NaN | NaN |
3 | 1327630 | NaN | NaN |
4 | 1336053 | NaN | NaN |
... | ... | ... | ... |
412693 | 1915521 | NaN | NaN |
412694 | 1919104 | NaN | NaN |
412695 | 1918416 | NaN | NaN |
412696 | 509560 | NaN | NaN |
412697 | 1869284 | NaN | NaN |
362366 rows × 3 columns
# 合并互补
bd_df = combine(bd_df, app_df, 'fiModelSeries', 'MachineID')
1.3.5 fiModelDescriptor
# 查看枚举值
bd_df['fiModelDescriptor'].value_counts()
L 16464
LGP 16143
LC 13295
XL 6700
6 2944
...
K5 1
HighLift 1
High Lift 1
III 1
SL 1
Name: fiModelDescriptor, Length: 140, dtype: int64
# 查看不一致部分
check_difference(bd_df, app_df, 'fiModelDescriptor', 'MachineID')
MachineID | fiModelDescriptor_x | fiModelDescriptor_y | |
---|---|---|---|
0 | 1126363 | NaN | NaN |
1 | 1194089 | NaN | NaN |
2 | 1473654 | NaN | NaN |
3 | 1327630 | NaN | NaN |
4 | 1336053 | NaN | NaN |
... | ... | ... | ... |
412693 | 1915521 | NaN | NaN |
412694 | 1919104 | NaN | NaN |
412695 | 1918416 | NaN | NaN |
412696 | 509560 | NaN | NaN |
412697 | 1869284 | NaN | NaN |
342069 rows × 3 columns
# 合并互补
bd_df = combine(bd_df, app_df, 'fiModelDescriptor', 'MachineID')
1.3.6 ProductGroupDesc
# 查看枚举值
bd_df['ProductGroupDesc'].value_counts()
Track Excavators 104230
Track Type Tractors 82582
Backhoe Loaders 81401
Wheel Loader 73216
Skid Steer Loaders 45011
Motor Graders 26258
Name: ProductGroupDesc, dtype: int64
# 枚举值太多了,而且bd_df里的数据并无空值,这里就选择不互补了,直接保留bd_df的数据
app_df['ProductGroupDesc'].value_counts()
Track Excavators 89094
Backhoe Loaders 74074
Track Type Tractors 67362
Wheel Loader 62426
Skid Steer Loaders 42121
Motor Graders 22602
Track Loaders 158
Articulated Trucks 109
Ag Tractors 103
Wheel Tractor Scraper 102
Off Highway Trucks 81
Multi Terrain Loaders 66
Wheel Excavator 66
Forklift 53
Skidders 46
Wheel Feller Buncher 31
Forestry Log Loaders 25
Pipelayers 11
Telehandler 9
Wheel Dozer 7
Knuckleboom Loaders 6
Track Feller Bunchers 6
Vibratory Double Drum Asphalt 4
Vibratory Single Drum Asphalt 4
Work Tool 3
Pneumatic Tired Compactor 3
Compactors 3
Vibratory Single Drum Pad 3
Tandem Roller Static 2
Harvesters 2
Forwarders 2
Engine, Industrial OEM 2
Track Harvesters 1
Delimber Forestry 1
Asphalt/Concrete Pavers 1
Vibratory Single Drum Smooth 1
Crane/Dragline 1
Forestry Excavators 1
Cold Planers 1
Name: ProductGroupDesc, dtype: int64
app_df.drop(columns=['ProductGroupDesc'], inplace=True)
1.3.7 MfgYear
# 两边同一个特征名字不一样,先改名
app_df.rename(columns={
'MfgYear': 'YearMade'}, inplace=True)
# 查看枚举值,发现异常值1000
bd_df['YearMade'].value_counts()
1000 39391
2005 22096
1998 21751
2004 20914
1999 19274
...
2012 1
1949 1
1942 1
2013 1
1937 1
Name: YearMade, Length: 73, dtype: int64
# 查看不一致部分
check_difference(bd_df, app_df, 'YearMade', 'MachineID')
MachineID | YearMade_x | YearMade_y | |
---|---|---|---|
52 | 1531656 | 1975 | 1987.0 |
58 | 1078132 | 1967 | 1966.0 |
71 | 1523610 | 1986 | 2007.0 |
98 | 1059447 | 1984 | 1996.0 |
122 | 1525180 | 1984 | 1985.0 |
... | ... | ... | ... |
412651 | 1631729 | 1000 | 1990.0 |
412654 | 1912106 | 1000 | 1996.0 |
412655 | 1846321 | 1000 | 2005.0 |
412667 | 1897535 | 1998 | 1999.0 |
412696 | 509560 | 1993 | 1990.0 |
36406 rows × 3 columns
这种不一致通常是因为bd_df里面有重复的MachineID,即同一种推土机被卖了多次,而app_df里只有唯一MachineID所导致,决定保留bd_df的数据,不动它
app_df.drop('YearMade', axis=1, inplace=True)
1.3.8 fiManufacturerID、PrimarySizeBasis、PrimaryLower、PrimaryUpper
这三个特征原训练集上都没有,直接联过去
# total_df = bd_df.copy()
total_df = pd.merge(bd_df, app_df, how='left', on='MachineID')
# 动力上限和下限留一个就足够区分了,取下限
total_df.drop(columns=['PrimaryUpper'], inplace=True)
total_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 412698 entries, 0 to 412697
Data columns (total 54 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SalePrice 412698 non-null float64
1 MachineID 412698 non-null int64
2 ModelID 412698 non-null int64
3 datasource 412698 non-null int64
4 auctioneerID 392562 non-null float64
5 YearMade 412698 non-null int64
6 MachineHoursCurrentMeter 147504 non-null float64
7 UsageBand 73670 non-null object
8 SaleDate 412698 non-null datetime64[ns]
9 fiModelDesc 412698 non-null object
10 fiBaseModel 412698 non-null object
11 fiSecondaryDesc 412698 non-null object
12 fiModelSeries 412698 non-null object
13 fiModelDescriptor 412698 non-null object
14 ProductSize 196093 non-null object
15 state 412698 non-null object
16 ProductGroupDesc 412698 non-null object
17 Drive_System 107087 non-null object
18 Enclosure 412364 non-null object
19 Forks 197715 non-null object
20 Pad_Type 81096 non-null object
21 Ride_Control 152728 non-null object
22 Stick 81096 non-null object
23 Transmission 188007 non-null object
24 Turbocharged 81096 non-null object
25 Blade_Extension 25983 non-null object
26 Blade_Width 25983 non-null object
27 Enclosure_Type 25983 non-null object
28 Engine_Horsepower 25983 non-null object
29 Hydraulics 330133 non-null object
30 Pushblock 25983 non-null object
31 Ripper 106945 non-null object
32 Scarifier 25994 non-null object
33 Tip_Control 25983 non-null object
34 Tire_Size 97638 non-null object
35 Coupler 220679 non-null object
36 Coupler_System 44974 non-null object
37 Grouser_Tracks 44875 non-null object
38 Hydraulics_Flow 44875 non-null object
39 Track_Type 102193 non-null object
40 Undercarriage_Pad_Width 102916 non-null object
41 Stick_Length 102261 non-null object
42 Thumb 102332 non-null object
43 Pattern_Changer 102261 non-null object
44 Grouser_Type 102193 non-null object
45 Backhoe_Mounting 80712 non-null object
46 Blade_Type 81875 non-null object
47 Travel_Controls 81877 non-null object
48 Differential_Type 71564 non-null object
49 Steering_Controls 71522 non-null object
50 fiProductClassDesc 412698 non-null object
51 fiManufacturerID 412698 non-null int64
52 PrimarySizeBasis 407439 non-null object
53 PrimaryLower 407439 non-null float64
dtypes: datetime64[ns](1), float64(4), int64(5), object(44)
memory usage: 173.2+ MB
raise KeyError
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Input In [903], in <cell line: 1>()
----> 1 raise KeyError
KeyError:
1.4 逐个查看特征
1.4.1 datasource
# 173怀疑是172写错成173
total_df['datasource'].value_counts()
132 260776
136 75491
149 33325
121 25191
172 17914
173 1
Name: datasource, dtype: int64
1.4.2 auctioneerID
# 后续要合并
total_df['auctioneerID'].value_counts()
1.0 192773
2.0 57441
3.0 30288
4.0 20877
99.0 12042
6.0 11950
7.0 7847
8.0 7419
5.0 7002
10.0 5876
9.0 4764
11.0 3823
12.0 3610
13.0 3068
18.0 2359
14.0 2277
20.0 2238
19.0 2074
16.0 1807
15.0 1742
21.0 1601
22.0 1429
24.0 1357
23.0 1322
17.0 1275
27.0 1150
25.0 959
28.0 860
26.0 796
0.0 536
Name: auctioneerID, dtype: int64
# 查看auctioneerID和价格的关系
temp = pd.pivot_table(total_df, index='auctioneerID', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('median', 'SalePrice'))
count | mean | median | |
---|---|---|---|
SalePrice | SalePrice | SalePrice | |
auctioneerID | |||
22.0 | 1429 | 18230.020994 | 14000.0 |
18.0 | 2359 | 18839.799491 | 14500.0 |
21.0 | 1601 | 19645.315428 | 15000.0 |
25.0 | 959 | 22478.519291 | 16000.0 |
9.0 | 4764 | 22515.003149 | 16775.0 |
14.0 | 2277 | 20804.797980 | 17000.0 |
17.0 | 1275 | 22974.941176 | 18000.0 |
12.0 | 3610 | 24762.936288 | 18250.0 |
28.0 | 860 | 28312.674419 | 18750.0 |
20.0 | 2238 | 24737.811439 | 19250.0 |
99.0 | 12042 | 26958.806112 | 19500.0 |
13.0 | 3068 | 27017.441330 | 20500.0 |
16.0 | 1807 | 26261.427781 | 20500.0 |
27.0 | 1150 | 27606.739130 | 22500.0 |
4.0 | 20877 | 29825.726876 | 23000.0 |
23.0 | 1322 | 30613.729198 | 23000.0 |
2.0 | 57441 | 29023.051670 | 23000.0 |
10.0 | 5876 | 29561.546971 | 23000.0 |
0.0 | 536 | 29979.850746 | 23000.0 |
15.0 | 1742 | 29986.337543 | 23500.0 |
5.0 | 7002 | 29150.217795 | 23500.0 |
24.0 | 1357 | 33041.156964 | 24500.0 |
1.0 | 192773 | 32684.870075 | 25000.0 |
8.0 | 7419 | 32477.978838 | 25000.0 |
11.0 | 3823 | 32707.088674 | 25500.0 |
3.0 | 30288 | 33596.962592 | 26000.0 |
6.0 | 11950 | 34708.070628 | 27000.0 |
26.0 | 796 | 36157.412060 | 28000.0 |
7.0 | 7847 | 36288.450236 | 28500.0 |
19.0 | 2074 | 42715.766635 | 34000.0 |
# 画图查看
temp.sort_values(by=('median', 'SalePrice'))[('median', 'SalePrice')].plot(kind='bar')
plt.show()
1.4.3 YearMade
total_df['YearMade'] = total_df['YearMade'].astype(int)
total_df['YearMade'].value_counts()
1000 39391
2005 22096
1998 21751
2004 20914
1999 19274
...
2012 1
1949 1
1942 1
2013 1
1937 1
Name: YearMade, Length: 73, dtype: int64
plt.figure(figsize=(14,10))
sns.countplot(total_df['YearMade'])
plt.xticks(rotation=90)
plt.show()
第一台拖拉机1904年才发明出来,1904年前的属于异常数据。同时该数据集截至年份是2012年,大于2012的属于异常。异常值后续增加一个新的衍生变量YearMade_is_error区分。
生产时间大于销售时间的,暂时认为是提前订货,不处理。
1.4.4 MachineHoursCurrentMeter
# 查看枚举值
total_df['MachineHoursCurrentMeter'].value_counts()
0.0 73834
2000.0 124
1000.0 117
24.0 115
1500.0 101
...
10834.0 1
3499.0 1
26270.0 1
26901.0 1
17920.0 1
Name: MachineHoursCurrentMeter, Length: 15633, dtype: int64
1.4.5 UsageBand
# 查看枚举值
total_df['UsageBand'].value_counts()
Medium 35832
Low 25311
High 12527
Name: UsageBand, dtype: int64
1.4.6 fiBaseModel
total_df['fiBaseModel'].value_counts()
580 20179
310 17886
D6 13527
416 12900
D5 9636
...
56 1
B4230 1
IS30 1
MM555 1
WLK15 1
Name: fiBaseModel, Length: 1961, dtype: int64
1.4.7 fiSecondaryDesc
total_df['fiSecondaryDesc'].value_counts().head(10)
136420
C 44658
B 40446
G 38139
H 24759
E 21944
D 20132
F 9454
K 8089
A 5968
Name: fiSecondaryDesc, dtype: int64
temp = pd.pivot_table(total_df, index='fiSecondaryDesc', values='SalePrice', aggfunc