【Python机器学习】回归模型：推土机售价预测

原创

已于 2022-09-09 19:53:59 修改 · 6.5k 阅读

11 ·

CC 4.0 BY-SA版权

文章标签：

#机器学习 #python #回归 #XGBOOST #逐步回归

于 2022-09-09 18:15:58 首次发布

文章目录

使用机器学习预测推土机的售价
零、导入模块
一、EDA
2. 数据清洗+数据预处理
3. 建模
- 3.1 RF
- 3.2 XGB
4. 特征筛选
5. 最终成果

使用机器学习预测推土机的售价

1. 定义问题

考虑到推土机的特性，利用过去的数据，我们能多大程度上预测它未来的价格？

2. 数据来源

kaggle：https://www.kaggle.com/competitions/bluebook-for-bulldozers/overview

The data for this competition is split into three parts:

Train.csv is the training set, which contains data through the end of 2011.
Valid.csv is the validation set, which contains data from January 1, 2012 - April 30, 2012 You make predictions on this set throughout the majority of the competition. Your score on this set is used to create the public leaderboard.
Test.csv is the test set, which won’t be released until the last week of the competition. It contains data from May 1, 2012 - November 2012. Your score on the test set determines your final rank for the competition.

3. 评价标准

The evaluation metric for this competition is the RMSLE (root mean squared log error) between the actual and predicted auction prices.

更多信息：https://www.kaggle.com/competitions/bluebook-for-bulldozers/overview/evaluation

注意：多数回归模型的评价标准都是减小误差。比如这次的目标就是最小化RMSLE。

4. 使用的特征

特征过多，请自行进入kaggle项目主页查看。或点击如下谷歌表格链接：https://docs.google.com/spreadsheets/d/1EIbdGa4S_46USXgg0OHX5jgTc8ld9fTHwPyi_VOV1as/edit#gid=0

零、导入模块

# EDA
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
sns.set()
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
%config InlineBackend.figure_config = 'svg'
warnings.filterwarnings("ignore")

# 数据预处理
from sklearn.preprocessing import LabelEncoder

# sklearn模型
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

# 模型评估
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_log_error, mean_absolute_error, mean_squared_error, r2_score

一、EDA

bulldozer_df = pd.read_csv('bluebook-for-bulldozers/TrainAndValid.csv',
                           low_memory = False)
appendix_df = pd.read_csv('bluebook-for-bulldozers/Machine_Appendix.csv',
                           low_memory = False)

1.1 查看基本信息

# 查看各字段类型
bulldozer_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 412698 entries, 0 to 412697
Data columns (total 53 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   SalesID                   412698 non-null  int64  
 1   SalePrice                 412698 non-null  float64
 2   MachineID                 412698 non-null  int64  
 3   ModelID                   412698 non-null  int64  
 4   datasource                412698 non-null  int64  
 5   auctioneerID              392562 non-null  float64
 6   YearMade                  412698 non-null  int64  
 7   MachineHoursCurrentMeter  147504 non-null  float64
 8   UsageBand                 73670 non-null   object 
 9   saledate                  412698 non-null  object 
 10  fiModelDesc               412698 non-null  object 
 11  fiBaseModel               412698 non-null  object 
 12  fiSecondaryDesc           271971 non-null  object 
 13  fiModelSeries             58667 non-null   object 
 14  fiModelDescriptor         74816 non-null   object 
 15  ProductSize               196093 non-null  object 
 16  fiProductClassDesc        412698 non-null  object 
 17  state                     412698 non-null  object 
 18  ProductGroup              412698 non-null  object 
 19  ProductGroupDesc          412698 non-null  object 
 20  Drive_System              107087 non-null  object 
 21  Enclosure                 412364 non-null  object 
 22  Forks                     197715 non-null  object 
 23  Pad_Type                  81096 non-null   object 
 24  Ride_Control              152728 non-null  object 
 25  Stick                     81096 non-null   object 
 26  Transmission              188007 non-null  object 
 27  Turbocharged              81096 non-null   object 
 28  Blade_Extension           25983 non-null   object 
 29  Blade_Width               25983 non-null   object 
 30  Enclosure_Type            25983 non-null   object 
 31  Engine_Horsepower         25983 non-null   object 
 32  Hydraulics                330133 non-null  object 
 33  Pushblock                 25983 non-null   object 
 34  Ripper                    106945 non-null  object 
 35  Scarifier                 25994 non-null   object 
 36  Tip_Control               25983 non-null   object 
 37  Tire_Size                 97638 non-null   object 
 38  Coupler                   220679 non-null  object 
 39  Coupler_System            44974 non-null   object 
 40  Grouser_Tracks            44875 non-null   object 
 41  Hydraulics_Flow           44875 non-null   object 
 42  Track_Type                102193 non-null  object 
 43  Undercarriage_Pad_Width   102916 non-null  object 
 44  Stick_Length              102261 non-null  object 
 45  Thumb                     102332 non-null  object 
 46  Pattern_Changer           102261 non-null  object 
 47  Grouser_Type              102193 non-null  object 
 48  Backhoe_Mounting          80712 non-null   object 
 49  Blade_Type                81875 non-null   object 
 50  Travel_Controls           81877 non-null   object 
 51  Differential_Type         71564 non-null   object 
 52  Steering_Controls         71522 non-null   object 
dtypes: float64(3), int64(5), object(45)
memory usage: 166.9+ MB

# 查看缺失值
bulldozer_df.isna().sum()

SalesID                          0
SalePrice                        0
MachineID                        0
ModelID                          0
datasource                       0
auctioneerID                 20136
YearMade                         0
MachineHoursCurrentMeter    265194
UsageBand                   339028
saledate                         0
fiModelDesc                      0
fiBaseModel                      0
fiSecondaryDesc             140727
fiModelSeries               354031
fiModelDescriptor           337882
ProductSize                 216605
fiProductClassDesc               0
state                            0
ProductGroup                     0
ProductGroupDesc                 0
Drive_System                305611
Enclosure                      334
Forks                       214983
Pad_Type                    331602
Ride_Control                259970
Stick                       331602
Transmission                224691
Turbocharged                331602
Blade_Extension             386715
Blade_Width                 386715
Enclosure_Type              386715
Engine_Horsepower           386715
Hydraulics                   82565
Pushblock                   386715
Ripper                      305753
Scarifier                   386704
Tip_Control                 386715
Tire_Size                   315060
Coupler                     192019
Coupler_System              367724
Grouser_Tracks              367823
Hydraulics_Flow             367823
Track_Type                  310505
Undercarriage_Pad_Width     309782
Stick_Length                310437
Thumb                       310366
Pattern_Changer             310437
Grouser_Type                310505
Backhoe_Mounting            331986
Blade_Type                  330823
Travel_Controls             330821
Differential_Type           341134
Steering_Controls           341176
dtype: int64

# 查看标签分布
bulldozer_df['SalePrice'].hist()

<AxesSubplot:>

请添加图片描述

1.2 特征类型转换

# 改为帕斯卡命名
bulldozer_df.rename(columns={
   
   'saledate': 'SaleDate'}, inplace=True)

bulldozer_df['SaleDate'] = pd.to_datetime(bulldozer_df['SaleDate'])

# 按时间查看售价，只看最近几年的，没有明显的上升趋势，只有上下波动
plt.figure(figsize=(20,15))
pd.pivot_table(bulldozer_df[200000::1000], index='SaleDate', values='SalePrice').plot()
plt.show()

请添加图片描述

2005年前后半年和2008年整个一年销量都比较差，除此之外看不出太多信息

# 将数据集按时间排序
bulldozer_df.sort_values(by='SaleDate', inplace=True)
bulldozer_df['SaleDate'].head(20)

205615   1989-01-17
274835   1989-01-31
141296   1989-01-31
212552   1989-01-31
62755    1989-01-31
54653    1989-01-31
81383    1989-01-31
204924   1989-01-31
135376   1989-01-31
113390   1989-01-31
113394   1989-01-31
116419   1989-01-31
32138    1989-01-31
127610   1989-01-31
76171    1989-01-31
127000   1989-01-31
128130   1989-01-31
127626   1989-01-31
55455    1989-01-31
55454    1989-01-31
Name: SaleDate, dtype: datetime64[ns]

1.3 联表+特征初筛

这个数据集比较特殊，还有个appendix表，里面是一些推土机的配件信息，而且这个信息和train表有重复特征，重复特征里面还有匹配不上的情况，现在先联表上来看看

# 制作数据集副本，这是为了方便对数据集做了什么操作后，仍然可以获取原始数据，而不用从头读数据
bd_df = bulldozer_df.copy()
app_df = appendix_df.copy()

# SalesID列丢弃
bd_df.drop('SalesID', axis=1, inplace=True)

# 定义一个查看出入的函数
def check_difference(df1, df2, target_col, on_col):
    temp_df = pd.merge(df1[[on_col, target_col]], df2[[on_col, target_col]], how='left', on=on_col)
    return temp_df[(temp_df[target_col+'_x'] != temp_df[target_col+'_y'])]

# 定义一个合并时互补的函数，冲突时保留df1的数据
def combine(df1, df2, target_col, on_col):
    temp_df0 = pd.merge(df1[[on_col, target_col]], df2[[on_col, target_col]], how='left', on=on_col)
    temp_df0.fillna('', inplace=True)
    temp_df1=temp_df0[(temp_df0[target_col+'_x']== '') & (temp_df0[target_col+'_y']!='')]
    temp_df0[target_col+'_x'].loc[temp_df1.index] = temp_df1[target_col+'_y']
    df1[target_col] = temp_df0[target_col+'_x']
    df2.drop(columns=[target_col], inplace=True)
    return df1

1.3.1 删除包含重复信息的特征

ProductGroup是ProductGroupDesc的首字母缩写版，选择保留后者

fiManufacturerID和fiManufacturerDesc包含的信息一样，选择保留前者

bd_df.drop(columns=['ProductGroup', 'fiProductClassDesc'], inplace=True)
app_df.drop(columns=['ModelID', 'fiModelDesc', 'ProductGroup', 'fiManufacturerDesc'], inplace=True)

1.3.2 fiBaseModel

# 查看枚举值，后续需要做分箱合并处理
bd_df['fiBaseModel'].value_counts()

580      20179
310      17886
D6       13527
416      12900
D5        9636
         ...  
56           1
B4230        1
IS30         1
MM555        1
WLK15        1
Name: fiBaseModel, Length: 1961, dtype: int64

# 查看出入部分有无空值
check_difference(bd_df, app_df, 'fiBaseModel', 'MachineID').isna().sum()

MachineID        0
fiBaseModel_x    0
fiBaseModel_y    0
dtype: int64

# 查看出入部分
check_difference(bd_df, app_df, 'fiBaseModel', 'MachineID')

	MachineID	fiBaseModel_x	fiBaseModel_y
71	1523610	WA150	HD465
98	1059447	WA300	PC100
123	1303779	415	862
128	1340389	MS240	MS120
179	1208516	D31	PC100
...	...	...	...
412474	2308891	1845	75
412484	2287735	T135	T133
412509	2292146	450	465
412655	1846321	TB135	TB125
412695	1918416	337	530

12452 rows × 3 columns

# 选择保留bd_df表的数据
app_df.drop(columns=['fiBaseModel'], inplace=True)

1.3.3 fiSecondaryDesc

# 查看枚举值，后续需要做分箱合并处理
bd_df['fiSecondaryDesc'].value_counts()

C         44431
B         40165
G         37915
H         24729
E         21532
          ...  
BLGPPS        1
MSR           1
LC7A          1
CL            1
BH            1
Name: fiSecondaryDesc, Length: 177, dtype: int64

# 查看不一致部分
check_difference(bd_df, app_df, 'fiSecondaryDesc', 'MachineID')

	MachineID	fiSecondaryDesc_x	fiSecondaryDesc_y
0	1126363	NaN	NaN
1	1194089	NaN	NaN
3	1327630	NaN	NaN
6	1082797	NaN	NaN
7	1527216	NaN	NaN
...	...	...	...
412690	1823846	NaN	NaN
412691	1278794	NaN	NaN
412692	1792049	NaN	NaN
412694	1919104	NaN	NaN
412695	1918416	G	NaN

147009 rows × 3 columns

# 合并互补
bd_df = combine(bd_df, app_df, 'fiSecondaryDesc', 'MachineID')

1.3.4 fiModelSeries

# 查看枚举值，后续需要做分箱合并处理
bd_df['fiModelSeries'].value_counts()

II      13770
LC       9175
III      5351
-1       4646
-2       4033
        ...  
LL          1
6F          1
-2LC        1
-5A         1
VII         1
Name: fiModelSeries, Length: 123, dtype: int64

# 查看不一致部分
check_difference(bd_df, app_df, 'fiModelSeries', 'MachineID')

	MachineID	fiModelSeries_x	fiModelSeries_y
0	1126363	NaN	NaN
1	1194089	NaN	NaN
2	1473654	NaN	NaN
3	1327630	NaN	NaN
4	1336053	NaN	NaN
...	...	...	...
412693	1915521	NaN	NaN
412694	1919104	NaN	NaN
412695	1918416	NaN	NaN
412696	509560	NaN	NaN
412697	1869284	NaN	NaN

362366 rows × 3 columns

# 合并互补
bd_df = combine(bd_df, app_df, 'fiModelSeries', 'MachineID')

1.3.5 fiModelDescriptor

# 查看枚举值
bd_df['fiModelDescriptor'].value_counts()

L            16464
LGP          16143
LC           13295
XL            6700
6             2944
             ...  
K5               1
HighLift         1
High Lift        1
III              1
SL               1
Name: fiModelDescriptor, Length: 140, dtype: int64

# 查看不一致部分
check_difference(bd_df, app_df, 'fiModelDescriptor', 'MachineID')

	MachineID	fiModelDescriptor_x	fiModelDescriptor_y
0	1126363	NaN	NaN
1	1194089	NaN	NaN
2	1473654	NaN	NaN
3	1327630	NaN	NaN
4	1336053	NaN	NaN
...	...	...	...
412693	1915521	NaN	NaN
412694	1919104	NaN	NaN
412695	1918416	NaN	NaN
412696	509560	NaN	NaN
412697	1869284	NaN	NaN

342069 rows × 3 columns

# 合并互补
bd_df = combine(bd_df, app_df, 'fiModelDescriptor', 'MachineID')

1.3.6 ProductGroupDesc

# 查看枚举值
bd_df['ProductGroupDesc'].value_counts()

Track Excavators       104230
Track Type Tractors     82582
Backhoe Loaders         81401
Wheel Loader            73216
Skid Steer Loaders      45011
Motor Graders           26258
Name: ProductGroupDesc, dtype: int64

# 枚举值太多了，而且bd_df里的数据并无空值，这里就选择不互补了，直接保留bd_df的数据
app_df['ProductGroupDesc'].value_counts()

Track Excavators                 89094
Backhoe Loaders                  74074
Track Type Tractors              67362
Wheel Loader                     62426
Skid Steer Loaders               42121
Motor Graders                    22602
Track Loaders                      158
Articulated Trucks                 109
Ag Tractors                        103
Wheel Tractor Scraper              102
Off Highway Trucks                  81
Multi Terrain Loaders               66
Wheel Excavator                     66
Forklift                            53
Skidders                            46
Wheel Feller Buncher                31
Forestry Log Loaders                25
Pipelayers                          11
Telehandler                          9
Wheel Dozer                          7
Knuckleboom Loaders                  6
Track Feller Bunchers                6
Vibratory Double Drum Asphalt        4
Vibratory Single Drum Asphalt        4
Work Tool                            3
Pneumatic Tired Compactor            3
Compactors                           3
Vibratory Single Drum Pad            3
Tandem Roller Static                 2
Harvesters                           2
Forwarders                           2
Engine, Industrial OEM               2
Track Harvesters                     1
Delimber Forestry                    1
Asphalt/Concrete Pavers              1
Vibratory Single Drum Smooth         1
Crane/Dragline                       1
Forestry Excavators                  1
Cold Planers                         1
Name: ProductGroupDesc, dtype: int64

app_df.drop(columns=['ProductGroupDesc'], inplace=True)

1.3.7 MfgYear

# 两边同一个特征名字不一样，先改名
app_df.rename(columns={
   
   'MfgYear': 'YearMade'}, inplace=True)

# 查看枚举值，发现异常值1000
bd_df['YearMade'].value_counts()

1000    39391
2005    22096
1998    21751
2004    20914
1999    19274
        ...  
2012        1
1949        1
1942        1
2013        1
1937        1
Name: YearMade, Length: 73, dtype: int64

# 查看不一致部分
check_difference(bd_df, app_df, 'YearMade', 'MachineID')

	MachineID	YearMade_x	YearMade_y
52	1531656	1975	1987.0
58	1078132	1967	1966.0
71	1523610	1986	2007.0
98	1059447	1984	1996.0
122	1525180	1984	1985.0
...	...	...	...
412651	1631729	1000	1990.0
412654	1912106	1000	1996.0
412655	1846321	1000	2005.0
412667	1897535	1998	1999.0
412696	509560	1993	1990.0

36406 rows × 3 columns

这种不一致通常是因为bd_df里面有重复的MachineID，即同一种推土机被卖了多次，而app_df里只有唯一MachineID所导致，决定保留bd_df的数据，不动它

app_df.drop('YearMade', axis=1, inplace=True)

1.3.8 fiManufacturerID、PrimarySizeBasis、PrimaryLower、PrimaryUpper

这三个特征原训练集上都没有，直接联过去

# total_df = bd_df.copy()

total_df = pd.merge(bd_df, app_df, how='left', on='MachineID')

# 动力上限和下限留一个就足够区分了，取下限
total_df.drop(columns=['PrimaryUpper'], inplace=True)

total_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 412698 entries, 0 to 412697
Data columns (total 54 columns):
 #   Column                    Non-Null Count   Dtype         
---  ------                    --------------   -----         
 0   SalePrice                 412698 non-null  float64       
 1   MachineID                 412698 non-null  int64         
 2   ModelID                   412698 non-null  int64         
 3   datasource                412698 non-null  int64         
 4   auctioneerID              392562 non-null  float64       
 5   YearMade                  412698 non-null  int64         
 6   MachineHoursCurrentMeter  147504 non-null  float64       
 7   UsageBand                 73670 non-null   object        
 8   SaleDate                  412698 non-null  datetime64[ns]
 9   fiModelDesc               412698 non-null  object        
 10  fiBaseModel               412698 non-null  object        
 11  fiSecondaryDesc           412698 non-null  object        
 12  fiModelSeries             412698 non-null  object        
 13  fiModelDescriptor         412698 non-null  object        
 14  ProductSize               196093 non-null  object        
 15  state                     412698 non-null  object        
 16  ProductGroupDesc          412698 non-null  object        
 17  Drive_System              107087 non-null  object        
 18  Enclosure                 412364 non-null  object        
 19  Forks                     197715 non-null  object        
 20  Pad_Type                  81096 non-null   object        
 21  Ride_Control              152728 non-null  object        
 22  Stick                     81096 non-null   object        
 23  Transmission              188007 non-null  object        
 24  Turbocharged              81096 non-null   object        
 25  Blade_Extension           25983 non-null   object        
 26  Blade_Width               25983 non-null   object        
 27  Enclosure_Type            25983 non-null   object        
 28  Engine_Horsepower         25983 non-null   object        
 29  Hydraulics                330133 non-null  object        
 30  Pushblock                 25983 non-null   object        
 31  Ripper                    106945 non-null  object        
 32  Scarifier                 25994 non-null   object        
 33  Tip_Control               25983 non-null   object        
 34  Tire_Size                 97638 non-null   object        
 35  Coupler                   220679 non-null  object        
 36  Coupler_System            44974 non-null   object        
 37  Grouser_Tracks            44875 non-null   object        
 38  Hydraulics_Flow           44875 non-null   object        
 39  Track_Type                102193 non-null  object        
 40  Undercarriage_Pad_Width   102916 non-null  object        
 41  Stick_Length              102261 non-null  object        
 42  Thumb                     102332 non-null  object        
 43  Pattern_Changer           102261 non-null  object        
 44  Grouser_Type              102193 non-null  object        
 45  Backhoe_Mounting          80712 non-null   object        
 46  Blade_Type                81875 non-null   object        
 47  Travel_Controls           81877 non-null   object        
 48  Differential_Type         71564 non-null   object        
 49  Steering_Controls         71522 non-null   object        
 50  fiProductClassDesc        412698 non-null  object        
 51  fiManufacturerID          412698 non-null  int64         
 52  PrimarySizeBasis          407439 non-null  object        
 53  PrimaryLower              407439 non-null  float64       
dtypes: datetime64[ns](1), float64(4), int64(5), object(44)
memory usage: 173.2+ MB

raise KeyError

---------------------------------------------------------------------------

KeyError                                  Traceback (most recent call last)

Input In [903], in <cell line: 1>()
----> 1 raise KeyError


KeyError:

1.4 逐个查看特征

1.4.1 datasource

# 173怀疑是172写错成173
total_df['datasource'].value_counts()

132    260776
136     75491
149     33325
121     25191
172     17914
173         1
Name: datasource, dtype: int64

1.4.2 auctioneerID

# 后续要合并
total_df['auctioneerID'].value_counts()

1.0     192773
2.0      57441
3.0      30288
4.0      20877
99.0     12042
6.0      11950
7.0       7847
8.0       7419
5.0       7002
10.0      5876
9.0       4764
11.0      3823
12.0      3610
13.0      3068
18.0      2359
14.0      2277
20.0      2238
19.0      2074
16.0      1807
15.0      1742
21.0      1601
22.0      1429
24.0      1357
23.0      1322
17.0      1275
27.0      1150
25.0       959
28.0       860
26.0       796
0.0        536
Name: auctioneerID, dtype: int64

# 查看auctioneerID和价格的关系
temp = pd.pivot_table(total_df, index='auctioneerID', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('median', 'SalePrice'))

	count	mean	median
	SalePrice	SalePrice	SalePrice
auctioneerID
22.0	1429	18230.020994	14000.0
18.0	2359	18839.799491	14500.0
21.0	1601	19645.315428	15000.0
25.0	959	22478.519291	16000.0
9.0	4764	22515.003149	16775.0
14.0	2277	20804.797980	17000.0
17.0	1275	22974.941176	18000.0
12.0	3610	24762.936288	18250.0
28.0	860	28312.674419	18750.0
20.0	2238	24737.811439	19250.0
99.0	12042	26958.806112	19500.0
13.0	3068	27017.441330	20500.0
16.0	1807	26261.427781	20500.0
27.0	1150	27606.739130	22500.0
4.0	20877	29825.726876	23000.0
23.0	1322	30613.729198	23000.0
2.0	57441	29023.051670	23000.0
10.0	5876	29561.546971	23000.0
0.0	536	29979.850746	23000.0
15.0	1742	29986.337543	23500.0
5.0	7002	29150.217795	23500.0
24.0	1357	33041.156964	24500.0
1.0	192773	32684.870075	25000.0
8.0	7419	32477.978838	25000.0
11.0	3823	32707.088674	25500.0
3.0	30288	33596.962592	26000.0
6.0	11950	34708.070628	27000.0
26.0	796	36157.412060	28000.0
7.0	7847	36288.450236	28500.0
19.0	2074	42715.766635	34000.0

# 画图查看
temp.sort_values(by=('median', 'SalePrice'))[('median', 'SalePrice')].plot(kind='bar')
plt.show()

请添加图片描述

1.4.3 YearMade

total_df['YearMade'] = total_df['YearMade'].astype(int)

total_df['YearMade'].value_counts()

1000    39391
2005    22096
1998    21751
2004    20914
1999    19274
        ...  
2012        1
1949        1
1942        1
2013        1
1937        1
Name: YearMade, Length: 73, dtype: int64

plt.figure(figsize=(14,10))
sns.countplot(total_df['YearMade'])
plt.xticks(rotation=90)
plt.show()

请添加图片描述

第一台拖拉机1904年才发明出来，1904年前的属于异常数据。同时该数据集截至年份是2012年，大于2012的属于异常。异常值后续增加一个新的衍生变量YearMade_is_error区分。

生产时间大于销售时间的，暂时认为是提前订货，不处理。

1.4.4 MachineHoursCurrentMeter

# 查看枚举值
total_df['MachineHoursCurrentMeter'].value_counts()

0.0        73834
2000.0       124
1000.0       117
24.0         115
1500.0       101
           ...  
10834.0        1
3499.0         1
26270.0        1
26901.0        1
17920.0        1
Name: MachineHoursCurrentMeter, Length: 15633, dtype: int64

1.4.5 UsageBand

# 查看枚举值
total_df['UsageBand'].value_counts()

Medium    35832
Low       25311
High      12527
Name: UsageBand, dtype: int64

1.4.6 fiBaseModel

total_df['fiBaseModel'].value_counts()

580      20179
310      17886
D6       13527
416      12900
D5        9636
         ...  
56           1
B4230        1
IS30         1
MM555        1
WLK15        1
Name: fiBaseModel, Length: 1961, dtype: int64

1.4.7 fiSecondaryDesc

total_df['fiSecondaryDesc'].value_counts().head(10)

     136420
C     44658
B     40446
G     38139
H     24759
E     21944
D     20132
F      9454
K      8089
A      5968
Name: fiSecondaryDesc, dtype: int64

temp = pd.pivot_table(total_df, index='fiSecondaryDesc', values='SalePrice', aggfunc=

最低0.47元/天解锁文章