【Python机器学习】回归模型:推土机售价预测

文章目录

使用机器学习预测推土机的售价

1. 定义问题

考虑到推土机的特性,利用过去的数据,我们能多大程度上预测它未来的价格?

2. 数据来源

kaggle:https://www.kaggle.com/competitions/bluebook-for-bulldozers/overview

The data for this competition is split into three parts:

  • Train.csv is the training set, which contains data through the end of 2011.
  • Valid.csv is the validation set, which contains data from January 1, 2012 - April 30, 2012 You make predictions on this set throughout the majority of the competition. Your score on this set is used to create the public leaderboard.
  • Test.csv is the test set, which won’t be released until the last week of the competition. It contains data from May 1, 2012 - November 2012. Your score on the test set determines your final rank for the competition.

3. 评价标准

The evaluation metric for this competition is the RMSLE (root mean squared log error) between the actual and predicted auction prices.

更多信息:https://www.kaggle.com/competitions/bluebook-for-bulldozers/overview/evaluation

注意:多数回归模型的评价标准都是减小误差。比如这次的目标就是最小化RMSLE。

4. 使用的特征

特征过多,请自行进入kaggle项目主页查看。或点击如下谷歌表格链接:https://docs.google.com/spreadsheets/d/1EIbdGa4S_46USXgg0OHX5jgTc8ld9fTHwPyi_VOV1as/edit#gid=0

零、导入模块

# EDA
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
sns.set()
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
%config InlineBackend.figure_config = 'svg'
warnings.filterwarnings("ignore")

# 数据预处理
from sklearn.preprocessing import LabelEncoder

# sklearn模型
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

# 模型评估
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_log_error, mean_absolute_error, mean_squared_error, r2_score

一、EDA

bulldozer_df = pd.read_csv('bluebook-for-bulldozers/TrainAndValid.csv',
                           low_memory = False)
appendix_df = pd.read_csv('bluebook-for-bulldozers/Machine_Appendix.csv',
                           low_memory = False)

1.1 查看基本信息

# 查看各字段类型
bulldozer_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 412698 entries, 0 to 412697
Data columns (total 53 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   SalesID                   412698 non-null  int64  
 1   SalePrice                 412698 non-null  float64
 2   MachineID                 412698 non-null  int64  
 3   ModelID                   412698 non-null  int64  
 4   datasource                412698 non-null  int64  
 5   auctioneerID              392562 non-null  float64
 6   YearMade                  412698 non-null  int64  
 7   MachineHoursCurrentMeter  147504 non-null  float64
 8   UsageBand                 73670 non-null   object 
 9   saledate                  412698 non-null  object 
 10  fiModelDesc               412698 non-null  object 
 11  fiBaseModel               412698 non-null  object 
 12  fiSecondaryDesc           271971 non-null  object 
 13  fiModelSeries             58667 non-null   object 
 14  fiModelDescriptor         74816 non-null   object 
 15  ProductSize               196093 non-null  object 
 16  fiProductClassDesc        412698 non-null  object 
 17  state                     412698 non-null  object 
 18  ProductGroup              412698 non-null  object 
 19  ProductGroupDesc          412698 non-null  object 
 20  Drive_System              107087 non-null  object 
 21  Enclosure                 412364 non-null  object 
 22  Forks                     197715 non-null  object 
 23  Pad_Type                  81096 non-null   object 
 24  Ride_Control              152728 non-null  object 
 25  Stick                     81096 non-null   object 
 26  Transmission              188007 non-null  object 
 27  Turbocharged              81096 non-null   object 
 28  Blade_Extension           25983 non-null   object 
 29  Blade_Width               25983 non-null   object 
 30  Enclosure_Type            25983 non-null   object 
 31  Engine_Horsepower         25983 non-null   object 
 32  Hydraulics                330133 non-null  object 
 33  Pushblock                 25983 non-null   object 
 34  Ripper                    106945 non-null  object 
 35  Scarifier                 25994 non-null   object 
 36  Tip_Control               25983 non-null   object 
 37  Tire_Size                 97638 non-null   object 
 38  Coupler                   220679 non-null  object 
 39  Coupler_System            44974 non-null   object 
 40  Grouser_Tracks            44875 non-null   object 
 41  Hydraulics_Flow           44875 non-null   object 
 42  Track_Type                102193 non-null  object 
 43  Undercarriage_Pad_Width   102916 non-null  object 
 44  Stick_Length              102261 non-null  object 
 45  Thumb                     102332 non-null  object 
 46  Pattern_Changer           102261 non-null  object 
 47  Grouser_Type              102193 non-null  object 
 48  Backhoe_Mounting          80712 non-null   object 
 49  Blade_Type                81875 non-null   object 
 50  Travel_Controls           81877 non-null   object 
 51  Differential_Type         71564 non-null   object 
 52  Steering_Controls         71522 non-null   object 
dtypes: float64(3), int64(5), object(45)
memory usage: 166.9+ MB
# 查看缺失值
bulldozer_df.isna().sum()
SalesID                          0
SalePrice                        0
MachineID                        0
ModelID                          0
datasource                       0
auctioneerID                 20136
YearMade                         0
MachineHoursCurrentMeter    265194
UsageBand                   339028
saledate                         0
fiModelDesc                      0
fiBaseModel                      0
fiSecondaryDesc             140727
fiModelSeries               354031
fiModelDescriptor           337882
ProductSize                 216605
fiProductClassDesc               0
state                            0
ProductGroup                     0
ProductGroupDesc                 0
Drive_System                305611
Enclosure                      334
Forks                       214983
Pad_Type                    331602
Ride_Control                259970
Stick                       331602
Transmission                224691
Turbocharged                331602
Blade_Extension             386715
Blade_Width                 386715
Enclosure_Type              386715
Engine_Horsepower           386715
Hydraulics                   82565
Pushblock                   386715
Ripper                      305753
Scarifier                   386704
Tip_Control                 386715
Tire_Size                   315060
Coupler                     192019
Coupler_System              367724
Grouser_Tracks              367823
Hydraulics_Flow             367823
Track_Type                  310505
Undercarriage_Pad_Width     309782
Stick_Length                310437
Thumb                       310366
Pattern_Changer             310437
Grouser_Type                310505
Backhoe_Mounting            331986
Blade_Type                  330823
Travel_Controls             330821
Differential_Type           341134
Steering_Controls           341176
dtype: int64
# 查看标签分布
bulldozer_df['SalePrice'].hist()
<AxesSubplot:>​    

请添加图片描述

1.2 特征类型转换

# 改为帕斯卡命名
bulldozer_df.rename(columns={
   
   'saledate': 'SaleDate'}, inplace=True)
bulldozer_df['SaleDate'] = pd.to_datetime(bulldozer_df['SaleDate'])
# 按时间查看售价,只看最近几年的,没有明显的上升趋势,只有上下波动
plt.figure(figsize=(20,15))
pd.pivot_table(bulldozer_df[200000::1000], index='SaleDate', values='SalePrice').plot()
plt.show()

请添加图片描述

2005年前后半年和2008年整个一年销量都比较差,除此之外看不出太多信息

# 将数据集按时间排序
bulldozer_df.sort_values(by='SaleDate', inplace=True)
bulldozer_df['SaleDate'].head(20)
205615   1989-01-17
274835   1989-01-31
141296   1989-01-31
212552   1989-01-31
62755    1989-01-31
54653    1989-01-31
81383    1989-01-31
204924   1989-01-31
135376   1989-01-31
113390   1989-01-31
113394   1989-01-31
116419   1989-01-31
32138    1989-01-31
127610   1989-01-31
76171    1989-01-31
127000   1989-01-31
128130   1989-01-31
127626   1989-01-31
55455    1989-01-31
55454    1989-01-31
Name: SaleDate, dtype: datetime64[ns]

1.3 联表+特征初筛

这个数据集比较特殊,还有个appendix表,里面是一些推土机的配件信息,而且这个信息和train表有重复特征,重复特征里面还有匹配不上的情况,现在先联表上来看看

# 制作数据集副本,这是为了方便对数据集做了什么操作后,仍然可以获取原始数据,而不用从头读数据
bd_df = bulldozer_df.copy()
app_df = appendix_df.copy()
# SalesID列丢弃
bd_df.drop('SalesID', axis=1, inplace=True)
# 定义一个查看出入的函数
def check_difference(df1, df2, target_col, on_col):
    temp_df = pd.merge(df1[[on_col, target_col]], df2[[on_col, target_col]], how='left', on=on_col)
    return temp_df[(temp_df[target_col+'_x'] != temp_df[target_col+'_y'])]
# 定义一个合并时互补的函数,冲突时保留df1的数据
def combine(df1, df2, target_col, on_col):
    temp_df0 = pd.merge(df1[[on_col, target_col]], df2[[on_col, target_col]], how='left', on=on_col)
    temp_df0.fillna('', inplace=True)
    temp_df1=temp_df0[(temp_df0[target_col+'_x']== '') & (temp_df0[target_col+'_y']!='')]
    temp_df0[target_col+'_x'].loc[temp_df1.index] = temp_df1[target_col+'_y']
    df1[target_col] = temp_df0[target_col+'_x']
    df2.drop(columns=[target_col], inplace=True)
    return df1

1.3.1 删除包含重复信息的特征

ProductGroup是ProductGroupDesc的首字母缩写版,选择保留后者

fiManufacturerID和fiManufacturerDesc包含的信息一样,选择保留前者

bd_df.drop(columns=['ProductGroup', 'fiProductClassDesc'], inplace=True)
app_df.drop(columns=['ModelID', 'fiModelDesc', 'ProductGroup', 'fiManufacturerDesc'], inplace=True)

1.3.2 fiBaseModel

# 查看枚举值,后续需要做分箱合并处理
bd_df['fiBaseModel'].value_counts()
580      20179
310      17886
D6       13527
416      12900
D5        9636
         ...  
56           1
B4230        1
IS30         1
MM555        1
WLK15        1
Name: fiBaseModel, Length: 1961, dtype: int64
# 查看出入部分有无空值
check_difference(bd_df, app_df, 'fiBaseModel', 'MachineID').isna().sum()
MachineID        0
fiBaseModel_x    0
fiBaseModel_y    0
dtype: int64
# 查看出入部分
check_difference(bd_df, app_df, 'fiBaseModel', 'MachineID')
MachineID fiBaseModel_x fiBaseModel_y
71 1523610 WA150 HD465
98 1059447 WA300 PC100
123 1303779 415 862
128 1340389 MS240 MS120
179 1208516 D31 PC100
... ... ... ...
412474 2308891 1845 75
412484 2287735 T135 T133
412509 2292146 450 465
412655 1846321 TB135 TB125
412695 1918416 337 530

12452 rows × 3 columns

# 选择保留bd_df表的数据
app_df.drop(columns=['fiBaseModel'], inplace=True)

1.3.3 fiSecondaryDesc

# 查看枚举值,后续需要做分箱合并处理
bd_df['fiSecondaryDesc'].value_counts()
C         44431
B         40165
G         37915
H         24729
E         21532
          ...  
BLGPPS        1
MSR           1
LC7A          1
CL            1
BH            1
Name: fiSecondaryDesc, Length: 177, dtype: int64
# 查看不一致部分
check_difference(bd_df, app_df, 'fiSecondaryDesc', 'MachineID')
MachineID fiSecondaryDesc_x fiSecondaryDesc_y
0 1126363 NaN NaN
1 1194089 NaN NaN
3 1327630 NaN NaN
6 1082797 NaN NaN
7 1527216 NaN NaN
... ... ... ...
412690 1823846 NaN NaN
412691 1278794 NaN NaN
412692 1792049 NaN NaN
412694 1919104 NaN NaN
412695 1918416 G NaN

147009 rows × 3 columns

# 合并互补
bd_df = combine(bd_df, app_df, 'fiSecondaryDesc', 'MachineID')

1.3.4 fiModelSeries

# 查看枚举值,后续需要做分箱合并处理
bd_df['fiModelSeries'].value_counts()
II      13770
LC       9175
III      5351
-1       4646
-2       4033
        ...  
LL          1
6F          1
-2LC        1
-5A         1
VII         1
Name: fiModelSeries, Length: 123, dtype: int64
# 查看不一致部分
check_difference(bd_df, app_df, 'fiModelSeries', 'MachineID')
MachineID fiModelSeries_x fiModelSeries_y
0 1126363 NaN NaN
1 1194089 NaN NaN
2 1473654 NaN NaN
3 1327630 NaN NaN
4 1336053 NaN NaN
... ... ... ...
412693 1915521 NaN NaN
412694 1919104 NaN NaN
412695 1918416 NaN NaN
412696 509560 NaN NaN
412697 1869284 NaN NaN

362366 rows × 3 columns

# 合并互补
bd_df = combine(bd_df, app_df, 'fiModelSeries', 'MachineID')

1.3.5 fiModelDescriptor

# 查看枚举值
bd_df['fiModelDescriptor'].value_counts()
L            16464
LGP          16143
LC           13295
XL            6700
6             2944
             ...  
K5               1
HighLift         1
High Lift        1
III              1
SL               1
Name: fiModelDescriptor, Length: 140, dtype: int64
# 查看不一致部分
check_difference(bd_df, app_df, 'fiModelDescriptor', 'MachineID')
MachineID fiModelDescriptor_x fiModelDescriptor_y
0 1126363 NaN NaN
1 1194089 NaN NaN
2 1473654 NaN NaN
3 1327630 NaN NaN
4 1336053 NaN NaN
... ... ... ...
412693 1915521 NaN NaN
412694 1919104 NaN NaN
412695 1918416 NaN NaN
412696 509560 NaN NaN
412697 1869284 NaN NaN

342069 rows × 3 columns

# 合并互补
bd_df = combine(bd_df, app_df, 'fiModelDescriptor', 'MachineID')

1.3.6 ProductGroupDesc

# 查看枚举值
bd_df['ProductGroupDesc'].value_counts()
Track Excavators       104230
Track Type Tractors     82582
Backhoe Loaders         81401
Wheel Loader            73216
Skid Steer Loaders      45011
Motor Graders           26258
Name: ProductGroupDesc, dtype: int64
# 枚举值太多了,而且bd_df里的数据并无空值,这里就选择不互补了,直接保留bd_df的数据
app_df['ProductGroupDesc'].value_counts()
Track Excavators                 89094
Backhoe Loaders                  74074
Track Type Tractors              67362
Wheel Loader                     62426
Skid Steer Loaders               42121
Motor Graders                    22602
Track Loaders                      158
Articulated Trucks                 109
Ag Tractors                        103
Wheel Tractor Scraper              102
Off Highway Trucks                  81
Multi Terrain Loaders               66
Wheel Excavator                     66
Forklift                            53
Skidders                            46
Wheel Feller Buncher                31
Forestry Log Loaders                25
Pipelayers                          11
Telehandler                          9
Wheel Dozer                          7
Knuckleboom Loaders                  6
Track Feller Bunchers                6
Vibratory Double Drum Asphalt        4
Vibratory Single Drum Asphalt        4
Work Tool                            3
Pneumatic Tired Compactor            3
Compactors                           3
Vibratory Single Drum Pad            3
Tandem Roller Static                 2
Harvesters                           2
Forwarders                           2
Engine, Industrial OEM               2
Track Harvesters                     1
Delimber Forestry                    1
Asphalt/Concrete Pavers              1
Vibratory Single Drum Smooth         1
Crane/Dragline                       1
Forestry Excavators                  1
Cold Planers                         1
Name: ProductGroupDesc, dtype: int64
app_df.drop(columns=['ProductGroupDesc'], inplace=True)

1.3.7 MfgYear

# 两边同一个特征名字不一样,先改名
app_df.rename(columns={
   
   'MfgYear': 'YearMade'}, inplace=True)
# 查看枚举值,发现异常值1000
bd_df['YearMade'].value_counts()
1000    39391
2005    22096
1998    21751
2004    20914
1999    19274
        ...  
2012        1
1949        1
1942        1
2013        1
1937        1
Name: YearMade, Length: 73, dtype: int64
# 查看不一致部分
check_difference(bd_df, app_df, 'YearMade', 'MachineID')
MachineID YearMade_x YearMade_y
52 1531656 1975 1987.0
58 1078132 1967 1966.0
71 1523610 1986 2007.0
98 1059447 1984 1996.0
122 1525180 1984 1985.0
... ... ... ...
412651 1631729 1000 1990.0
412654 1912106 1000 1996.0
412655 1846321 1000 2005.0
412667 1897535 1998 1999.0
412696 509560 1993 1990.0

36406 rows × 3 columns

这种不一致通常是因为bd_df里面有重复的MachineID,即同一种推土机被卖了多次,而app_df里只有唯一MachineID所导致,决定保留bd_df的数据,不动它

app_df.drop('YearMade', axis=1, inplace=True)

1.3.8 fiManufacturerID、PrimarySizeBasis、PrimaryLower、PrimaryUpper

这三个特征原训练集上都没有,直接联过去

# total_df = bd_df.copy()
total_df = pd.merge(bd_df, app_df, how='left', on='MachineID')
# 动力上限和下限留一个就足够区分了,取下限
total_df.drop(columns=['PrimaryUpper'], inplace=True)
total_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 412698 entries, 0 to 412697
Data columns (total 54 columns):
 #   Column                    Non-Null Count   Dtype         
---  ------                    --------------   -----         
 0   SalePrice                 412698 non-null  float64       
 1   MachineID                 412698 non-null  int64         
 2   ModelID                   412698 non-null  int64         
 3   datasource                412698 non-null  int64         
 4   auctioneerID              392562 non-null  float64       
 5   YearMade                  412698 non-null  int64         
 6   MachineHoursCurrentMeter  147504 non-null  float64       
 7   UsageBand                 73670 non-null   object        
 8   SaleDate                  412698 non-null  datetime64[ns]
 9   fiModelDesc               412698 non-null  object        
 10  fiBaseModel               412698 non-null  object        
 11  fiSecondaryDesc           412698 non-null  object        
 12  fiModelSeries             412698 non-null  object        
 13  fiModelDescriptor         412698 non-null  object        
 14  ProductSize               196093 non-null  object        
 15  state                     412698 non-null  object        
 16  ProductGroupDesc          412698 non-null  object        
 17  Drive_System              107087 non-null  object        
 18  Enclosure                 412364 non-null  object        
 19  Forks                     197715 non-null  object        
 20  Pad_Type                  81096 non-null   object        
 21  Ride_Control              152728 non-null  object        
 22  Stick                     81096 non-null   object        
 23  Transmission              188007 non-null  object        
 24  Turbocharged              81096 non-null   object        
 25  Blade_Extension           25983 non-null   object        
 26  Blade_Width               25983 non-null   object        
 27  Enclosure_Type            25983 non-null   object        
 28  Engine_Horsepower         25983 non-null   object        
 29  Hydraulics                330133 non-null  object        
 30  Pushblock                 25983 non-null   object        
 31  Ripper                    106945 non-null  object        
 32  Scarifier                 25994 non-null   object        
 33  Tip_Control               25983 non-null   object        
 34  Tire_Size                 97638 non-null   object        
 35  Coupler                   220679 non-null  object        
 36  Coupler_System            44974 non-null   object        
 37  Grouser_Tracks            44875 non-null   object        
 38  Hydraulics_Flow           44875 non-null   object        
 39  Track_Type                102193 non-null  object        
 40  Undercarriage_Pad_Width   102916 non-null  object        
 41  Stick_Length              102261 non-null  object        
 42  Thumb                     102332 non-null  object        
 43  Pattern_Changer           102261 non-null  object        
 44  Grouser_Type              102193 non-null  object        
 45  Backhoe_Mounting          80712 non-null   object        
 46  Blade_Type                81875 non-null   object        
 47  Travel_Controls           81877 non-null   object        
 48  Differential_Type         71564 non-null   object        
 49  Steering_Controls         71522 non-null   object        
 50  fiProductClassDesc        412698 non-null  object        
 51  fiManufacturerID          412698 non-null  int64         
 52  PrimarySizeBasis          407439 non-null  object        
 53  PrimaryLower              407439 non-null  float64       
dtypes: datetime64[ns](1), float64(4), int64(5), object(44)
memory usage: 173.2+ MB
raise KeyError
---------------------------------------------------------------------------

KeyError                                  Traceback (most recent call last)

Input In [903], in <cell line: 1>()
----> 1 raise KeyError


KeyError: 

1.4 逐个查看特征

1.4.1 datasource

# 173怀疑是172写错成173
total_df['datasource'].value_counts()
132    260776
136     75491
149     33325
121     25191
172     17914
173         1
Name: datasource, dtype: int64

1.4.2 auctioneerID

# 后续要合并
total_df['auctioneerID'].value_counts()
1.0     192773
2.0      57441
3.0      30288
4.0      20877
99.0     12042
6.0      11950
7.0       7847
8.0       7419
5.0       7002
10.0      5876
9.0       4764
11.0      3823
12.0      3610
13.0      3068
18.0      2359
14.0      2277
20.0      2238
19.0      2074
16.0      1807
15.0      1742
21.0      1601
22.0      1429
24.0      1357
23.0      1322
17.0      1275
27.0      1150
25.0       959
28.0       860
26.0       796
0.0        536
Name: auctioneerID, dtype: int64
# 查看auctioneerID和价格的关系
temp = pd.pivot_table(total_df, index='auctioneerID', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('median', 'SalePrice'))
count mean median
SalePrice SalePrice SalePrice
auctioneerID
22.0 1429 18230.020994 14000.0
18.0 2359 18839.799491 14500.0
21.0 1601 19645.315428 15000.0
25.0 959 22478.519291 16000.0
9.0 4764 22515.003149 16775.0
14.0 2277 20804.797980 17000.0
17.0 1275 22974.941176 18000.0
12.0 3610 24762.936288 18250.0
28.0 860 28312.674419 18750.0
20.0 2238 24737.811439 19250.0
99.0 12042 26958.806112 19500.0
13.0 3068 27017.441330 20500.0
16.0 1807 26261.427781 20500.0
27.0 1150 27606.739130 22500.0
4.0 20877 29825.726876 23000.0
23.0 1322 30613.729198 23000.0
2.0 57441 29023.051670 23000.0
10.0 5876 29561.546971 23000.0
0.0 536 29979.850746 23000.0
15.0 1742 29986.337543 23500.0
5.0 7002 29150.217795 23500.0
24.0 1357 33041.156964 24500.0
1.0 192773 32684.870075 25000.0
8.0 7419 32477.978838 25000.0
11.0 3823 32707.088674 25500.0
3.0 30288 33596.962592 26000.0
6.0 11950 34708.070628 27000.0
26.0 796 36157.412060 28000.0
7.0 7847 36288.450236 28500.0
19.0 2074 42715.766635 34000.0
# 画图查看
temp.sort_values(by=('median', 'SalePrice'))[('median', 'SalePrice')].plot(kind='bar')
plt.show()

请添加图片描述

1.4.3 YearMade

total_df['YearMade'] = total_df['YearMade'].astype(int)
total_df['YearMade'].value_counts()
1000    39391
2005    22096
1998    21751
2004    20914
1999    19274
        ...  
2012        1
1949        1
1942        1
2013        1
1937        1
Name: YearMade, Length: 73, dtype: int64
plt.figure(figsize=(14,10))
sns.countplot(total_df['YearMade'])
plt.xticks(rotation=90)
plt.show()

请添加图片描述

第一台拖拉机1904年才发明出来,1904年前的属于异常数据。同时该数据集截至年份是2012年,大于2012的属于异常。异常值后续增加一个新的衍生变量YearMade_is_error区分。

生产时间大于销售时间的,暂时认为是提前订货,不处理。

1.4.4 MachineHoursCurrentMeter

# 查看枚举值
total_df['MachineHoursCurrentMeter'].value_counts()
0.0        73834
2000.0       124
1000.0       117
24.0         115
1500.0       101
           ...  
10834.0        1
3499.0         1
26270.0        1
26901.0        1
17920.0        1
Name: MachineHoursCurrentMeter, Length: 15633, dtype: int64

1.4.5 UsageBand

# 查看枚举值
total_df['UsageBand'].value_counts()
Medium    35832
Low       25311
High      12527
Name: UsageBand, dtype: int64

1.4.6 fiBaseModel

total_df['fiBaseModel'].value_counts()
580      20179
310      17886
D6       13527
416      12900
D5        9636
         ...  
56           1
B4230        1
IS30         1
MM555        1
WLK15        1
Name: fiBaseModel, Length: 1961, dtype: int64

1.4.7 fiSecondaryDesc

total_df['fiSecondaryDesc'].value_counts().head(10)
     136420
C     44658
B     40446
G     38139
H     24759
E     21944
D     20132
F      9454
K      8089
A      5968
Name: fiSecondaryDesc, dtype: int64
temp = pd.pivot_table(total_df, index='fiSecondaryDesc', values='SalePrice', aggfunc
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

Sprite.Nym

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值