Used-Car Price Prediction and Model Tuning

Article link: https://blog.youkuaiyun.com/zhangycode/article/details/105253470

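A mathematical model of data describes the relationships between different parts of the data. Commonly used machine learning models include linear models such as linear regression; tree-based models such as decision trees, random forests, GBDT, XGBoost, and LightGBM; and convolutional neural networks, which are widely used in computer vision. Taking the used-car transaction price prediction competition as an example, this post introduces how to choose a model and tune its hyperparameters. The approach applies to traditional machine learning methods.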

Competition link:
https://tianchi.aliyun.com/competition/entrance/231784/information

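Loading the data. The data here has already gone through the feature engineering described in the previous post. It is read with pandas and then split into training features and labels.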

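import numpy as np
import pandas as pd

# Load the feature-engineered data produced in the previous post
sample_feature = pd.read_csv('data_for_tree.csv')
# Use every column except the target (price) and the brand/model columns
continuous_feature_names = [x for x in sample_feature.columns if x not in ['price', 'brand', 'model']]
# Drop rows with missing values and replace the '-' placeholder with 0
sample_feature = sample_feature.dropna().replace('-', 0).reset_index(drop=True)
sample_feature['notRepairedDamage'] = sample_feature['notRepairedDamage'].astype(np.float32)
train = sample_feature[continuous_feature_names + ['price']]

# Split into features and label
train_X = train[continuous_feature_names]
train_y = train['price']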
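
To make the cleaning steps concrete, here is a minimal sketch on a tiny made-up table. All values are invented for illustration and the `power` column is a hypothetical feature; only the column names `price`, `brand`, `model`, and `notRepairedDamage` come from the code above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for data_for_tree.csv; every value here is made up.
toy = pd.DataFrame({
    "price": [1500, 3200, 800, 4100],
    "brand": [1, 2, 1, 3],
    "model": [10.0, 20.0, 30.0, np.nan],  # one missing value
    "power": [60, 110, 75, 90],           # hypothetical numeric feature
    # Read from CSV, this column arrives as strings with '-' as a placeholder.
    "notRepairedDamage": pd.Series(["0.0", "-", "1.0", "0.0"], dtype=object),
})

# Same steps as above: drop rows with NaN, replace '-', then cast to float32.
toy = toy.dropna().replace("-", 0).reset_index(drop=True)
toy["notRepairedDamage"] = toy["notRepairedDamage"].astype(np.float32)

feature_names = [c for c in toy.columns if c not in ["price", "brand", "model"]]
toy_X, toy_y = toy[feature_names], toy["price"]
print(toy_X)
```

The row with the missing `model` value is dropped, and the `'-'` placeholder becomes `0.0` once the column is cast to a float type.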

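Building the model. A random forest is used here, called directly from scikit-learn. Other models such as XGBoost or LightGBM could be used instead: try several models, compare their results, and keep the one that performs best.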

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
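
As a rough illustration of comparing several candidates, the sketch below fits three scikit-learn regressors on synthetic data and reports each one's mean absolute error on a held-out split. The synthetic data, the split, and the hyperparameter values are all assumptions made for the example. XGBoost or LightGBM regressors could be added to the dictionary in the same way, but are left out here to keep the example free of extra dependencies.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for train_X / train_y (an assumption).
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

candidates = {
    "LinearRegression": LinearRegression(),
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=0),
    "GBDT": GradientBoostingRegressor(random_state=0),
}

# Fit each candidate on the training split and score it on the held-out split.
mae_by_model = {}
for name, est in candidates.items():
    est.fit(X_tr, y_tr)
    mae_by_model[name] = mean_absolute_error(y_val, est.predict(X_val))
    print(f"{name}: MAE = {mae_by_model[name]:.2f}")

best_name = min(mae_by_model, key=mae_by_model.get)
print("lowest MAE:", best_name)
```

A single held-out split is a noisy estimate, so the ranking it produces should be treated as a first impression rather than a final verdict.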

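Five-fold cross-validation is used. The training data is split into 5 parts; in each round, 4 parts serve as the training set and the remaining part as the validation set, rotating until every part has been the validation set once. The resulting scores estimate how well the model generalizes and can be used to find the hyperparameter values that give the best generalization performance. Here too, the scikit-learn functions are used directly.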

from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_error,  make_scorer

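def log_transfer(func):
    # Wrap a metric so that it is computed on log-transformed values
    def wrapper(y, yhat):
        # nan_to_num guards against NaN/inf produced by the log of non-positive predictions
        result = func(np.log(y), np.nan_to_num(np.log(yhat)))
        return result
    return wrapper

# One log-space MAE per fold (5 values in total)
scores = cross_val_score(model, X=train_X, y=train_y, verbose=1, cv=5, scoring=make_scorer(log_transfer(mean_absolute_error)))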
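
What `cross_val_score` does here can also be written out by hand, which makes the train/validation rotation explicit. The sketch below is an approximation under stated assumptions: the data is synthetic with strictly positive targets (so the logarithm is defined), and the names `X_demo`, `y_demo`, and `base_model` are hypothetical.

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

# Synthetic, strictly positive "prices" (an assumption for illustration).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = np.exp(1.0 + 0.5 * X_demo[:, 0] + 0.1 * rng.normal(size=100))

base_model = RandomForestRegressor(n_estimators=20, random_state=0)
fold_scores = []
fold_sizes = []
for train_idx, val_idx in KFold(n_splits=5).split(X_demo):
    # Train a fresh copy on 4 folds, then validate on the remaining fold.
    fold_model = clone(base_model).fit(X_demo[train_idx], y_demo[train_idx])
    pred = fold_model.predict(X_demo[val_idx])
    fold_scores.append(mean_absolute_error(np.log(y_demo[val_idx]), np.log(pred)))
    fold_sizes.append((len(train_idx), len(val_idx)))

print("fold sizes (train, val):", fold_sizes)
print("log-MAE per fold:", np.round(fold_scores, 4))
print("mean log-MAE:", round(float(np.mean(fold_scores)), 4))
```

Each sample lands in the validation fold exactly once, so the mean of the five scores uses every row for evaluation without ever scoring a model on data it was trained on.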

Output the results

result = dict()
model_name = str(model)
result[model_name] = scores
result = pd.DataFrame(result)
result.index = ['cv' + str(x) for x in range(1, 6)]
print(result)
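
The same cross-validated score can drive a hyperparameter search. The following is a minimal sketch using scikit-learn's `GridSearchCV`, not a tuned configuration: the data is synthetic, the grid values are hypothetical, and `log_mae` is a helper written for this example that mirrors `log_transfer(mean_absolute_error)`. One detail matters: the scorer above is built with the default `greater_is_better=True`, which is fine for reporting per-fold errors, but a search needs `greater_is_better=False` so that it selects the lowest error.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer, mean_absolute_error
from sklearn.model_selection import GridSearchCV

def log_mae(y_true, y_pred):
    """MAE in log space (hypothetical helper mirroring log_transfer)."""
    return mean_absolute_error(np.log(y_true), np.nan_to_num(np.log(y_pred)))

# Synthetic, strictly positive targets (an assumption for illustration).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 4))
y_demo = np.exp(1.0 + 0.5 * X_demo[:, 0] + 0.1 * rng.normal(size=120))

# Hypothetical grid; real ranges depend on the data set.
param_grid = {"max_depth": [3, None], "min_samples_leaf": [1, 5]}

search = GridSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=0),
    param_grid=param_grid,
    cv=5,
    # greater_is_better=False negates the error so that "higher is better" holds.
    scoring=make_scorer(log_mae, greater_is_better=False),
)
search.fit(X_demo, y_demo)

print("best params:", search.best_params_)
print("best CV log-MAE:", -search.best_score_)
```

Because the scorer is negated internally, `best_score_` is a negative number, and flipping its sign recovers the log-space MAE.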