Datawhale Data Mining from Scratch - Task 4: Modeling and Hyperparameter Tuning

Latest recommended article published on 2020-11-24 04:17:47

License: CC 4.0 BY-SA

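This article walks through the modeling and tuning steps of a data mining project: learning objectives, a content overview, model principles, and code examples. It focuses on linear regression, decision trees, GBDT, XGBoost, and LightGBM, together with model evaluation and tuning methods, and it lists related learning resources and recommended books.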

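IV. Modeling and Hyperparameter Tuning

Tip: This is the Task 4 (modeling and tuning) part of the Data Mining from Scratch series. It introduces a range of models along with strategies for evaluating and tuning them. Feedback and discussion are welcome.

Competition: Data Mining from Scratch - Used Car Transaction Price Prediction
URL: https://tianchi.aliyun.com/competition/entrance/231784/introduction?spm=5176.12281957.1004.1.38b02448ausjSX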

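4.1 Learning Objectives

Understand the commonly used machine learning models, and master the workflow for building and tuning them
Complete the corresponding learning check-in task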

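4.2 Content Overview

Linear regression model:
    Feature requirements of linear regression;
    Handling long-tailed distributions;
    Understanding the linear regression model;
Model performance validation:
    Evaluation functions and objective functions;
    Cross-validation;
    Leave-one-out validation;
    Validation for time-series problems;
    Plotting learning curves;
    Plotting validation curves;
Embedded feature selection:
    Lasso regression;
    Ridge regression;
    Decision trees;
Model comparison:
    Common linear models;
    Common nonlinear models;
Hyperparameter tuning:
    Greedy tuning;
    Grid search;
    Bayesian optimization;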

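4.3 Background Theory and Recommended Reading

Because the theory behind these algorithms takes considerable space to explain, this article instead points beginners to a few blog posts and textbooks.

4.3.1 Linear regression

https://zhuanlan.zhihu.com/p/49480391

4.3.2 Decision trees

https://zhuanlan.zhihu.com/p/65304798

4.3.3 GBDT

https://zhuanlan.zhihu.com/p/45145899

4.3.4 XGBoost

https://zhuanlan.zhihu.com/p/86816771

4.3.5 LightGBM

https://zhuanlan.zhihu.com/p/89360721

4.3.6 Recommended books:

Machine Learning (《机器学习》) https://book.douban.com/subject/26708119/
Statistical Learning Methods (《统计学习方法》) https://book.douban.com/subject/10590856/
Python vs. Machine Learning (《Python大战机器学习》) https://book.douban.com/subject/26987890/
Feature Engineering for Machine Learning (《面向机器学习的特征工程》) https://book.douban.com/subject/26826639/
Interviews with Data Scientists (《数据科学家访谈录》) https://book.douban.com/subject/30129410/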

4.4 Code Examples

4.4.1 Reading the data

import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

The reduce_mem_usage function reduces the memory the data occupies by converting each column to a smaller data type.

def reduce_mem_usage(df):
    """Iterate through all the columns of a dataframe and modify the data type
    to reduce memory usage.
    """
    # memory_usage() returns bytes; divide by 1024**2 so the printed unit really is MB
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # Pick the narrowest integer type whose range contains the column
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # Note: float16 keeps only about 3 significant decimal digits,
                # so this branch trades precision for memory
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            # Non-numeric (object) columns are stored as categoricals
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df

sample_feature = reduce_mem_usage(pd.read_csv('data_for_tree.csv'))
# Every column except the label and the categorical brand/model columns
continuous_feature_names = [x for x in sample_feature.columns if x not in ['price', 'brand', 'model']]
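
As a side note, pandas offers a built-in way to do the kind of downcasting that reduce_mem_usage performs by hand. The sketch below is an illustrative alternative, not the tutorial's method: it uses pd.to_numeric with the downcast argument on a made-up toy DataFrame (the column names small_int and ratio are hypothetical). Unlike the function above, this route never goes below float32 for floating-point columns.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data): values that fit comfortably in small dtypes.
toy = pd.DataFrame({
    "small_int": np.arange(100, dtype=np.int64),
    "ratio": np.arange(100) * 0.5,  # float64, exactly representable in float32
})
before = toy.memory_usage(deep=True).sum()

# Downcast each numeric column to the smallest dtype that still holds its values.
toy["small_int"] = pd.to_numeric(toy["small_int"], downcast="integer")
toy["ratio"] = pd.to_numeric(toy["ratio"], downcast="float")
after = toy.memory_usage(deep=True).sum()

print(toy.dtypes)
print(f"{before} bytes -> {after} bytes")
```

Checking the dtypes and the byte counts before and after is a quick way to confirm that a downcast actually happened and that no column was silently left at 64 bits.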

4.4.2 Linear regression, five-fold cross-validation, and simulating a real business setting

# Drop rows with missing values and replace the '-' placeholder with 0
sample_feature = sample_feature.dropna().replace('-', 0).reset_index(drop=True)
sample_feature['notRepairedDamage'] = sample_feature['notRepairedDamage'].astype(np.float32)
train = sample_feature[continuous_feature_names + ['price']]

train_X = train[continuous_feature_names]
train_y = train['price']

4.4.2-1 Simple modeling

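from sklearn.linear_model import LinearRegression
# The original call was LinearRegression(normalize=True). That argument was removed in
# scikit-learn 1.2. Ordinary least squares is invariant to feature scaling, so the
# predictions should be essentially unchanged without it.
model = LinearRegression()
model = model.fit(train_X, train_y)

Inspect the intercept and the weights (coef) of the trained linear regression model: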

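'intercept:' + str(model.intercept_)

# Feature weights, largest first
sorted(dict(zip(continuous_feature_names, model.coef_)).items(), key=lambda x: x[1], reverse=True)
from matplotlib import pyplot as plt
# Draw 50 random row indices to keep the scatter plot readable
subsample_index = np.random.randint(low=0, high=len(train_y), size=50)

Plot the value of feature v_9 against the label. The figure shows that the model's predictions (blue points) are distributed quite differently from the true labels (black points), and some predictions fall below 0, which indicates that the model has problems.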

plt.scatter(train_X['v_9'][subsample_index], train_y[subsample_index], color='black')
plt.scatter(train_X['v_9'][subsample_index], model.predict(train_X.loc[subsample_index]), color='blue')
plt.xlabel('v_9')
plt.ylabel('price')
plt.legend(['True Price', 'Predicted Price'], loc='upper right')
print('The predicted price is obviously different from the true price')
plt.show()
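
The plot shows the problem visually; it can also be quantified. The following is a minimal sketch on synthetic data, not the competition data: x and y are made-up stand-ins for a feature and a strictly positive, long-tailed price. It fits a plain linear regression and counts how many predictions fall below zero even though no true value does.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# Hypothetical single feature and a strictly positive, long-tailed target.
x = rng.normal(size=(2000, 1))
y = np.exp(1.5 * x[:, 0] + rng.normal(scale=0.3, size=2000))

model_demo = LinearRegression().fit(x, y)
pred = model_demo.predict(x)

# A straight line fitted to a curved, positive target must dip below zero somewhere.
n_negative = int((pred < 0).sum())
mae = float(np.mean(np.abs(y - pred)))

print("smallest true value:", y.min())
print("negative predictions:", n_negative, "of", len(pred))
print("training MAE:", mae)
```

The same two checks, the count of negative predictions and the mean absolute error, could be applied to model.predict(train_X) from the section above.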

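Plotting the data shows that the label (price) follows a long-tailed distribution, which works against our modeling. Many models assume that the error term is normally distributed, and long-tailed data violates that assumption. Reference blog: https://blog.youkuaiyun.com/Noob_daniel/article/details/76087829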

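import seaborn as sns
print('It is clear to see the price shows a typical exponential distribution')
plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
# sns.distplot is deprecated in recent seaborn releases; histplot with kde=True
# draws a comparable histogram with a density curve
sns.histplot(train_y, kde=True)
plt.subplot(1,2,2)
# Same plot restricted to prices below the 90th percentile
sns.histplot(train_y[train_y < np.quantile(train_y, 0.9)], kde=True)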
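
One common remedy for a long-tailed label is a log transform. The sketch below assumes log(1 + price) as the transform and uses a synthetic log-normal series as a stand-in for price; neither the data nor the variable names come from the competition. It compares skewness before and after the transform and shows that np.expm1 maps values back to the original scale.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for a long-tailed price column (not the competition data).
price = pd.Series(rng.lognormal(mean=8.0, sigma=1.0, size=10_000))

log_price = np.log1p(price)     # log(1 + price) stays finite even for a price of 0
restored = np.expm1(log_price)  # inverse transform, back to the price scale

print("skewness before:", price.skew())
print("skewness after: ", log_price.skew())
```

A model trained on the transformed label predicts in log space, so its predictions need np.expm1 applied before they are compared with real prices.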
