Data Mining for Beginners - Task 4: Modeling and Parameter Tuning

Contents

1. Learning Objectives

2. Preparation

3. Simple Modeling with Linear Regression

3.1 Fitting a simple linear regression

3.2 Checking the results and adjusting accordingly

3.3 K-fold cross-validation

3.4 Simulating a realistic business scenario

3.5 Plotting learning curves and validation curves

4. Comparing Multiple Models

4.1 Preprocessing

4.2 Linear models and embedded feature selection

4.3 Nonlinear models

5. Model Tuning (Using LightGBM as an Example)

5.1 Greedy tuning

5.2 Grid search

5.3 Bayesian optimization

6. References

1. Learning Objectives

1. Get to know commonly used machine learning models and master the modeling and tuning workflow

2. Linear regression models

3. Model performance validation

4. Embedded feature selection

5. Comparing the performance of different models

6. Tuning methods: greedy search, grid search, and Bayesian optimization

The full project is available at https://github.com/datawhalechina/team-learning

2. Preparation

Before building any models, besides importing the required modules and loading the data, we also downcast each feature's dtype to shrink the DataFrame's memory footprint.

import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

def reduce_mem_usage(df):
    '''
    Downcast column dtypes to reduce memory usage.
    The idea: convert each numeric feature to the smallest type
    that can hold its value range, and convert string features
    to the categorical dtype.
    '''
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # Pick the narrowest integer type whose range covers the column
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # Likewise for floats; note that float16 trades precision for space
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            # Non-numeric (object) columns become pandas categoricals
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df

sample_feature = reduce_mem_usage(pd.read_csv(r'./data/data_for_tree.csv'))

The output is:

Memory usage of dataframe is 59.22 MB
Memory usage after optimization is: 15.76 MB
Decreased by 73.4%

Memory usage drops by 73.4%!
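As an aside, pandas ships built-in downcasting helpers that achieve much the same effect. The sketch below is our own addition rather than part of the original tutorial; it delegates the range checks to pd.to_numeric's downcast argument:

import pandas as pd

def reduce_mem_builtin(df):
    # Same idea as reduce_mem_usage above, but letting pandas
    # choose the narrowest numeric type via pd.to_numeric.
    for col in df.columns:
        if df[col].dtype == object:
            df[col] = df[col].astype('category')
        elif str(df[col].dtype).startswith('int'):
            df[col] = pd.to_numeric(df[col], downcast='integer')
        elif str(df[col].dtype).startswith('float'):
            # downcast='float' stops at float32, so it avoids
            # the precision loss of float16
            df[col] = pd.to_numeric(df[col], downcast='float')
    return df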

3. Simple Modeling with Linear Regression

3.1 Fitting a simple linear regression

The code is as follows:

from sklearn.linear_model import LinearRegression

sample_feature = sample_feature.dropna().replace('-', 0).reset_index(drop = True)
sample_feature['notRepairedDamage'] = \
    sample_feature['notRepairedDamage'].astype(np.float32)
continuous_feature_names = [x for x in sample_feature.columns \
                            if x not in ['price', 'brand', 'model']]
train = sample_feature[continuous_feature_names + ['price']]

X_train = train[continuous_feature_names]
y_train = train['price']

# Note: the normalize argument was removed from LinearRegression in
# scikit-learn 1.2; on newer versions, scale the features in a Pipeline
# instead (see the sketch after the output below)
lr = LinearRegression(normalize = True)
model = lr.fit(X_train, y_train)

# Inspect the intercept and the learned weights (coef) of the fitted model
print('intercept:'+ str(model.intercept_))

sorted(dict(zip(continuous_feature_names, model.coef_)).items(), \
       key = lambda x: x[1], reverse = True)

The output is:

intercept:-110670.6827721497
[('v_6', 3367064.3416418717),
 ('v_8', 700675.5609399063),
 ('v_9', 170630.2772322219),
 ('v_7', 32322.66193203625),
 ('v_12', 20473.670796956616),
 ('v_3', 17868.079541493582),
 ('v_11', 11474.938996702811),
 ('v_13', 11261.764560014222),
 ('v_10', 2683.9200905932366),
 ('gearbox', 881.8225039247454),
 ('fuelType', 363.90425072159144),
 ('bodyType', 189.60271012069165),
 ('city', 44.949751205222555),
 ('power', 28.553901616746646),
 ('brand_price_median', 0.5103728134080039),
 ('brand_price_std', 0.450363470926374),
 ('brand_amount', 0.1488112039506524),
 ('brand_price_max', 0.003191018670311645),
 ('SaleID', 5.355989919856515e-05),
 ('train', -1.0244548320770264e-07),
 ('offerType', -2.930755726993084e-07),
 ('seller', -2.7147470973432064e-06),
 ('brand_price_sum', -2.175006868187502e-05),
 ('name', -0.00029800127130996705),
 ('used_time', -0.0025158943328600102),
 ('brand_price_average', -0.40490484510127067),
 ('brand_price_min', -2.246775348689046),
 ('power_bin', -34.42064411722464),
 ('v_14', -274.7841180775971),
 ('kilometer', -372.8975266606936),
 ('notRepairedDamage', -495.19038446280786),
 ('v_0', -2045.0549573554758),
 ('v_5', -11022.98624049396),
 ('v_4', -15121.731109856253),
 ('v_2', -26098.299920522953),
 ('v_1', -45556.18929727541)]
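A compatibility note: normalize was removed from LinearRegression in scikit-learn 1.2, so on newer versions the equivalent pattern is to put a scaler in front of the estimator in a Pipeline. The sketch below is our own addition, not part of the original tutorial. StandardScaler divides by the standard deviation whereas normalize divided by the column L2 norm, so the fitted coefficients differ in scale, but for plain least squares the predictions are unaffected by column scaling:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Scaling happens inside the pipeline, so raw X_train is passed as-is
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_train, y_train)
print('intercept:', pipe.named_steps['linearregression'].intercept_)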

3.2 Checking the results and adjusting accordingly

from matplotlib import pyplot as plt

subsample_index = np.random.randint(low = 0, high = len(y_train), size = 50)

# Scatter plot of feature v_9 against the label for a random subsample:
# true prices in black, model predictions in blue
plt.scatter(X_train['v_9'][subsample_index], \
            y_train[subsample_index], color = 'black')
plt.scatter(X_train['v_9'][subsample_index], \
            model.predict(X_train.loc[subsample_index]), color = 'blue')
plt.xlabel('v_9')
plt.ylabel('price')
plt.legend(['True Price','Predicted Price'],loc = 'upper right')
plt.show()
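Besides the visual check, it helps to put a number on the fit. This quick check is our own addition, using scikit-learn's mean_absolute_error on the training set:

from sklearn.metrics import mean_absolute_error

# Quantify the in-sample error that the scatter plot only hints at
print('train MAE:', mean_absolute_error(y_train, model.predict(X_train)))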

The scatter plot shows that the model's predictions (blue points) deviate considerably from the true labels (black points), and some predicted prices are even negative. Clearly the model has problems. The plots below reveal why: the label (price) follows a long-tailed distribution, which is unfavorable for modeling. Many models assume that the error term is normally distributed, and a long-tailed target violates that assumption, so the fit suffers [1].

import seaborn as sns

plt.figure(figsize = (15, 5))
plt.subplot(1, 2, 1)
sns.distplot(y_train)  # distplot is deprecated in seaborn >= 0.11; histplot/displot are the modern equivalents
plt.subplot(1, 2, 2)
# Drop the top 10% of prices to show the bulk of the distribution
sns.distplot(y_train[y_train < np.quantile(y_train, 0.9)])
plt.show()
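A common remedy for a long-tailed target is to train on log(1 + price), which pulls the distribution toward normal, and to invert the transform at prediction time. The sketch below is our own illustration of that idea, reusing the X_train and y_train defined above:

# Fit on the log-transformed target so the normality assumption holds better
y_train_log = np.log1p(y_train)
model_log = LinearRegression().fit(X_train, y_train_log)

# Invert the transform to get predictions back on the price scale;
# in practice this also eliminates the negative predicted prices
pred_price = np.expm1(model_log.predict(X_train))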