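Data Mining Starter Project: Used-Car Price Prediction, Modeling and Hyperparameter Tuning (Zero-Basics Introduction to Data Mining)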

Article link: https://blog.youkuaiyun.com/David_house/article/details/137403721

Table of Contents

Goal
Steps
Summary

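The dataset used in this article comes from Alibaba Tianchi: https://tianchi.aliyun.com/competition/entrance/231784/information
The workflow mainly follows the one published by Datawhale: https://tianchi.aliyun.com/notebook/95460
This is also my first time working with data mining, so I first followed the Datawhale tutorial step by step and added a few notes of my own where things were unclear. Thanks, Datawhale!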

Goal

Get to know the commonly used machine learning models, and learn the workflow for building and tuning them.

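Steps

1. Adjust data types to reduce the memory the data occupies

The method is defined as follows:
Loop over every column and convert it to the smallest data type whose range still holds all of its values, so that each column of the DataFrame uses as little memory as possible. Note that downcasting floats to float16 keeps the values in range but reduces their precision.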

import numpy as np
import pandas as pd

def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.
    """
    # memory_usage() returns the bytes used by each column and sum() adds them up;
    # dividing by 1024**2 converts bytes to MB so the printed unit is correct.
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    # Loop over every column
    for col in df.columns:
        col_type = df[col].dtype  # dtype of the current column
        if col_type != object:
            # Minimum and maximum values of the current column
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # np.int8 is NumPy's 8-bit integer type.
                # np.iinfo(np.int8) returns an object describing that type,
                # and its .min / .max attributes give the smallest / largest representable value.
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # np.finfo is the floating-point counterpart of np.iinfo
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')  # convert object columns to the category type to save memory
    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df
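
To make the range checks in the function above more concrete, here is a small sketch that prints the bounds the comparisons rely on. The float16 part is my own illustration, not something from the Datawhale tutorial: it shows that a value can fit inside the float16 range and still be rounded.

```python
import numpy as np

# Integer ranges that the downcasting checks compare c_min / c_max against.
int_ranges = {t.__name__: (np.iinfo(t).min, np.iinfo(t).max)
              for t in (np.int8, np.int16, np.int32)}
for name, (lo, hi) in int_ranges.items():
    print(name, lo, hi)

# Largest finite float16 value: anything below it passes the float16 range check.
f16_max = float(np.finfo(np.float16).max)

# 2049.0 is well inside that range, yet float16 cannot represent it exactly,
# so the stored value is rounded to a neighbouring representable number.
rounded = float(np.float16(2049.0))
print(f16_max, rounded)
```

If the exact values of a float column matter for later steps, it may be safer to stop at float32 for that column.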

Call the function above to see the effect.
Here, data_for_tree.csv holds the features that we lightly processed in the feature-engineering step.

sample_feature = reduce_mem_usage(pd.read_csv('data_for_tree.csv'))
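
Since data_for_tree.csv is not included here, the effect can also be checked on a tiny made-up DataFrame. This is only a sketch: the column names `value` and `label` are hypothetical and the data is synthetic. It also uses `memory_usage(deep=True)` for the text column, because without `deep=True` pandas counts only the pointers of an object column and not the strings themselves.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 1000 small integers stored as int64.
toy = pd.DataFrame({"value": np.arange(1000, dtype=np.int64) % 100})
before = toy["value"].memory_usage(index=False)   # bytes used as int64
toy["value"] = toy["value"].astype(np.int8)       # all values fit in int8
after = toy["value"].memory_usage(index=False)    # bytes used as int8
print(before, after)

# An object column with few distinct strings shrinks a lot as a category.
labels = pd.Series(["manual", "automatic"] * 500, dtype=object)
obj_bytes = labels.memory_usage(index=False, deep=True)
cat_bytes = labels.astype("category").memory_usage(index=False, deep=True)
print(obj_bytes, cat_bytes)
```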


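2. Simple modeling with linear regression

The features above were originally saved for analysis with tree models, so missing values were left untreated. We handle them in a simple way here.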

sample_feature.head()

You can see that the notRepairedDamage column contains the anomalous value '-':

sample_feature = sample_feature.dropna().replace('-', 0).reset_index(drop=True)
sample_feature['notRepairedDamage'] = sample_feature['notRepairedDamage'].astype(np.float32)
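
The same cleaning idea can be tried on a toy frame to see what each step does. This is a sketch under my own assumptions: the rows are made up, and instead of replacing '-' with the integer 0 it replaces it with the string '0' and then parses the column with `pd.to_numeric`, which is a slightly different route to the same float32 result.

```python
import numpy as np
import pandas as pd

# Made-up rows: one '-' placeholder and one missing value.
toy = pd.DataFrame({
    "notRepairedDamage": ["0.0", "-", "1.0", None],
    "price": [1500, 3200, 800, 4100],
})

# Drop rows with missing values, then renumber the index from 0.
cleaned = toy.dropna().reset_index(drop=True)

# Turn the '-' placeholder into '0', parse the strings as numbers, and downcast.
cleaned["notRepairedDamage"] = pd.to_numeric(
    cleaned["notRepairedDamage"].replace("-", "0")
).astype(np.float32)
print(cleaned)
```

Treating '-' as 0 is a simplification carried over from the step above; whether "unknown" should count as "no damage" is a modeling choice.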

Build the training data and the labels:

train_X = sample_feature.drop('price',axis=1)
train_y = sample_feature['price']

A simple model:

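from sklearn.linear_model import LinearRegression
model = LinearRegression()
model = model.fit(train_X, train_y)
# Print the model's intercept (the constant term).
print('intercept:' + str(model.intercept_))
# Print the coefficients, sorted from largest to smallest.
# train_X.columns holds the feature names in the order the model saw them
# (sample_feature.columns also contains 'price', which can shift the pairing).
# model.coef_ holds the fitted linear-regression coefficients.
# zip(train_X.columns, model.coef_) pairs each feature name with its coefficient.
print(sorted(dict(zip(train_X.columns, model.coef_)).items(), key=lambda x: x[1], reverse=True))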
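
Pairing coefficients with names is only meaningful when the names come from the same matrix that was passed to `fit`. The sketch below uses synthetic data with made-up columns `a` and `b`, where the true relationship is known, so the recovered coefficients can be checked by eye. The `price` column deliberately sits in the middle to show why the feature names should be taken from the feature matrix rather than from the full table.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): price = 3*a - 2*b + 5.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"a": rng.normal(size=50), "price": 0.0, "b": rng.normal(size=50)})
toy["price"] = 3.0 * toy["a"] - 2.0 * toy["b"] + 5.0

X = toy.drop("price", axis=1)   # feature matrix, columns: a, b
y = toy["price"]                # label
model = LinearRegression().fit(X, y)

# Use X.columns, not toy.columns, so each name lines up with its coefficient.
coef_by_feature = dict(zip(X.columns, model.coef_))
ranked = sorted(coef_by_feature.items(), key=lambda kv: kv[1], reverse=True)
print("intercept:", model.intercept_)
print(ranked)
```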