kaggle学习二——丢失数据的处理

本文介绍了几种处理数据集中缺失值的有效方法,包括直接删除、使用均值或中位数插补,并介绍了如何利用Scikit-Learn进行插值操作。此外还提出了一种进阶策略,即记录哪些值被插补过,这种方法有时可以显著提高预测模型的表现。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >

1.直接丢弃

当然一般效果不好

2.采用插值的方式处理

均值,中值等

sklearn.impute

3.插值的加强版

(说实话,没有太看懂)

Imputation is the standard approach, and it usually works well. However, imputed values may by systematically above or below their actual values (which weren't collected in the dataset). Or rows with missing values may be unique in some other way. In that case, your model would make better predictions by considering which values were originally missing. Here's how it might look:

# make copy to avoid changing original data (when Imputing)
new_data = original_data.copy()

# make new columns indicating what will be imputed
cols_with_missing = (col for col in new_data.columns 
                                 if new_data[col].isnull().any())
for col in cols_with_missing:
    new_data[col + '_was_missing'] = new_data[col].isnull()

# Imputation
my_imputer = SimpleImputer()
new_data = pd.DataFrame(my_imputer.fit_transform(new_data))
new_data.columns = original_data.columns

In some cases this approach will meaningfully improve results. In other cases, it doesn't help at all.

 

 

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值