Kaggle Learning (2): Handling Missing Data

License: CC 4.0 BY-SA

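This post covers several ways to handle missing values in a dataset: dropping them outright, imputing them with the mean or median, and doing the imputation with Scikit-Learn. It also describes a more advanced strategy, recording which values were imputed, which can sometimes noticeably improve a predictive model's performance.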

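1. Drop the missing data directly

Unsurprisingly, this usually does not work well, because everything else stored in the dropped rows or columns is thrown away too.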
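A minimal sketch of the drop approach with pandas. The DataFrame, its column names, and its values are made up for illustration and are not from the original post:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, only for illustration
df = pd.DataFrame({
    "rooms": [3, 2, np.nan, 4],
    "area": [70.0, 55.0, 80.0, 120.0],
    "year_built": [1990, np.nan, np.nan, 2005],
})

# Option A: drop every column that contains at least one missing value
cols_with_missing = [col for col in df.columns if df[col].isnull().any()]
dropped_cols = df.drop(columns=cols_with_missing)

# Option B: drop every row that contains at least one missing value
dropped_rows = df.dropna(axis=0)

print(dropped_cols.columns.tolist())  # only the fully populated column survives
print(len(dropped_rows))              # only the fully populated rows survive
```

Either way a lot of data is lost in this toy example: two of three columns, or two of four rows.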

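2. Impute the missing values

Fill each missing value with the column mean, median, or a similar statistic.

scikit-learn provides this in sklearn.impute (for example, SimpleImputer).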
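A minimal sketch of mean and median imputation with SimpleImputer from sklearn.impute. The toy DataFrame is hypothetical; the strategy values "mean" and "median" are the ones scikit-learn documents for numeric data:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data, only for illustration
df = pd.DataFrame({
    "rooms": [3.0, 2.0, np.nan, 4.0],
    "area": [70.0, 55.0, 80.0, np.nan],
})

# fit_transform returns a NumPy array, so column names and index are restored by hand
mean_imputer = SimpleImputer(strategy="mean")
mean_filled = pd.DataFrame(mean_imputer.fit_transform(df),
                           columns=df.columns, index=df.index)

median_imputer = SimpleImputer(strategy="median")
median_filled = pd.DataFrame(median_imputer.fit_transform(df),
                             columns=df.columns, index=df.index)

print(mean_filled)
print(median_filled)
```

With separate training and validation sets, the usual pattern is to call fit_transform on the training data and only transform on the validation data, so that the fill values come from the training set alone.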

3. An extension to imputation

(To be honest, I did not fully understand this one.)

Imputation is the standard approach, and it usually works well. However, imputed values may be systematically above or below their actual values (which weren't collected in the dataset). Or rows with missing values may be unique in some other way. In that case, your model would make better predictions by considering which values were originally missing. Here's how it might look:

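import pandas as pd
from sklearn.impute import SimpleImputer

# make a copy to avoid changing the original data (when imputing)
new_data = original_data.copy()

# make new columns indicating what will be imputed
# (a list, so the column names are fixed before new columns are added)
cols_with_missing = [col for col in new_data.columns
                     if new_data[col].isnull().any()]
for col in cols_with_missing:
    new_data[col + '_was_missing'] = new_data[col].isnull()

# Imputation (the default strategy is the column mean, so all columns must be numeric)
my_imputer = SimpleImputer()
# keep the current column names: they include the new *_was_missing columns,
# so reusing original_data.columns here would raise a length-mismatch error
imputed_columns = new_data.columns
new_data = pd.DataFrame(my_imputer.fit_transform(new_data),
                        columns=imputed_columns, index=new_data.index)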

In some cases this approach will meaningfully improve results. In other cases, it doesn't help at all.
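As an alternative sketch of the same idea, reasonably recent versions of scikit-learn let SimpleImputer append the "was missing" indicator columns itself through its add_indicator parameter. The toy data and the _was_missing naming below are assumptions chosen to mirror the manual version above:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data, only for illustration
df = pd.DataFrame({
    "rooms": [3.0, 2.0, np.nan, 4.0],
    "area": [70.0, 55.0, 80.0, 120.0],
})

# add_indicator=True appends one indicator column per feature
# that had missing values when the imputer was fitted
imputer = SimpleImputer(strategy="mean", add_indicator=True)
values = imputer.fit_transform(df)

# indicator_.features_ holds the positions of the columns that had missing values
indicator_names = [f"{df.columns[i]}_was_missing"
                   for i in imputer.indicator_.features_]
result = pd.DataFrame(values,
                      columns=list(df.columns) + indicator_names,
                      index=df.index)

print(result)
```

The indicator columns come back as floats (1.0 for "was missing", 0.0 otherwise), which most models accept directly.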