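Filling in missing values with sklearn.impute.SimpleImputer: worked examples

This article shows how to handle missing values in a dataset with scikit-learn's SimpleImputer. It covers imputation when the missing values sit at different positions in the data, using the mean and fixed-value (constant) strategies, with concrete examples.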


I. Environment
Python 3.7.3 (Anaconda 3)
scikit-learn 0.20.3 (sklearn.__version__ is '0.20.3')

II. Method
Impute the missing values in the data.
Official documentation: https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html

III. Examples
1. Data - missing values - data

# Missing values in the middle of the data
>>> import numpy as np
>>> from sklearn.impute import SimpleImputer
>>> data1 = np.loadtxt("/test/data/values_nan_values.csv", delimiter=',', dtype='str')
>>> data1_values = data1[:,0:3]
>>> data1_values
array([['0.38566807663597913', '0.36519607843137253',
        '0.2923452768729642'],
       ['0.39537198308036825', '0.3705436720142602',
        '0.29218241042345283'],
       ['0.4257277929833292', '0.3794563279857397',
        '0.30846905537459285'],
       ['0.41403334162727046', '0.3600713012477718',
        '0.3185667752442997'],
       ['0.3894003483453596', '0.39327094474153296',
        '0.3210097719869707'],
       ['0.41652152276685744', '0.14884135472370766',
        '0.25374592833876225'],
       ['nan', 'nan', 'nan'],
       ['nan', 'nan', 'nan'],
       ['nan', 'nan', 'nan'],
       ['nan', 'nan', 'nan'],
       ['nan', 'nan', 'nan'],
       ['nan', 'nan', 'nan'],
       ['nan', 'nan', 'nan'],
       ['nan', 'nan', 'nan'],
       ['0.4088081612341379', '0.3832442067736185', '0.2571661237785017'],
       ['0.40980343368997263', '0.3794563279857397',
        '0.25195439739413683'],
       ['0.4177656133366509', '0.3765597147950089',
        '0.24739413680781763'],
       ['0.4180144314506096', '0.3790106951871658',
        '0.24739413680781763'],
       ['0.4145309778551879', '0.3807932263814616', '0.2478827361563518'],
       ['0.4120427967156009', '0.3834670231729055', '0.2526058631921824']],
      dtype='<U19')
>>> imputation_transformer1 = SimpleImputer(missing_values=np.nan, strategy="mean")
>>> values_nan_values1 = imputation_transformer1.fit_transform(data1_values)
>>> values_nan_values1
array([[0.38566808, 0.36519608, 0.29234528],
       [0.39537198, 0.37054367, 0.29218241],
       [0.42572779, 0.37945633, 0.30846906],
       [0.41403334, 0.3600713 , 0.31856678],
       [0.38940035, 0.39327094, 0.32100977],
       [0.41652152, 0.14884135, 0.25374593],
       [0.40897404, 0.35832591, 0.27422638],
       [0.40897404, 0.35832591, 0.27422638],
       [0.40897404, 0.35832591, 0.27422638],
       [0.40897404, 0.35832591, 0.27422638],
       [0.40897404, 0.35832591, 0.27422638],
       [0.40897404, 0.35832591, 0.27422638],
       [0.40897404, 0.35832591, 0.27422638],
       [0.40897404, 0.35832591, 0.27422638],
       [0.40880816, 0.38324421, 0.25716612],
       [0.40980343, 0.37945633, 0.2519544 ],
       [0.41776561, 0.37655971, 0.24739414],
       [0.41801443, 0.3790107 , 0.24739414],
       [0.41453098, 0.38079323, 0.24788274],
       [0.4120428 , 0.38346702, 0.25260586]])
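
The session above loads the CSV as strings and relies on SimpleImputer to cast them to floats. A minimal sketch of an alternative, assuming the file holds only numbers and the literal text nan, is to load it as floats from the start. The CSV content below is a small made-up stand-in for values_nan_values.csv, read from an in-memory io.StringIO rather than the real path. The keyword-argument form of SimpleImputer is used because newer scikit-learn releases do not accept these arguments positionally.

```python
import io

import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for the CSV file: three columns, two missing rows in the middle.
csv_text = """0.38,0.36,0.29
0.39,0.37,0.29
nan,nan,nan
nan,nan,nan
0.40,0.38,0.25
0.41,0.37,0.25
"""

# dtype=float makes NumPy parse the text "nan" into a real np.nan.
data = np.loadtxt(io.StringIO(csv_text), delimiter=",", dtype=float)

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(data)

print(data.dtype)              # float64
print(np.isnan(filled).any())  # False
print(filled[2])               # column means of the four observed rows
```

Loading as floats also avoids the string-dtype problem that shows up later with the constant strategy.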

2. Data - missing values

# Missing values in the second half of the data
>>> import numpy as np
>>> from sklearn.impute import SimpleImputer
>>> data2 = np.loadtxt("/test/data/values_nan.csv", delimiter=',', dtype='str')
>>> data2_values = data2[:,0:3]
>>> data2_values
array([['0.38566807663597913', '0.36519607843137253',
        '0.2923452768729642'],
       ['0.39537198308036825', '0.3705436720142602',
        '0.29218241042345283'],
       ['0.4257277929833292', '0.3794563279857397',
        '0.30846905537459285'],
       ['0.41403334162727046', '0.3600713012477718',
        '0.3185667752442997'],
       ['0.3894003483453596', '0.39327094474153296',
        '0.3210097719869707'],
       ['0.41652152276685744', '0.14884135472370766',
        '0.25374592833876225'],
       ['nan', 'nan', 'nan'],
       ['nan', 'nan', 'nan'],
       ['nan', 'nan', 'nan'],
       ['nan', 'nan', 'nan'],
       ['nan', 'nan', 'nan'],
       ['nan', 'nan', 'nan'],
       ['nan', 'nan', 'nan'],
       ['nan', 'nan', 'nan']], dtype='<U19')
>>> imputation_transformer2 = SimpleImputer(missing_values=np.nan, strategy="mean")
>>> values_nan = imputation_transformer2.fit_transform(data2_values)
>>> values_nan
array([[0.38566808, 0.36519608, 0.29234528],
       [0.39537198, 0.37054367, 0.29218241],
       [0.42572779, 0.37945633, 0.30846906],
       [0.41403334, 0.3600713 , 0.31856678],
       [0.38940035, 0.39327094, 0.32100977],
       [0.41652152, 0.14884135, 0.25374593],
       [0.40445384, 0.33622995, 0.29771987],
       [0.40445384, 0.33622995, 0.29771987],
       [0.40445384, 0.33622995, 0.29771987],
       [0.40445384, 0.33622995, 0.29771987],
       [0.40445384, 0.33622995, 0.29771987],
       [0.40445384, 0.33622995, 0.29771987],
       [0.40445384, 0.33622995, 0.29771987],
       [0.40445384, 0.33622995, 0.29771987]])
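
As a cross-check on the output above, the fill value of each column under the mean strategy should be the mean of that column's observed entries. The sketch below recomputes it with np.nanmean from the six observed rows printed in the session. Only NumPy is used, and the variable names are illustrative.

```python
import numpy as np

# The six observed rows from the session above.
observed = np.array([
    [0.38566807663597913, 0.36519607843137253, 0.2923452768729642],
    [0.39537198308036825, 0.3705436720142602, 0.29218241042345283],
    [0.4257277929833292, 0.3794563279857397, 0.30846905537459285],
    [0.41403334162727046, 0.3600713012477718, 0.3185667752442997],
    [0.3894003483453596, 0.39327094474153296, 0.3210097719869707],
    [0.41652152276685744, 0.14884135472370766, 0.25374592833876225],
])

# Append eight all-NaN rows, as in the second half of the file.
data = np.vstack([observed, np.full((8, 3), np.nan)])

# nanmean ignores NaN entries, so this is the mean of the observed rows only.
column_means = np.nanmean(data, axis=0)
print(np.round(column_means, 8))
```

The printed means should agree with the values that fill the last eight rows of the imputed array.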

3. Missing values - data

# Missing values in the first half of the data
>>> import numpy as np
>>> from sklearn.impute import SimpleImputer
>>> data3 = np.loadtxt("/test/data/nan_values.csv", delimiter=',', dtype='str')
>>> data3_values = data3[:,0:3]
>>> data3_values
array([['nan', 'nan', 'nan'],
       ['nan', 'nan', 'nan'],
       ['nan', 'nan', 'nan'],
       ['nan', 'nan', 'nan'],
       ['nan', 'nan', 'nan'],
       ['nan', 'nan', 'nan'],
       ['nan', 'nan', 'nan'],
       ['nan', 'nan', 'nan'],
       ['0.4088081612341379', '0.3832442067736185', '0.2571661237785017'],
       ['0.40980343368997263', '0.3794563279857397',
        '0.25195439739413683'],
       ['0.4177656133366509', '0.3765597147950089',
        '0.24739413680781763'],
       ['0.4180144314506096', '0.3790106951871658',
        '0.24739413680781763'],
       ['0.4145309778551879', '0.3807932263814616', '0.2478827361563518'],
       ['0.4120427967156009', '0.3834670231729055', '0.2526058631921824']],
      dtype='<U19')
>>> imputation_transformer3 = SimpleImputer(missing_values=np.nan, strategy="mean")
>>> nan_values3 = imputation_transformer3.fit_transform(data3_values)
>>> nan_values3
array([[0.41349424, 0.38042187, 0.2507329 ],
       [0.41349424, 0.38042187, 0.2507329 ],
       [0.41349424, 0.38042187, 0.2507329 ],
       [0.41349424, 0.38042187, 0.2507329 ],
       [0.41349424, 0.38042187, 0.2507329 ],
       [0.41349424, 0.38042187, 0.2507329 ],
       [0.41349424, 0.38042187, 0.2507329 ],
       [0.41349424, 0.38042187, 0.2507329 ],
       [0.40880816, 0.38324421, 0.25716612],
       [0.40980343, 0.37945633, 0.2519544 ],
       [0.41776561, 0.37655971, 0.24739414],
       [0.41801443, 0.3790107 , 0.24739414],
       [0.41453098, 0.38079323, 0.24788274],
       [0.4120428 , 0.38346702, 0.25260586]])
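
After fitting, the per-column fill values can be read from the imputer's statistics_ attribute, which is a convenient way to confirm what was learned. A minimal sketch using the six observed rows from the session above, entered directly as floats rather than loaded from nan_values.csv:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# The six observed rows from the session above.
observed = np.array([
    [0.4088081612341379, 0.3832442067736185, 0.2571661237785017],
    [0.40980343368997263, 0.3794563279857397, 0.25195439739413683],
    [0.4177656133366509, 0.3765597147950089, 0.24739413680781763],
    [0.4180144314506096, 0.3790106951871658, 0.24739413680781763],
    [0.4145309778551879, 0.3807932263814616, 0.2478827361563518],
    [0.4120427967156009, 0.3834670231729055, 0.2526058631921824],
])

# Eight all-NaN rows come first, as in the file.
data = np.vstack([np.full((8, 3), np.nan), observed])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(data)

# statistics_ holds the learned fill value for each column.
print(imputer.statistics_)
# Every originally missing row now equals those fill values.
print(np.allclose(filled[:8], imputer.statistics_))
```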

4. Missing values only

# Every value in a data file is missing
>>> import numpy as np
>>> from sklearn.impute import SimpleImputer
>>> data4 = np.loadtxt("/test/data/nan.csv", delimiter=',', dtype='str')
>>> data4_values = data4[:,0:3]
>>> data4_values
array([['nan', 'nan', 'nan'],
       ['nan', 'nan', 'nan'],
       ['nan', 'nan', 'nan'],
       ['nan', 'nan', 'nan'],
       ['nan', 'nan', 'nan'],
       ['nan', 'nan', 'nan'],
       ['nan', 'nan', 'nan'],
       ['nan', 'nan', 'nan']], dtype='<U5')
>>> imputation_transformer4 = SimpleImputer(missing_values=np.nan, strategy="mean")
>>> nan = imputation_transformer4.fit_transform(data4_values)
>>> nan
array([], shape=(8, 0), dtype=float64)

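The fourth case is a special one. When loading different kinds of data from several data files, you may find that a particular field in one file consists entirely of nan values; this does happen with sensor data. sklearn.impute.SimpleImputer can handle this case: with the mean strategy, a column that contains only missing values has no mean to compute, so the column is dropped and the result is an empty array of shape (8, 0). SimpleImputer can also fill such values with a fixed value instead.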
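
Because an entirely missing column silently disappears from the output here, it can help to detect such columns before imputing. A minimal NumPy-only sketch follows; the array contents and names are made up for illustration.

```python
import numpy as np

# Illustrative data: column 1 has observed values, columns 0 and 2 are entirely missing.
data = np.array([
    [np.nan, 0.36, np.nan],
    [np.nan, np.nan, np.nan],
    [np.nan, 0.38, np.nan],
])

# True for each column in which every entry is NaN.
all_missing = np.isnan(data).all(axis=0)
print(all_missing)  # [ True False  True]

# Indices of the columns a mean-based imputer would have nothing to learn from.
empty_columns = np.flatnonzero(all_missing)
print(empty_columns)  # [0 2]
```

With the indices in hand, the caller can decide whether to skip the file, drop those columns explicitly, or fill them with a fixed value.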

5. Replacing missing values with a fixed value

# Set the strategy parameter to "constant" and fill_value to the desired value, e.g. 0
>>> imputation_transformer5 = SimpleImputer(missing_values=np.nan, strategy="constant", fill_value=0)
>>> values_nan_values5 = imputation_transformer5.fit_transform(data5_values)

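This should simply replace every nan with 0, but it raises an error instead:
"with an object dtype.".format(X.dtype))
ValueError: SimpleImputer does not support data with dtype <U5. Please provide either a numeric array (with a floating point or integer dtype) or categorical data represented either as an array with integer dtype or an array of string values with an object dtype.
Converting the nan values in the original data to strings with astype() produces the same error. The message points at the cause: the array was loaded with dtype='str', so it has the Unicode string dtype <U5, and SimpleImputer accepts only numeric arrays or string arrays with an object dtype.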
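
One way around this error, sketched under the assumption that the array holds only numeric text and the string 'nan', is to cast it to float before imputing. The small string array below is a made-up stand-in for data5_values, which the original does not show. Each column keeps at least one observed value, since the treatment of entirely missing columns may differ between scikit-learn versions.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for data5_values: a string array, as produced by dtype='str'.
data5_values = np.array([
    ["0.38", "nan", "0.29"],
    ["nan", "0.37", "nan"],
    ["0.41", "nan", "0.25"],
])

# Cast to float so the text "nan" becomes a real np.nan.
data5_float = data5_values.astype(float)

imputer = SimpleImputer(missing_values=np.nan, strategy="constant", fill_value=0)
filled = imputer.fit_transform(data5_float)
print(filled)
```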

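### Handling missing values with the `SimpleImputer` class

Datasets in machine-learning projects often contain missing values. To keep model training effective and accurate, these missing values need to be handled appropriately. Scikit-Learn provides a class named `SimpleImputer` for this purpose.

#### Import the required libraries

First, import the required Python libraries:

```python
import numpy as np
from sklearn.impute import SimpleImputer
```

#### Create and configure a `SimpleImputer` instance

Several parameters can be set when creating a `SimpleImputer` object. The `missing_values` parameter defines which value counts as "missing", and `strategy` decides how missing values are filled (mean, median, or most frequent value). A simple example of initializing the object[^2]:

```python
imputer = SimpleImputer(
    missing_values=np.nan,  # the placeholder treated as missing; defaults to np.nan
    strategy='mean',        # 'mean' (default), 'median' or 'most_frequent';
                            # 'constant' fills with a fixed value
)
```

#### Prepare the data to process

Suppose a data matrix X contains some NaN values:

```python
X = [
    [7, 2, 3],
    [4, np.nan, 6],   # row 2, column 2 is NaN
    [10, 3, 9],
    [7, 3, np.nan],   # row 4, column 3 is NaN
]
print("Original data:")
for row in X:
    print(row)
```

#### Fit and transform

Calling `.fit()` lets the imputer learn the replacement value for each column, and `.transform()` then applies it to the data to fill in the missing values; `.fit_transform()` does both in one call[^5]:

```python
# Fit the imputer and transform the data
filled_X = imputer.fit_transform(X)
print("\nData after SimpleImputer:")
for row in filled_X.tolist():
    print([round(x, 2) if isinstance(x, float) else x for x in row])
```

The code above prints the array after imputation: the entries that were missing have been replaced by the mean of the corresponding feature column.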
Author: csdn-WJW
