Data Preprocessing: Handling Missing Values

ITLiu_JH

Last modified: 2022-03-08 09:18:00

License: CC 4.0 BY-SA

Category: Introduction to Data Analysis. Tags: sklearn, python, machine learning

First published: 2022-02-28 08:36:08

Article link: https://blog.youkuaiyun.com/it_liujh/article/details/123085692

Introduction to Data Preprocessing

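Data collected in practice usually has quality problems: inconsistency in the raw data (for example, different data sources or different units of measurement), noise (for example, acquisition devices with poor interference resistance, or manual entry errors), and missing or incomplete values (for example, partially completed questionnaires or acquisition device failures).
Data quality directly affects modeling results, so the data needs appropriate preprocessing before a model is built.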

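Missing value handling: real data often contains missing values for various reasons; deletion or imputation is used to obtain a complete dataset.
Outlier detection and handling: detect samples that clearly deviate from the other samples in the dataset, so that the analysis works with high-quality data.
Standardization: many machine learning algorithms expect standardized input features; if the scales of the features differ too much, the results will be biased.
Feature encoding: model inputs usually need to be numeric, so non-numeric features must be converted to numeric ones.
Discretization: reduce the number of distinct values a feature takes while losing as little information as possible.
Dimensionality reduction: reduce the number of dimensions to make data processing more efficient and visualization easier.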

1. Handling missing values

1) Imputation

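sklearn

sklearn.impute.SimpleImputer
Usage of the missing value imputation class in Scikit-learn:
SimpleImputer(
    missing_values=np.nan, # placeholder that marks a missing value
    strategy='mean', # imputation strategy
    fill_value=None, # constant used when strategy="constant"
    verbose=0, # verbosity of the imputer; deprecated in scikit-learn 1.1 and removed in 1.3
    copy=True # True: impute on a copy of X; False: impute in place where possible (a copy can still be made in some cases)
)
strategy:
"mean": replace missing values with the column mean; numeric data only, typically continuous features. Mean imputation reduces the variance of the feature.
"median": replace missing values with the column median; numeric data only.
"most_frequent": replace missing values with the column mode; typically used for discrete data, and it also works for strings.
"constant": replace missing values with fill_value.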
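
The four strategies can be compared side by side on a toy column. The sketch below is only an illustration: the values in X and the constant 0.0 are made up, and the variance comparison checks the claim that mean imputation shrinks the variance of a feature.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy single-feature column with two missing entries (values are made up).
X = np.array([[1.0], [2.0], [2.0], [np.nan], [10.0], [np.nan]])

filled = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    # fit_transform learns the statistic and fills the gaps in one call.
    filled[strategy] = imputer.fit_transform(X).ravel()

# "constant" uses fill_value; 0.0 is an arbitrary choice for this demo.
constant_imputer = SimpleImputer(strategy="constant", fill_value=0.0)
filled["constant"] = constant_imputer.fit_transform(X).ravel()

# Variance of the observed values vs. variance after mean imputation.
observed_var = np.nanvar(X)
mean_filled_var = filled["mean"].var()

for name, values in filled.items():
    print(name, values)
print("variance before:", observed_var, "after mean fill:", mean_filled_var)
```

The filled entries sit exactly at the mean, so they add no spread while increasing the sample count, which is why the variance drops.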
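eg:
# Imports
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
# Read the data ("file_path" is a placeholder)
data = pd.read_csv("file_path")
# Check how many values are missing in each column
data.info()
# Instantiate the SimpleImputer objects imp_mean and imp_most
imp_mean = SimpleImputer(missing_values=np.nan, strategy="mean")  # mean imputation
# fit_transform returns a 2-D array; flatten it and store it in a new column
data["new_column1"] = imp_mean.fit_transform(data[["name of the numeric feature to fill"]]).ravel()
imp_most = SimpleImputer(missing_values=np.nan, strategy="most_frequent")  # mode imputation
# Store the mode-filled values in a new column
data["new_column2"] = imp_most.fit_transform(data[["name of the non-numeric feature to fill"]]).ravel()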
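
Since the example above reads from a placeholder path, here is a minimal runnable sketch of the same idea on a small in-memory DataFrame. The column names age and city and all values are hypothetical stand-ins for a real file.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data standing in for the CSV file.
data = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 40.0],
    "city": pd.Series(["Beijing", "Shanghai", np.nan, "Beijing"], dtype=object),
})

# Numeric feature: fill with the column mean.
imp_mean = SimpleImputer(missing_values=np.nan, strategy="mean")
data["age_filled"] = imp_mean.fit_transform(data[["age"]]).ravel()

# Non-numeric feature: fill with the most frequent value.
imp_most = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
data["city_filled"] = imp_most.fit_transform(data[["city"]]).ravel()

print(data)
```

The double brackets in data[["age"]] matter: SimpleImputer expects 2-D input, and the result is flattened with ravel() before it is stored as a single column.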

pandas object methods

fillna(
    value=None,
    method=None,
    axis=None,
    inplace=False,
    limit=None,
    downcast=None,
)
eg:
data.fillna(method="ffill")  # deprecated since pandas 2.1; data.ffill() is the equivalent call

Parameters

value : scalar, dict, Series, or DataFrame
    Value to use to fill holes (e.g. 0), alternately a
    dict/Series/DataFrame of values specifying which value to use for
    each index (for a Series) or column (for a DataFrame). Values not
    in the dict/Series/DataFrame will not be filled. This value cannot
    be a list.
method : {'backfill', 'bfill', 'pad', 'ffill', None}, default None
    Method to use for filling holes in reindexed Series
    pad / ffill: propagate last valid observation forward to next valid
    backfill / bfill: use next valid observation to fill gap.
axis : {0 or 'index', 1 or 'columns'}
    Axis along which to fill missing values.
inplace : bool, default False
    If True, fill in-place. Note: this will modify any
    other views on this object (e.g., a no-copy slice for a column in a
    DataFrame).
limit : int, default None
    If method is specified, this is the maximum number of consecutive
    NaN values to forward/backward fill. In other words, if there is
    a gap with more than this number of consecutive NaNs, it will only
    be partially filled. If method is not specified, this is the
    maximum number of entries along the entire axis where NaNs will be
    filled. Must be greater than 0 if not None.
downcast : dict, default is None
    A dict of item->dtype of what to downcast if possible,
    or the string 'infer' which will try to downcast to an appropriate
    equal type (e.g. float64 to int64 if possible).
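
To make the value and limit parameters concrete, the following sketch fills a small made-up DataFrame in several ways. It uses the ffill and bfill methods for directional filling, which do the same job as method="ffill" and method="bfill"; the column names and values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data with gaps of different lengths.
data = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [np.nan, 2.0, np.nan, np.nan],
})

# Scalar value: every NaN becomes 0.
zero_filled = data.fillna(0)

# Dict value: one fill value per column; columns not in the dict stay untouched.
per_column = data.fillna({"a": data["a"].mean()})

# Forward fill with limit=1: at most one consecutive NaN is filled per gap.
forward = data.ffill(limit=1)

# Backward fill: each NaN takes the next valid observation below it.
backward = data.bfill()

print(zero_filled, per_column, forward, backward, sep="\n\n")
```

Forward fill cannot fill a NaN at the top of a column and backward fill cannot fill one at the bottom, because there is no valid observation to propagate from.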

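2) Deletion

Deletion obtains a complete data subset by removing the data that contains missing values.
Deleting features: when a feature has many missing values and has little influence on the goal of the analysis, the feature can be dropped.
Deleting samples: drop the samples that contain missing values. This suits cases where the affected samples are missing several features each and make up only a small proportion of the dataset.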
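
One way to apply these two rules is to measure the share of missing values per feature first, then drop features and samples in that order. The sketch below is a rough illustration: the data is made up and the 0.5 cutoff (MAX_MISSING_RATIO) is a hypothetical threshold, not a recommended value.

```python
import numpy as np
import pandas as pd

# Made-up data: x2 is mostly missing, x3 has a single gap.
data = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [np.nan, np.nan, np.nan, 1.0, np.nan],
    "x3": [0.5, np.nan, 0.7, 0.8, 0.9],
})

# Share of missing values in each feature.
missing_ratio = data.isna().mean()

# Hypothetical cutoff: drop features with more than half of their values missing.
MAX_MISSING_RATIO = 0.5
cols_to_drop = missing_ratio[missing_ratio > MAX_MISSING_RATIO].index
reduced = data.drop(columns=cols_to_drop)

# Then drop the few remaining samples that still contain a missing value.
complete = reduced.dropna(axis=0, how="any")

print(missing_ratio)
print(complete)
```

Dropping the sparse feature first keeps more samples: calling dropna on the original data would have left only one row.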

pandas object methods
dropna(
    axis=0,
    how='any',
    thresh=None,
    subset=None,
    inplace=False
)

eg:
data.dropna(thresh=3)  # keep rows with at least 3 non-NA values (3 is an example value); how and thresh cannot be combined

how : {'any', 'all'}, default 'any'
    Determine if row or column is removed from DataFrame, when we have
    at least one NA or all NA.

    * 'any' : If any NA values are present, drop that row or column.
    * 'all' : If all values are NA, drop that row or column.

thresh : int, optional
    Require that many non-NA values.
subset : array-like, optional
    Labels along other axis to consider, e.g. if you are dropping rows
    these would be a list of columns to include.
inplace : bool, default False
    If True, do operation inplace and return None.
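
The effect of how, thresh, and subset is easiest to see on a small example. The following sketch uses a made-up DataFrame; the column names and the thresh value of 2 are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data: row 0 is complete, rows 1 and 3 have one gap, row 2 is empty.
data = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1.0, 2.0, np.nan, np.nan],
    "c": [1.0, 2.0, np.nan, 4.0],
})

# how="any": drop a row as soon as it has one NA.
any_dropped = data.dropna(how="any")

# how="all": drop a row only if every value in it is NA.
all_dropped = data.dropna(how="all")

# thresh=2: keep rows that have at least 2 non-NA values.
thresh_kept = data.dropna(thresh=2)

# subset=["a"]: only column "a" is checked for NA when dropping rows.
subset_kept = data.dropna(subset=["a"])

print(any_dropped, all_dropped, thresh_kept, subset_kept, sep="\n\n")
```

All four calls return a new DataFrame and leave data unchanged, because inplace defaults to False.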