Pandas Data Cleaning

宁缺100

License: CC 4.0 BY-SA

Category: Big Data. Tags: Pandas, data cleaning

Article link: https://blog.youkuaiyun.com/qq_24434491/article/details/95390382

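This post covers data cleaning with Pandas: handling missing values (for example with df.fillna, by dropping, replacing, or filling), handling duplicate and anomalous values, dropping specific columns, changing the index, cleaning columns with the .str accessor, and renaming columns.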

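Handling missing values
Duplicate values
Anomalous values
Dropping specific columns
Changing the index
Cleaning columns with the .str accessor
Renaming columns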

Handling missing values: df.fillna

Drop
Replace
Fill
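
The three strategies listed above can be sketched as follows. This is a minimal illustration on a made-up DataFrame, not data from the original post; treating -1.0 as a sentinel for "missing" is an assumption chosen only to show the replace step.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; -1.0 is assumed to be a sentinel for "missing".
df = pd.DataFrame({"A": [1.0, np.nan, -1.0], "B": ["x", None, "z"]})

# Drop: remove every row that contains at least one missing value.
dropped = df.dropna()

# Replace: turn the sentinel into a real missing value.
as_nan = df.replace(-1.0, np.nan)

# Fill: substitute a per-column default for missing values.
filled = df.fillna({"A": 0.0, "B": "unknown"})

print(dropped)
print(as_nan)
print(filled)
```

Which strategy fits depends on the data: dropping loses rows, while filling keeps them at the cost of introducing made-up values.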

DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)

Parameters:	
value : scalar, dict, Series, or DataFrame
Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). (values not in the dict/Series/DataFrame will not be filled). This value cannot be a list.

method : {‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None
Method to use for filling holes in reindexed Series.
pad / ffill: propagate last valid observation forward to next valid.
backfill / bfill: use NEXT valid observation to fill gap.

axis : {0 or ‘index’, 1 or ‘columns’}
inplace : boolean, default False
If True, fill in place. Note: this will modify any other views on this object, (e.g. a no-copy slice for a column in a DataFrame).

limit : int, default None
If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.

downcast : dict, default is None
a dict of item->dtype of what to downcast if possible, or the string ‘infer’ which will try to downcast to an appropriate equal type (e.g. float64 to int64 if possible)

Returns:	
filled : DataFrame

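method='pad' or 'ffill': the last valid value before a gap is copied forward into the missing entries. Recent pandas releases deprecate the method keyword in favor of df.ffill() and df.bfill().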

Specify a replacement value for each column:
 values = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
 df.fillna(value=values)
 
Fill only the first missing value in each column:
 df.fillna(value=values, limit=1)
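
To make the per-column fill and the limit argument concrete, here is a small runnable sketch. The DataFrame contents are made up for illustration, and df.ffill() is used for forward filling instead of the method keyword because newer pandas versions prefer it.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with gaps in every column.
df = pd.DataFrame(
    [[np.nan, 2, np.nan, 0],
     [3, 4, np.nan, 1],
     [np.nan, np.nan, np.nan, np.nan],
     [np.nan, 3, np.nan, 4]],
    columns=list("ABCD"),
)

values = {"A": 0, "B": 1, "C": 2, "D": 3}

# Fill every missing value with the default for its column.
filled = df.fillna(value=values)

# Fill at most one missing value per column.
filled_once = df.fillna(value=values, limit=1)

# Forward fill: copy the last valid value down into the gap.
forward = df.ffill()

print(filled)
print(filled_once)
print(forward)
```

Note that forward filling cannot fill a gap at the very top of a column, since there is no earlier value to copy.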

Duplicate values

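Dropping duplicate rows
DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)
subset : column label or sequence of labels, optional
Columns to consider when identifying duplicates; all columns by default.
keep : {‘first’, ‘last’, False}, default ‘first’
Which duplicate to keep: 'first' keeps the first occurrence, 'last' keeps the last, and False drops all duplicates.
inplace : boolean, default False
If True, modify the DataFrame in place and return None; if False, return a new DataFrame and leave the original unchanged.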

Drop rows that repeat a value in the 上级机构 column, keeping the first occurrence:
group.drop_duplicates(['上级机构'], keep='first', inplace=True)
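
The effect of the keep argument is easiest to see side by side. The sketch below uses a made-up stand-in for the group DataFrame; the column names come from the post, but the values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the post's `group` DataFrame.
group = pd.DataFrame({
    "机构编号": ["0301", "0302", "0303"],
    "上级机构": ["Bureau A", "Bureau A", "Bureau B"],
})

# Keep the first row for each 上级机构 value.
first = group.drop_duplicates(subset=["上级机构"], keep="first")

# Keep the last row for each 上级机构 value.
last = group.drop_duplicates(subset=["上级机构"], keep="last")

# Drop every row whose 上级机构 value occurs more than once.
unique_only = group.drop_duplicates(subset=["上级机构"], keep=False)

print(first)
print(last)
print(unique_only)
```

Without inplace=True, each call returns a new DataFrame and group itself is left unchanged.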

Anomalous values

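applymap(func) passes every value in the DataFrame through a custom function. In pandas 2.1 and later, DataFrame.applymap is deprecated in favor of DataFrame.map.
apply(func) can be called on a selected column to apply the function to that column's values only.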


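def turning(x):
    # Normalize one known variant spelling; leave every other value untouched.
    if isinstance(x, str) and x == '重庆市供电段':
        x = '重庆供电段'
    return x

# Apply the function to a single column.
group['机构名称'] = group['机构名称'].apply(turning)

# Apply the function to every cell of the DataFrame.
group = group.applymap(turning)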
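
As a self-contained sketch of the same idea, the example below normalizes one variant spelling in three ways: per column with apply, across the whole frame, and with a replacement dictionary. The sample rows are made up, and the whole-frame version uses apply with Series.map so that it does not depend on applymap, which newer pandas versions deprecate.

```python
import pandas as pd

def turning(x):
    # Normalize one known variant spelling; leave other values untouched.
    if isinstance(x, str) and x == "重庆市供电段":
        return "重庆供电段"
    return x

# Hypothetical stand-in for the post's `group` DataFrame.
group = pd.DataFrame({
    "机构名称": ["重庆市供电段", "成都供电段"],
    "机构编号": [301, 302],
})

# One column: Series.apply calls the function on each value.
by_column = group["机构名称"].apply(turning)

# Whole frame: map the function over each column in turn.
whole = group.apply(lambda col: col.map(turning))

# Alternative for simple value-to-value fixes: a replacement dictionary.
via_replace = group["机构名称"].replace({"重庆市供电段": "重庆供电段"})

print(by_column.tolist())
print(whole)
print(via_replace.tolist())
```

For a fixed list of known misspellings, the dictionary form is usually easier to maintain than a chain of if statements.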

Dropping specific columns

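some_drop = ['单位类型']
# inplace=True drops the columns from group itself;
# inplace=False returns a new object and leaves group unchanged, so keep the result.
group = group.drop(some_drop, axis=1, inplace=False)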
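
The difference between returning a new object and modifying the original can be checked directly. This sketch uses a made-up two-column DataFrame; only the column names come from the post.

```python
import pandas as pd

# Hypothetical stand-in for the post's `group` DataFrame.
group = pd.DataFrame({
    "机构编号": ["0301", "0302"],
    "单位类型": ["type-a", "type-b"],
})

some_drop = ["单位类型"]

# Default behavior: a new DataFrame is returned, `group` is unchanged.
trimmed = group.drop(columns=some_drop)

# errors="ignore" skips labels that are not present instead of raising.
same = trimmed.drop(columns=some_drop, errors="ignore")

print(list(group.columns))
print(list(trimmed.columns))
```

drop(columns=...) is equivalent to drop(..., axis=1) and tends to read more clearly.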

Changing the index

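# Use the 机构编号 column as the row index.
group.set_index('机构编号', inplace=True)
# Look up one row by its index label.
group.loc['0301']

# Collect every row, one lookup per index label.
[group.loc[m] for m in group.index]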
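
A short sketch of label-based lookup after setting the index, using made-up rows. It assumes 机构编号 is stored as a string; if the column were read as integers, the label '0301' would not match.

```python
import pandas as pd

# Hypothetical stand-in for the post's `group` DataFrame.
# 机构编号 is kept as a string so that the leading zero survives.
group = pd.DataFrame({
    "机构编号": ["0301", "0302"],
    "机构名称": ["Unit one", "Unit two"],
})

# set_index returns a new DataFrame unless inplace=True is passed.
indexed = group.set_index("机构编号")

# .loc selects by index label; a unique label yields one row as a Series.
row = indexed.loc["0301"]

# reset_index moves the index back into an ordinary column.
restored = indexed.reset_index()

print(row["机构名称"])
print(list(restored.columns))
```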

Cleaning columns with the .str accessor

# Boolean Series: True where 单位类型 contains the substring '接触网'.
jiechuwang_type = group['单位类型'].str.contains('接触网')
# depttype.str.contains('接触网')
print(jiechuwang_type)
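
The boolean Series returned by str.contains is typically used as a row filter. The sketch below shows that on made-up values, with a strip step first and na=False so that missing entries count as non-matches instead of propagating as missing.

```python
import pandas as pd

# Hypothetical sample values, including stray whitespace and a missing entry.
group = pd.DataFrame({"单位类型": [" 接触网工区", "变电所", None]})

# Strip leading and trailing whitespace from every string value.
cleaned = group["单位类型"].str.strip()

# na=False makes missing values evaluate to False in the mask.
mask = cleaned.str.contains("接触网", na=False)

# Keep only the rows where the mask is True.
subset = group[mask]

print(mask.tolist())
print(subset)
```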

Renaming columns

newname = {'机构编号': 'num', '机构名称': 'name', '上级机构': 'prename', '单位类型': 'type'}
# With inplace=False, rename returns a new DataFrame, so keep the result.
group = group.rename(columns=newname, inplace=False)
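
A runnable sketch of the rename step, using an empty DataFrame that has only the post's column names. It also shows that labels missing from the mapping are simply left as they are.

```python
import pandas as pd

newname = {"机构编号": "num", "机构名称": "name", "上级机构": "prename", "单位类型": "type"}

# Hypothetical empty frame with the original column names plus one extra.
group = pd.DataFrame(columns=list(newname) + ["extra"])

# rename returns a new DataFrame; columns not in the mapping are unchanged.
renamed = group.rename(columns=newname)

print(list(group.columns))
print(list(renamed.columns))
```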