Pandas Quick Tutorial: Handling Missing Values

Latest recommended article published on 2025-03-17 21:40:13

Lipgrant_python

Category: Pandas

Article link: https://blog.youkuaiyun.com/weixin_41677555/article/details/82876253

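Missing values can arise for many reasons. In pandas, NaN is used to represent a missing value.

This article covers how pandas handles missing values from several angles: detection, filling, dropping, insertion, and replacement.

1. Detecting missing values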

df2
Out[7]: 
   first  secend  third  fourth
a    6.0     8.0    3.0     6.0
b    2.0     5.0    9.0     8.0
c    4.0     4.0    1.0     0.0
d    6.0     2.0    2.0     6.0
e    2.0     0.0    8.0     9.0
f    NaN     NaN    NaN     NaN

For both Series and DataFrame objects, pandas provides two methods, isna() and notna(), to detect missing values.

pd.isna(df2)
Out[8]: 
   first  secend  third  fourth
a  False   False  False   False
b  False   False  False   False
c  False   False  False   False
d  False   False  False   False
e  False   False  False   False
f   True    True   True    True

pd.notna(df2)
Out[9]: 
   first  secend  third  fourth
a   True    True   True    True
b   True    True   True    True
c   True    True   True    True
d   True    True   True    True
e   True    True   True    True
f  False   False  False   False

df2['first'].isna()
Out[12]: 
a    False
b    False
c    False
d    False
e    False
f     True
Name: first, dtype: bool
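
The boolean masks above are usually reduced to counts or used as filters. The following is a minimal sketch of that idea, not the author's code: the frame `df` and its values are illustrative stand-ins for df2, with only two columns.

```python
import numpy as np
import pandas as pd

# Illustrative data: five complete rows plus one all-NaN row, similar to df2.
df = pd.DataFrame(
    {
        "first": [6, 2, 4, 6, 2, np.nan],
        "second": [8, 5, 4, 2, 0, np.nan],
    },
    index=list("abcdef"),
)

missing_per_column = df.isna().sum()        # number of NaN values in each column
rows_with_missing = df.isna().any(axis=1)   # True for rows that contain any NaN
complete_rows = df[df.notna().all(axis=1)]  # keep only fully populated rows

print(missing_per_column)
print(rows_with_missing)
print(complete_rows)
```

Summing a boolean mask works because True counts as 1, so `isna().sum()` gives a quick per-column overview before deciding how to treat the gaps.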

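One point deserves special attention: in Python, None == None evaluates to True, but np.nan == np.nan evaluates to False. NaN follows IEEE 754 floating-point semantics and never compares equal to anything, itself included.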

None==None
Out[13]: True

np.nan==np.nan
Out[14]: False
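
A short sketch of how this difference plays out at the scalar level, and how pandas treats both None and NaN as missing. The Series `s` is an illustrative example, not part of the original session.

```python
import math

import numpy as np
import pandas as pd

print(None == None)        # True: None is a singleton object
print(np.nan == np.nan)    # False: NaN never compares equal, even to itself
print(math.isnan(np.nan))  # True: the reliable scalar test for a float NaN

# pd.isna() recognises both None and NaN as missing.
print(pd.isna(None), pd.isna(np.nan))

# In a float Series, None is stored as NaN.
s = pd.Series([1.0, None])
print(s.dtype)
print(s.isna().tolist())
```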

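Because of this behavior, testing for missing values with == gives misleading results: a comparison against NaN always returns False, even where the value really is missing. For example: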

df2.loc['f'] == np.nan
Out[15]: 
first     False
secend    False
third     False
fourth    False
Name: f, dtype: bool
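
Since `== np.nan` never matches, the test has to go through `isna()` instead. A small sketch, using an illustrative Series `row` rather than the author's df2:

```python
import numpy as np
import pandas as pd

row = pd.Series([np.nan, np.nan, 1.0], index=["first", "second", "third"])

wrong = row == np.nan    # element-wise comparison: False everywhere, even for NaN
right = row.isna()       # True exactly where the value is missing
also_right = row != row  # relies on NaN being unequal to itself

print(wrong.tolist())
print(right.tolist())
print(also_right.tolist())
```

The `row != row` form works but is easy to misread, so `isna()` is generally the clearer choice.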

In time-series (datetime) data, a missing value is represented by NaT (not a time) rather than NaN.

df2['time']=pd.Timestamp('20180930')

df2
Out[17]: 
   first  secend  third  fourth       time
a    6.0     8.0    3.0     6.0 2018-09-30
b    2.0     5.0    9.0     8.0 2018-09-30
c    4.0     4.0    1.0     0.0 2018-09-30
d    6.0     2.0    2.0     6.0 2018-09-30
e    2.0     0.0    8.0     9.0 2018-09-30
f    NaN     NaN    NaN     NaN 2018-09-30
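
To illustrate NaT on its own, here is a minimal sketch that builds a datetime Series with one missing entry. The Series `times` and its dates are assumptions made for the example and are not taken from df2.

```python
import pandas as pd

# A None entry in datetime input becomes NaT.
times = pd.Series(pd.to_datetime(["2018-09-30", None, "2018-10-02"]))

print(times)
print(times.isna().tolist())     # isna() detects NaT just as it detects NaN
print(times.iloc[1] is pd.NaT)   # the missing entry is the NaT singleton
print(pd.NaT == pd.NaT)          # like NaN, NaT does not compare equal to itself
```

As with NaN, equality comparisons against NaT are unreliable, so `isna()` and `notna()` remain the way to test datetime columns.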

df2.loc[['d','e','f'],'time'] …
