pandas team study: task07

Article link: https://blog.youkuaiyun.com/qq_41834327/article/details/112160169

import numpy as np
import pandas as pd

path = r'C:\Users\yongx\Desktop\data'

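Counting and dropping missing values

Counting missing data

Use isna or isnull to check whether each cell is missing; combining the result with mean gives the proportion of missing values in each column.
To inspect the missing values of a single column or row, call isna or notna on the Series and use the result as a boolean index.
To find rows in which several columns are all missing, or in which at least one of them is missing, combine isna and notna with all and any.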

df = pd.read_csv(path + '\\learn_pandas.csv',
                usecols = ['Grade','Name','Gender','Height','Weight','Transfer'])
df.isna().head(3)

	Grade	Name	Gender	Height	Weight	Transfer
0	False	False	False	False	False	False
1	False	False	False	False	False	False
2	False	False	False	False	False	False

df.isna().mean()

Grade       0.000
Name        0.000
Gender      0.000
Height      0.085
Weight      0.055
Transfer    0.060
dtype: float64

df[df.Height.isna()].head(3)

	Grade	Name	Gender	Height	Weight	Transfer
3	Sophomore	Xiaojuan Sun	Female	NaN	41.0	N
12	Senior	Peng You	Female	NaN	48.0	NaN
26	Junior	Yanli You	Female	NaN	48.0	N

sub_set = df[['Height', 'Weight', 'Transfer']]
df[sub_set.isna().all(axis=1)].head(3)

	Grade	Name	Gender	Height	Weight	Transfer
102	Junior	Chengli Zhao	Male	NaN	NaN	NaN

df[sub_set.isna().any(axis=1)].head(3)

	Grade	Name	Gender	Height	Weight	Transfer
3	Sophomore	Xiaojuan Sun	Female	NaN	41.0	N
9	Junior	Juan Xu	Female	164.8	NaN	N
12	Senior	Peng You	Female	NaN	48.0	NaN

df[sub_set.notna().all(axis=1)].head()

	Grade	Name	Gender	Height	Weight	Transfer
0	Freshman	Gaopeng Yang	Female	158.9	46.0	N
1	Freshman	Changqiang You	Male	166.5	70.0	N
2	Senior	Mei Sun	Male	188.9	89.0	N
4	Sophomore	Gaojuan You	Male	174.0	74.0	N
5	Freshman	Xiaoli Qian	Female	158.0	51.0	N
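Since learn_pandas.csv is not included here, the counting patterns above can be tried on a small made-up table. This is only a sketch: the frame `toy` and all of its values are assumptions, not the course data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for learn_pandas.csv (all values are made up)
toy = pd.DataFrame({
    'Height':   [158.9, np.nan, 188.9, np.nan],
    'Weight':   [46.0, 41.0, np.nan, np.nan],
    'Transfer': ['N', 'N', np.nan, np.nan],
})

# isna() gives a boolean frame; the mean of booleans is the missing ratio
missing_ratio = toy.isna().mean()

# Rows where every column is missing vs. rows where at least one is missing
all_missing = toy[toy.isna().all(axis=1)]
any_missing = toy[toy.isna().any(axis=1)]

# Rows with no missing value at all
complete = toy[toy.notna().all(axis=1)]

print(missing_ratio)
print(all_missing.index.tolist(), any_missing.index.tolist(), complete.index.tolist())
```

Passing axis=1 by keyword makes all and any work row by row, which is what selects rows rather than columns.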

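Dropping missing data

Use dropna to drop missing values. Its main parameters are:

axis: the axis direction; the default 0 drops rows.
how: how to drop; either 'all' or 'any'.
thresh: the minimum number of non-missing values; a row or column (along the chosen axis) with fewer non-missing values than this is dropped.
subset: the subset of labels to consider when dropping.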

res = df.dropna(how = 'any', subset = ['Height', 'Weight'])
res.shape

(174, 6)

res = df.dropna(axis = 1, thresh = df.shape[0]-15)
res.head(3)

	Grade	Name	Gender	Weight	Transfer
0	Freshman	Gaopeng Yang	Female	46.0	N
1	Freshman	Changqiang You	Male	70.0	N
2	Senior	Mei Sun	Male	89.0	N

# The same results using boolean indexing
res = df.loc[df[['Height', 'Weight']].notna().all(axis=1)]
res.shape
res = df.loc[:, ~(df.isna().sum() > 15)]
res.head(3)

	Grade	Name	Gender	Weight	Transfer
0	Freshman	Gaopeng Yang	Female	46.0	N
1	Freshman	Changqiang You	Male	70.0	N
2	Senior	Mei Sun	Male	89.0	N
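The correspondence between dropna and boolean indexing can also be checked on a tiny example. The sketch below uses a hypothetical frame `toy` (made-up values) and a threshold of 3 chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column 'b' has 3 missing values, column 'c' has 1
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [np.nan, np.nan, np.nan, 1.0],
    'c': [1.0, np.nan, 3.0, 4.0],
})

# Drop rows that are missing in any of the listed columns
rows_kept = toy.dropna(how='any', subset=['a', 'c'])

# Drop columns that have fewer than 3 non-missing values
cols_kept = toy.dropna(axis=1, thresh=3)

# Boolean-indexing equivalents of the two calls above
rows_bool = toy.loc[toy[['a', 'c']].notna().all(axis=1)]
cols_bool = toy.loc[:, toy.notna().sum() >= 3]

print(rows_kept.index.tolist(), cols_kept.columns.tolist())
print(rows_kept.equals(rows_bool), cols_kept.equals(cols_bool))
```

Note that thresh counts non-missing values, so "at most 15 missing" on a frame with n rows becomes thresh = n - 15.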

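Filling and interpolating missing values

Filling with the `fillna` method

Commonly used fillna parameters:

1 value: the fill value; either a scalar or a dict mapping index labels to fill values
2 method: the fill method; ffill fills with the preceding element and bfill fills with the following element
3 limit: the maximum number of consecutive missing values to fill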

s = pd.Series([np.nan, 1, np.nan, np.nan, 2, np.nan], list('aaabcd'))
s

a    NaN
a    1.0
a    NaN
b    NaN
c    2.0
d    NaN
dtype: float64

s.fillna(method = 'ffill')

a    NaN
a    1.0
a    1.0
b    1.0
c    2.0
d    2.0
dtype: float64

s.fillna(method = 'ffill', limit = 1)

a    NaN
a    1.0
a    1.0
b    NaN
c    2.0
d    2.0
dtype: float64

s.fillna(s.mean())

a    1.5
a    1.0
a    1.5
b    1.5
c    2.0
d    1.5
dtype: float64

s.fillna({'a':100, 'd':200})

a    100.0
a      1.0
a    100.0
b      NaN
c      2.0
d    200.0
dtype: float64

df.groupby('Grade')['Height'].transform(lambda x : x.fillna(x.mean())).head()

0    158.900000
1    166.500000
2    188.900000
3    163.075862
4    174.000000
Name: Height, dtype: float64
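The group-wise fill is easier to follow on a few rows where the group means can be worked out by hand. The sketch below assumes a hypothetical frame `toy` with made-up heights.

```python
import numpy as np
import pandas as pd

# Hypothetical grades and heights (values are made up)
toy = pd.DataFrame({
    'Grade':  ['Freshman', 'Freshman', 'Freshman', 'Senior', 'Senior'],
    'Height': [160.0, np.nan, 170.0, 180.0, np.nan],
})

# Within each Grade, replace missing heights with that Grade's mean height.
# x.mean() skips NaN, so the mean is taken over the observed values only.
filled = toy.groupby('Grade')['Height'].transform(lambda x: x.fillna(x.mean()))

print(filled.tolist())  # [160.0, 165.0, 170.0, 180.0, 180.0]
```

transform returns a result aligned with the original index, so it can be assigned straight back to a column.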

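# Exercise
# Fill the missing values of a sequence by the following rule: a missing value
# that occurs on its own is filled with the mean of its two neighbors, while
# consecutive missing values are left unfilled,
# i.e. the sequence [1, NaN, 3, NaN, NaN] becomes [1, 2, 3, NaN, NaN] after filling.
# Implement this with fillna. (Hint: use the limit parameter.)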
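One possible answer to the exercise, offered as a sketch rather than the reference solution: fill one step forward and one step backward, then average the two results. It uses s.ffill(limit=1) and s.bfill(limit=1), which perform the same operation as fillna with method='ffill' or method='bfill' and limit=1.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, np.nan, 3, np.nan, np.nan])

# Fill at most one position forward and at most one position backward
forward = s.ffill(limit=1)
backward = s.bfill(limit=1)

# A lone NaN is reached from both sides, so the average is the neighbor mean.
# Inside a longer run, each position is still NaN on at least one side, and
# NaN + number is NaN, so the run stays unfilled.
res = (forward + backward) / 2

print(res.tolist())
```

Observed values are unchanged because both fills leave them as they are. A single NaN at the very start or end also stays missing, since it has a neighbor on one side only.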

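Interpolation functions

Interpolation falls roughly into three cases: linear interpolation, nearest-neighbor interpolation, and index interpolation.
Commonly used interpolate parameters:

1 method: the interpolation method; linear by default
2 limit_direction: the direction in which interpolation is limited; the default forward is analogous to method='ffill' in fillna
3 limit: the maximum number of consecutive missing values to interpolate

Note:

When interpolate uses the polynomial method, it internally calls scipy.interpolate.interp1d(*,*,kind=order), which in turn calls make_interp_spline; it is therefore spline interpolation, not polynomial-fit interpolation in the sense of numpy's polyfit.
When the spline method is used, the function called internally is scipy.interpolate.UnivariateSpline, which is not ordinary spline interpolation.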

s = pd.Series([np.nan, np.nan, 1, np.nan, np.nan, np.nan, 2, np.nan,np.nan])
s.values

array([nan, nan,  1., nan, nan, nan,  2., nan, nan])

# Backward linear interpolation
res = s.interpolate(limit_direction = 'backward', limit = 1)
res.values

array([ nan, 1.  , 1.  ,  nan,  nan, 1.75, 2.  ,  nan,  nan])

# Linear interpolation in both directions
res = s.interpolate(limit_direction = 'both', limit = 1)
res.values

array([ nan, 1.  , 1.  , 1.25,  nan, 1.75, 2.  , 2.  ,  nan])
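To see where values such as 1.25 and 1.75 come from, it helps to compare the limited result with the unlimited one. The sketch below is an illustration added here, not part of the original notebook.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan, 1, np.nan, np.nan, np.nan, 2, np.nan, np.nan])

# Without limit, every gap is filled in both directions: interior gaps are
# interpolated linearly (1.25, 1.5, 1.75), while leading and trailing gaps
# take the nearest valid value
full = s.interpolate(limit_direction='both')
print(full.tolist())

# With limit=1, only one position on each side of a valid value is kept
limited = s.interpolate(limit_direction='both', limit=1)
print(limited.tolist())
```

In other words, limit does not change the interpolated numbers; it only decides how far from a valid value they are written.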

# Nearest-neighbor interpolation
s.interpolate('nearest').values

array([nan, nan,  1.,  1.,  1.,  2.,  2., nan, nan])

# Index interpolation: linear interpolation based on the index values
s = pd.Series([0, np.nan, 10], index = [0,1,10])
s

0      0.0
1      NaN
10    10.0
dtype: float64

s.interpolate()

0      0.0
1      5.0
10    10.0
dtype: float64

s.interpolate(method = 'index')

0      0.0
1      1.0
10    10.0
dtype: float64

# Index interpolation is commonly used to fill data indexed by timestamps
s = pd.Series([0, np.nan, 10], index = pd.to_datetime(['20200101',
                                                      '20200102',
                                                      '20200111']))
s

2020-01-01     0.0
2020-01-02     NaN
2020-01-11    10.0
dtype: float64

s.interpolate(method = 'index')

2020-01-01     0.0
2020-01-02     1.0
2020-01-11    10.0
dtype: float64

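Nullable types

Missing-value markers and their shortcomings

In Python, None represents a missing value; it is equal to itself and to nothing else.
In numpy, np.nan represents a missing value; it is not equal to anything, including itself.
When the elements of a Series or DataFrame containing missing values are compared, positions holding np.nan return False. When equals is used to test whether two tables or two Series are identical, however, positions where both sides are missing are skipped, so the check can still return True.
In time-series objects, pandas uses pd.NaT for missing values; it plays the same role as np.nan. The reason lies in pandas' object dtype, a mixed type: when elements of several types are stored in one Series, its dtype becomes object. np.nan is itself a float, so storing it alongside datetime values without a dedicated built-in missing marker would turn the Series into object and cause various problems. For the same reason, a missing value in an integer Series converts its dtype to float64, and a missing value in a boolean Series converts its dtype to object (a mix of float and bool).
To address these shortcomings, pandas introduced a new missing marker, pd.NA, together with three Nullable Series types: Int, boolean, and string.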

print(None==None)
print(None==False)
print(None==[])
print(None=='')
print('-'*20)
print(np.nan==np.nan)
print(np.nan==None)
print(np.nan==False)
print(np.nan==[])
print(np.nan=='')

True
False
False
False
--------------------
False
False
False
False
False

s1 = pd.Series([1, np.nan])
s2 = pd.Series([1,2])
s3 = pd.Series([1,np.nan])
print(s1 == 1)
print('_'*20)
print(s1.equals(s2))
print(s1.equals(s3))
print('-'*20)
print(pd.to_timedelta(['30s',np.nan]))
print(pd.to_datetime(['20200101',np.nan]))
print('-'*20)
print(pd.Series([1,'two']))
print('-'*20)
print(type(np.nan))

0     True
1    False
dtype: bool
____________________
False
True
--------------------
TimedeltaIndex(['0 days 00:00:30', NaT], dtype='timedelta64[ns]', freq=None)
DatetimeIndex(['2020-01-01', 'NaT'], dtype='datetime64[ns]', freq=None)
--------------------
0      1
1    two
dtype: object
--------------------
<class 'float'>

print(pd.Series([1,np.nan]).dtype)
print(pd.Series([True, False, np.nan]).dtype)

float64
object
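Since the markers behave so differently under ==, a uniform test is useful. The sketch below uses pd.isna, which is not used in the original notebook at this point, to check all four markers mentioned above.

```python
import numpy as np
import pandas as pd

# The four markers discussed above; pd.isna treats all of them as missing
markers = [None, np.nan, pd.NaT, pd.NA]
flags = [bool(pd.isna(m)) for m in markers]
print(flags)

# Equality, by contrast, is not a reliable test for missingness
nan_equals_itself = (np.nan == np.nan)
print(nan_equals_itself)
```

This is why the earlier examples use isna and notna rather than comparisons when counting missing values.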

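Properties of Nullable types

A Nullable type is a Series type whose dtype is not affected by missing values. In the three Nullable types Int, boolean, and string, any stored missing value is converted to pandas' built-in pd.NA.
For an Int Series, the results of operations stay Nullable wherever possible.
A boolean Series differs from a bool Series mainly in two ways:

A bool list containing missing values cannot be used for selection by an indexer, whereas a boolean Series can: its missing values are treated as False.
In logical operations, bool returns False at missing positions, whereas boolean returns a definite value whenever the operation determines a unique result. For example, True | pd.NA is True whatever the missing value is; False | pd.NA changes with the missing value, so it can only return pd.NA; and False & pd.NA is necessarily False whatever the missing value is.

In data processing, data is usually converted to Nullable types with convert_dtypes right after it is read in.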
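A small sketch of that conversion step, assuming a hypothetical frame `raw` whose columns show the dtype problems described earlier (column names and values are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical raw data as it might look right after loading
raw = pd.DataFrame({
    'count': [1.0, np.nan, 3.0],     # integers stored as float because of NaN
    'flag':  [True, None, False],    # booleans stored as object because of None
})
print(raw.dtypes)

# convert_dtypes picks the best Nullable dtype for each column
converted = raw.convert_dtypes()
print(converted.dtypes)
```

After the conversion the missing entries are pd.NA, and the columns keep their integer and boolean meaning.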

print(pd.Series([np.nan,1],dtype='Int64'))
print('-'*20)
print(pd.Series([np.nan, True],dtype = 'boolean'))
print('-'*20)
print(pd.Series([np.nan, 'str'],dtype='string'))

0    <NA>
1       1
dtype: Int64
--------------------
0    <NA>
1    True
dtype: boolean
--------------------
0    <NA>
1     str
dtype: string

print(pd.Series([np.nan,0],dtype='Int64')+1)
print('-'*20)
print(pd.Series([np.nan,0],dtype='Int64')==0)
print('-'*20)
print(pd.Series([np.nan,0],dtype='Int64')*0.5)

0    <NA>
1       1
dtype: Int64
--------------------
0    <NA>
1    True
dtype: boolean
--------------------
0    NaN
1    0.0
dtype: float64

s = pd.Series(['a','b'])
s_bool = pd.Series([True, np.nan])
s_boolean = pd.Series([True, np.nan],dtype = 'boolean')

s[s_boolean]

0    a
dtype: object

s[s_bool]  # raises an error because missing values cannot be used by the indexer

---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-87-bcb43881dd42> in <module>()
----> 1 s[s_bool]  # raises an error because missing values cannot be used by the indexer


~\Miniconda3\envs\TF_2C\lib\site-packages\pandas\core\series.py in __getitem__(self, key)
    899             key = list(key)
    900 
--> 901         if com.is_bool_indexer(key):
    902             key = check_bool_indexer(self.index, key)
    903             key = np.asarray(key, dtype=bool)


~\Miniconda3\envs\TF_2C\lib\site-packages\pandas\core\common.py in is_bool_indexer(key)
    132                 na_msg = "Cannot mask with non-boolean array containing NA / NaN values"
    133                 if isna(key).any():
--> 134                     raise ValueError(na_msg)
    135                 return False
    136             return True


ValueError: Cannot mask with non-boolean array containing NA / NaN values
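One way around this error, sketched here as a suggestion rather than the author's method, is to convert the object-dtype mask to the Nullable boolean dtype before indexing.

```python
import numpy as np
import pandas as pd

s = pd.Series(['a', 'b'])
s_bool = pd.Series([True, np.nan])    # object dtype because of the NaN

# Convert the mask to the Nullable boolean dtype; NaN becomes pd.NA,
# which the indexer treats as False
mask = s_bool.astype('boolean')
selected = s[mask]

print(selected.tolist())  # ['a']
```

If missing entries should count as selected instead, the mask could be filled with True before indexing.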

print(s_boolean&True)
print('-'*20)
print(s_boolean|True)
print('-'*20)
print(~s_boolean)

0    True
1    <NA>
dtype: boolean
--------------------
0    True
1    True
dtype: boolean
--------------------
0    False
1     <NA>
dtype: boolean
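The rule "return a definite value only when the missing value cannot change the result" can be checked directly on scalars and on a boolean Series. A minimal sketch:

```python
import pandas as pd

# Scalar checks of the three-valued logic described above
print(True | pd.NA)     # True: the result does not depend on the missing value
print(False | pd.NA)    # <NA>: the result depends on the missing value
print(False & pd.NA)    # False: the result does not depend on the missing value
print(True & pd.NA)     # <NA>: the result depends on the missing value

# The same rules applied element-wise to a boolean Series
b = pd.Series([True, False, pd.NA], dtype='boolean')
or_na = b | pd.NA
and_na = b & pd.NA
print(or_na.tolist(), and_na.tolist())
```

Only True | NA and False & NA are decided; every other combination with pd.NA stays missing.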

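Computing and grouping with missing data

When sum and prod compute sums and products, missing values are treated as 0 and 1 respectively, so that they do not change the result.
Cumulative functions automatically skip the positions of missing values.
In scalar operations, every result is missing except np.nan ** 0 and 1 ** np.nan, which have definite values; the same holds for pd.NA. Comparisons such as == and > involving np.nan always return False, whereas with pd.NA they return pd.NA.
diff and pct_change are similar in function but handle missing values differently: diff sets every result whose computation involves a missing value to missing, whereas pct_change treats a missing position as a 0% change.
groupby and get_dummies can add a category for missing values by setting the corresponding parameter.

# With sum and prod, missing values are treated as 0 and 1 by default, so the result is unchanged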
s = pd.Series([2,3,np.nan,4,5])
print(s.sum()) 
print(s.prod())

14.0
120.0

print(s.cumsum())  # cumulative functions skip missing values

0     2.0
1     5.0
2     NaN
3     9.0
4    14.0
dtype: float64
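The "treated as 0 or 1" behavior is the default and can be switched off. The sketch below uses the skipna and min_count parameters, which the original does not discuss, to show both sides.

```python
import numpy as np
import pandas as pd

s = pd.Series([2, 3, np.nan, 4, 5])

# Default: NaN is skipped, which matches filling it with 0 (sum) or 1 (prod)
print(s.sum(), s.fillna(0).sum())
print(s.prod(), s.fillna(1).prod())

# skipna=False propagates the missing value instead
strict_sum = s.sum(skipna=False)
print(strict_sum)

# An all-missing Series sums to 0.0 unless min_count requires at least one value
empty = pd.Series([np.nan, np.nan])
print(empty.sum(), empty.sum(min_count=1))
```

min_count is worth knowing about when a sum of 0 would be indistinguishable from "no data at all".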

print(np.nan==0)
print(pd.NA==0)
print('-'*20)
print(np.nan>0)
print(pd.NA>0)

False
<NA>
--------------------
False
<NA>

print(np.nan+1)
print(np.log(np.nan))
print(np.add(np.nan,1))

nan
nan
nan

print(np.nan**0)
print(pd.NA**0)
print('-'*20)
print(1**np.nan)
print(1**pd.NA)

1.0
1
--------------------
1.0
1

s.diff()  # results whose computation involves a missing value become missing

0    NaN
1    1.0
2    NaN
3    NaN
4    1.0
dtype: float64

s.pct_change()  # a missing value is treated as a 0% change

0         NaN
1    0.500000
2    0.000000
3    0.333333
4    0.250000
dtype: float64
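Both outputs above can be reproduced by hand with shift, which makes the difference in missing-value handling explicit. This is a sketch of the arithmetic only; the default behavior of pct_change itself may differ across pandas versions, so the forward fill is written out here.

```python
import numpy as np
import pandas as pd

s = pd.Series([2, 3, np.nan, 4, 5])

# diff: current minus previous; any NaN operand makes the result NaN
manual_diff = s - s.shift(1)

# pct_change as in the output above: forward-fill first, then
# current / previous - 1, so the missing position repeats the previous
# value and yields a change of 0.0
filled = s.ffill()
manual_pct = filled / filled.shift(1) - 1

print(manual_diff.tolist())
print(manual_pct.round(6).tolist())
```

The 0.333333 at position 3 is (4 - 3) / 3: the change is measured against the last observed value, not against the missing one.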

df_nan = pd.DataFrame({'category':['a','a','b',np.nan,np.nan],
                      'value':[1,3,5,7,9]})
df_nan

	category	value
0	a	1
1	a	3
2	b	5
3	NaN	7
4	NaN	9

df_nan.groupby('category', dropna=False)['value'].mean()

category
a      2
b      5
NaN    8
Name: value, dtype: int64

pd.get_dummies(df_nan.category, dummy_na = True)

	a	b	NaN
0	1	0	0
1	1	0	0
2	0	1	0
3	0	0	1
4	0	0	1
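For comparison, the sketch below runs the same two calls with and without the missing-category option, using the df_nan frame from above. It uses sum instead of mean only to keep the expected numbers simple.

```python
import numpy as np
import pandas as pd

df_nan = pd.DataFrame({'category': ['a', 'a', 'b', np.nan, np.nan],
                       'value': [1, 3, 5, 7, 9]})

# By default, rows whose key is missing are left out of the groups
default_groups = df_nan.groupby('category')['value'].sum()

# dropna=False keeps them as a separate NaN group
with_nan = df_nan.groupby('category', dropna=False)['value'].sum()

# get_dummies behaves analogously: dummy_na=True adds a NaN indicator column
default_dummies = pd.get_dummies(df_nan.category)
with_nan_dummies = pd.get_dummies(df_nan.category, dummy_na=True)

print(len(default_groups), len(with_nan))
print(default_dummies.shape[1], with_nan_dummies.shape[1])
```

Without these options the two rows with a missing category silently disappear from the grouped result.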
