Function application in pandas
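Table-level function application: pipe()
DataFrames and Series can be passed to functions directly. If the function needs to be called as part of a method chain, however, consider using the pipe() method.
Chaining with pipe() reads more naturally than nesting calls.
Suppose f, g and h are functions that each take a DataFrame and return a DataFrame. The usual nested form is: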
f(g(h(df),arg1=1),arg2=2,arg3=3)
The call above is equivalent to:
(df.pipe(h)
.pipe(g,arg1=1)
.pipe(f,arg2=2,arg3=3))
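To make the equivalence concrete, here is a minimal sketch. The three helper functions (drop_missing, scale, add_total) and the sample data are hypothetical and exist only for this illustration; any functions that take a DataFrame and return a DataFrame would work the same way.

```python
import pandas as pd

# Hypothetical helpers: each takes a DataFrame and returns a DataFrame.
def drop_missing(df):
    """Remove rows that contain any missing value."""
    return df.dropna()

def scale(df, factor=1):
    """Multiply every value by `factor`."""
    return df * factor

def add_total(df, name="total"):
    """Append a column holding the row-wise sum."""
    return df.assign(**{name: df.sum(axis=1)})

raw = pd.DataFrame({"A": [1.0, None, 3.0], "B": [10.0, 20.0, 30.0]})

# Nested form: read from the inside out.
nested = add_total(scale(drop_missing(raw), factor=2), name="total")

# Chained form: read from top to bottom.
piped = (raw.pipe(drop_missing)
            .pipe(scale, factor=2)
            .pipe(add_total, name="total"))

print(piped)
print(nested.equals(piped))
```

Both forms run the same three steps in the same order; pipe() only changes how the code reads.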
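Row- or column-wise function application
The apply() method applies an arbitrary function along an axis of the DataFrame. Its optional axis parameter selects the axis: axis=0 (the default) passes each column to the function, and axis=1 passes each row.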
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(8,3),columns=['A','B','C'])
df.apply(np.mean)
A -0.266390
B -0.177195
C 0.390748
dtype: float64
df.apply(np.mean,axis=1)
0 0.712230
1 -0.523604
2 0.048431
3 -0.975128
4 0.179757
5 0.085986
6 -0.195474
7 0.526903
dtype: float64
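Because the axis convention is easy to mix up, a small deterministic example may help. The DataFrame named small is made up for this illustration only.

```python
import pandas as pd

small = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

# axis=0 (default): the function receives one column at a time,
# so the result is indexed by column name.
per_column = small.apply(lambda s: s.sum())

# axis=1: the function receives one row at a time,
# so the result is indexed by the original row labels.
per_row = small.apply(lambda s: s.sum(), axis=1)

print(per_column)
print(per_row)
```

The result index tells you which way the function ran: column names for axis=0, row labels for axis=1.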
df.apply(lambda x: x.max()-x.min())
A 4.005106
B 2.388907
C 3.137040
dtype: float64
df.apply(np.cumsum)
| | A | B | C |
---|---|---|---|
0 | 1.898605 | -0.246092 | 0.484175 |
1 | 1.871383 | -1.434518 | 0.129012 |
2 | 2.374783 | -1.421744 | -0.241866 |
3 | 1.370274 | -1.891783 | -1.692704 |
4 | -0.336101 | -0.691302 | -0.647540 |
5 | -0.643423 | -1.599446 | 0.825884 |
6 | -2.749925 | -1.765568 | 2.512086 |
7 | -2.131123 | -1.417560 | 3.125986 |
df.apply(np.exp)
| | A | B | C |
---|---|---|---|
0 | 6.676574 | 0.781851 | 1.622836 |
1 | 0.973146 | 0.304700 | 0.701059 |
2 | 1.654335 | 1.012855 | 0.690128 |
3 | 0.366224 | 0.624978 | 0.234374 |
4 | 0.181523 | 3.321714 | 2.843866 |
5 | 0.735414 | 0.403272 | 4.364154 |
6 | 0.121663 | 0.846942 | 5.398935 |
7 | 1.856701 | 1.416244 | 1.847622 |
apply() also accepts the name of a method as a string:
df.apply('max')
A 1.898605
B 1.200481
C 1.686202
dtype: float64
df.apply('mean',axis=1)
0 0.712230
1 -0.523604
2 0.048431
3 -0.975128
4 0.179757
5 0.085986
6 -0.195474
7 0.526903
dtype: float64
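The return type of the function passed to apply() determines the return type of apply() itself. By default:
- If the applied function returns a Series, apply() returns a DataFrame.
- If the applied function returns any other type, the final output is a Series.
This default can be overridden with result_type, which accepts three options: reduce, broadcast and expand. They determine how list-like return values are expanded (or not) into a DataFrame, and they only take effect when axis=1.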
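The effect of result_type is easiest to see side by side. The sketch below uses a made-up two-row DataFrame (df2) and a lambda that returns a list; it is an illustration of the options, not an exhaustive description of them.

```python
import pandas as pd

df2 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

def pair(row):
    """Return a list-like value for each row."""
    return [row["A"], row["A"] + row["B"]]

# Default: list-like results are kept as-is, giving a Series of lists.
as_series = df2.apply(pair, axis=1)

# 'expand': each list is spread across new columns (labeled 0, 1, ...).
expanded = df2.apply(pair, axis=1, result_type="expand")

# 'broadcast': results are fitted back into the original shape,
# keeping the original index and column labels.
broadcast = df2.apply(pair, axis=1, result_type="broadcast")

print(as_series)
print(expanded)
print(broadcast)
```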
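Combined with a little ingenuity, apply() can answer many questions about a dataset. For example, suppose we want the date on which each column reaches its maximum: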
tsdf = pd.DataFrame(np.random.randn(1000,3),columns=['A','B','C'],
                    index=pd.date_range('1/1/2000',periods=1000))
tsdf.apply(lambda x:x.idxmax())  # idxmax() returns the index label of the maximum value
A 2002-07-14
B 2002-01-30
C 2002-07-29
dtype: datetime64[ns]
Additional positional and keyword arguments can also be passed through apply().
Suppose we have the following function:
def subtract_and_divide(x,sub,divide=1):
    return (x-sub)/divide
It can be applied through apply() like this:
df.apply(subtract_and_divide,args=(5,),divide=3)
# df.apply(subtract_and_divide,args=(5,3)) works as well
| | A | B | C |
---|---|---|---|
0 | -1.033798 | -1.748697 | -1.505275 |
1 | -1.675740 | -2.062809 | -1.785054 |
2 | -1.498867 | -1.662409 | -1.790293 |
3 | -2.001503 | -1.823346 | -2.150279 |
4 | -2.235458 | -1.266506 | -1.318279 |
5 | -1.769107 | -1.969381 | -1.175525 |
6 | -2.368834 | -1.722041 | -1.104599 |
7 | -1.460400 | -1.550664 | -1.462033 |
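With random input it is hard to check the arithmetic by eye, so here is the same call on a small made-up DataFrame (nums) whose results are easy to verify. It also confirms that the keyword and positional spellings agree.

```python
import pandas as pd

def subtract_and_divide(x, sub, divide=1):
    """Shift by `sub`, then scale down by `divide`."""
    return (x - sub) / divide

nums = pd.DataFrame({"A": [5.0, 8.0], "B": [11.0, 14.0]})

# Extra positional arguments go in `args`; extra keywords are passed through.
by_keyword = nums.apply(subtract_and_divide, args=(5,), divide=3)
by_position = nums.apply(subtract_and_divide, args=(5, 3))

print(by_keyword)
print(by_keyword.equals(by_position))
```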
Another useful feature is passing Series methods to carry out an operation on each column or row.
tsdf = tsdf[:10]
tsdf.iloc[3:6,:]=np.nan
tsdf
| | A | B | C |
---|---|---|---|
2000-01-01 | 0.575054 | -0.119034 | 0.497757 |
2000-01-02 | -0.882426 | 0.955536 | -0.171076 |
2000-01-03 | 0.238718 | 0.612552 | 0.494587 |
2000-01-04 | NaN | NaN | NaN |
2000-01-05 | NaN | NaN | NaN |
2000-01-06 | NaN | NaN | NaN |
2000-01-07 | 1.010914 | -1.706483 | 0.633015 |
2000-01-08 | 1.036831 | -0.860158 | 1.874723 |
2000-01-09 | -2.653134 | 0.244024 | 0.244642 |
2000-01-10 | -0.546855 | -2.318553 | 1.079924 |
tsdf.apply(pd.Series.interpolate)
| | A | B | C |
---|---|---|---|
2000-01-01 | 0.575054 | -0.119034 | 0.497757 |
2000-01-02 | -0.882426 | 0.955536 | -0.171076 |
2000-01-03 | 0.238718 | 0.612552 | 0.494587 |
2000-01-04 | 0.431767 | 0.032794 | 0.529194 |
2000-01-05 | 0.624816 | -0.546965 | 0.563801 |
2000-01-06 | 0.817865 | -1.126724 | 0.598408 |
2000-01-07 | 1.010914 | -1.706483 | 0.633015 |
2000-01-08 | 1.036831 | -0.860158 | 1.874723 |
2000-01-09 | -2.653134 | 0.244024 | 0.244642 |
2000-01-10 | -0.546855 | -2.318553 | 1.079924 |
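pd.Series.interpolate is an interpolation method. The corresponding DataFrame signature is:
DataFrame.interpolate(method='linear', axis=0, limit=None, inplace=False, limit_direction='forward', limit_area=None, downcast=None, **kwargs)
Interpolation methods:
nearest: nearest-neighbor interpolation
zero: zero-order (stepwise) interpolation
slinear, linear: linear interpolation
quadratic, cubic: second- and third-order spline interpolation (see the official documentation for details)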
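As a small sketch of the default linear method and the limit parameter, consider a made-up Series with a two-value gap. Only method='linear' is shown here, because several of the other methods listed above rely on SciPy being installed.

```python
import numpy as np
import pandas as pd

gap = pd.Series([1.0, np.nan, np.nan, 4.0])

# Linear interpolation fills the gap with evenly spaced values.
linear = gap.interpolate(method="linear")

# `limit=1` fills at most one consecutive NaN; the second stays missing.
limited = gap.interpolate(method="linear", limit=1)

print(linear)
print(limited)
```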
The aggregation API: DataFrame.aggregate() or DataFrame.agg()
The aggregation API lets you express multiple aggregation operations concisely.
tsdf = pd.DataFrame(np.random.randn(10,3),columns=['A','B','C'],
index=pd.date_range('1/1/2000',periods=10))
tsdf.iloc[3:7] = np.nan
tsdf
| | A | B | C |
---|---|---|---|
2000-01-01 | 0.817099 | 0.902683 | 0.621878 |
2000-01-02 | -0.758221 | -0.739626 | 1.062220 |
2000-01-03 | -0.678893 | -0.658816 | 0.330570 |
2000-01-04 | NaN | NaN | NaN |
2000-01-05 | NaN | NaN | NaN |
2000-01-06 | NaN | NaN | NaN |
2000-01-07 | NaN | NaN | NaN |
2000-01-08 | -1.759138 | -1.090475 | -0.056104 |
2000-01-09 | 0.688150 | 1.449854 | -0.793539 |
2000-01-10 | -0.277132 | -1.399196 | 0.671269 |
With a single function, DataFrame.agg() is equivalent to apply(). As with apply(), you can pass the function name as a string.
tsdf.agg(np.sum)
A -1.968135
B -1.535577
C 1.836294
dtype: float64
tsdf.agg('sum')
A -1.968135
B -1.535577
C 1.836294
dtype: float64
tsdf.sum()
A -1.968135
B -1.535577
C 1.836294
dtype: float64
Using a single aggregation on a Series returns a scalar value.
tsdf.A.agg(sum)
-1.9681350319083668
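Aggregating with multiple functions
Multiple aggregation functions can be passed as a list. The result of each function becomes one row of the resulting DataFrame.
Each row is labeled with the name of the function that produced it.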
tsdf.agg(['sum','mean'])
| | A | B | C |
---|---|---|---|
sum | -1.968135 | -1.535577 | 1.836294 |
mean | -0.328023 | -0.255930 | 0.306049 |
tsdf.A.agg(['sum','mean'])
sum -1.968135
mean -0.328023
Name: A, dtype: float64
A lambda expression can be passed as well:
tsdf.A.agg(['sum',lambda x:x.mean()])
sum -1.968135
<lambda> -0.328023
Name: A, dtype: float64
If a named function is passed, its name is used as the row label:
def mymean(x):
    return x.mean()
tsdf.A.agg(['sum',mymean])
sum -1.968135
mymean -0.328023
Name: A, dtype: float64
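The labeling rule can be checked on fixed data. In this sketch, value_range is a hypothetical user-defined function and col is a made-up Series; the point is only that the result is labeled by function name and that missing values are skipped by these aggregations.

```python
import numpy as np
import pandas as pd

col = pd.Series([1.0, 2.0, 3.0, np.nan], name="A")

def value_range(x):
    """Difference between the largest and smallest value."""
    return x.max() - x.min()

# String names and a named function can be mixed in one list.
result = col.agg(["sum", "mean", value_range])

print(result)
```

Giving the function a descriptive name is usually preferable to a lambda here, since every lambda is labeled <lambda> in the output.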
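Aggregating with a dictionary
Passing a dictionary that maps column names to functions lets you choose which aggregation is applied to each column: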
tsdf.agg({'A':'mean','B':'sum'})
A -0.328023
B -1.535577
dtype: float64
Passing a list-like value for any column produces a DataFrame output:
tsdf.agg({'A':['mean','min'],'B':'sum'})
| | A | B |
---|---|---|
mean | -0.328023 | NaN |
min | -1.759138 | NaN |
sum | NaN | -1.535577 |
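The two output shapes can be reproduced on a small made-up DataFrame (scores). This is a sketch of the behavior described above: scalar values in the dictionary give a Series, and a list-like value gives a DataFrame in which combinations that were not requested are NaN.

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [10.0, 20.0, 30.0]})

# One function per column: the result is a Series indexed by column name.
flat = scores.agg({"A": "mean", "B": "sum"})

# A list for at least one column: the result is a DataFrame with one row
# per function name; cells that were not requested are NaN.
table = scores.agg({"A": ["mean", "min"], "B": "sum"})

print(flat)
print(table)
```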
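Mixed dtypes
When the data contains mixed dtypes that cannot all be aggregated, .agg only takes the valid aggregations. This is similar to how groupby .agg works. The output below comes from the pandas version used for this article; newer releases may raise a TypeError for an invalid combination (such as summing a datetime column) instead of returning NaT.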
mdf = pd.DataFrame({'A':[1,2,3],
'B':[1.,2.,3.],
'C':['foo','bar','baz'],
'D':pd.date_range('20130101',periods=3)})
mdf.dtypes
A int64
B float64
C object
D datetime64[ns]
dtype: object
mdf.agg(['min','sum'])
| | A | B | C | D |
---|---|---|---|---|
min | 1 | 1.0 | bar | 2013-01-01 |
sum | 6 | 6.0 | foobarbaz | NaT |
Custom describe
from functools import partial
q_25 = partial(pd.Series.quantile,q=0.25)
q_25.__name__='25%'
q_27 = partial(pd.Series.quantile,q=0.27)
q_27.__name__='27%'
tsdf.agg(['count','mean','std','min',q_25,'median',q_27,'max'])
| | A | B | C |
---|---|---|---|
count | 6.000000 | 6.000000 | 6.000000 |
mean | -0.328023 | -0.255930 | 0.306049 |
std | 0.969822 | 1.153420 | 0.655100 |
min | -1.759138 | -1.399196 | -0.793539 |
25% | -0.738389 | -1.002763 | 0.040564 |
median | -0.478013 | -0.699221 | 0.476224 |
27% | -0.730456 | -0.967678 | 0.079232 |
max | 0.817099 | 1.449854 | 1.062220 |
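An alternative to renaming functools.partial objects is a small factory that builds named quantile functions. The helper make_quantile below is hypothetical (it is not part of pandas), and the sample data is made up so the quantiles are easy to check.

```python
import pandas as pd

def make_quantile(q):
    """Hypothetical helper: build a function computing the q-th quantile,
    labeled like '25%' so it reads well as a row label."""
    def quantile(x):
        return x.quantile(q)
    quantile.__name__ = f"{q:.0%}"
    return quantile

data = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0, 5.0]})

summary = data.agg(["count", "mean", make_quantile(0.25),
                    "median", make_quantile(0.75), "max"])

print(summary)
```

The design choice is the same as in the partial version: .agg takes the row label from the function's __name__, so setting that attribute controls how the row is labeled.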
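Transform API
The transform() method returns an object that is indexed the same as the original (and has the same size).
This API lets you provide multiple operations at once rather than one by one. Its interface is very similar to the .agg API.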
tsdf = pd.DataFrame(np.random.randn(10,3),columns=['A','B','C'],
index=pd.date_range('1/1/2000',periods=10))
tsdf.iloc[3:7] = np.nan
tsdf
| | A | B | C |
---|---|---|---|
2000-01-01 | 0.285947 | -0.343464 | -1.182469 |
2000-01-02 | -0.258947 | -0.743428 | 0.317739 |
2000-01-03 | -1.006815 | -0.427918 | -0.081971 |
2000-01-04 | NaN | NaN | NaN |
2000-01-05 | NaN | NaN | NaN |
2000-01-06 | NaN | NaN | NaN |
2000-01-07 | NaN | NaN | NaN |
2000-01-08 | 0.992010 | 0.058131 | 0.508194 |
2000-01-09 | 0.926343 | -0.835812 | 0.893153 |
2000-01-10 | -1.927433 | -0.699219 | 1.873901 |
transform() accepts NumPy functions, string function names and user-defined functions.
tsdf.transform(np.abs)
| | A | B | C |
---|---|---|---|
2000-01-01 | 0.285947 | 0.343464 | 1.182469 |
2000-01-02 | 0.258947 | 0.743428 | 0.317739 |
2000-01-03 | 1.006815 | 0.427918 | 0.081971 |
2000-01-04 | NaN | NaN | NaN |
2000-01-05 | NaN | NaN | NaN |
2000-01-06 | NaN | NaN | NaN |
2000-01-07 | NaN | NaN | NaN |
2000-01-08 | 0.992010 | 0.058131 | 0.508194 |
2000-01-09 | 0.926343 | 0.835812 | 0.893153 |
2000-01-10 | 1.927433 | 0.699219 | 1.873901 |
tsdf.transform(abs)
| | A | B | C |
---|---|---|---|
2000-01-01 | 0.285947 | 0.343464 | 1.182469 |
2000-01-02 | 0.258947 | 0.743428 | 0.317739 |
2000-01-03 | 1.006815 | 0.427918 | 0.081971 |
2000-01-04 | NaN | NaN | NaN |
2000-01-05 | NaN | NaN | NaN |
2000-01-06 | NaN | NaN | NaN |
2000-01-07 | NaN | NaN | NaN |
2000-01-08 | 0.992010 | 0.058131 | 0.508194 |
2000-01-09 | 0.926343 | 0.835812 | 0.893153 |
2000-01-10 | 1.927433 | 0.699219 | 1.873901 |
tsdf.transform(lambda x:abs(x))
| | A | B | C |
---|---|---|---|
2000-01-01 | 0.285947 | 0.343464 | 1.182469 |
2000-01-02 | 0.258947 | 0.743428 | 0.317739 |
2000-01-03 | 1.006815 | 0.427918 | 0.081971 |
2000-01-04 | NaN | NaN | NaN |
2000-01-05 | NaN | NaN | NaN |
2000-01-06 | NaN | NaN | NaN |
2000-01-07 | NaN | NaN | NaN |
2000-01-08 | 0.992010 | 0.058131 | 0.508194 |
2000-01-09 | 0.926343 | 0.835812 | 0.893153 |
2000-01-10 | 1.927433 | 0.699219 | 1.873901 |
When transform() is passed a single function, it is equivalent to applying the ufunc directly:
np.abs(tsdf)
| | A | B | C |
---|---|---|---|
2000-01-01 | 0.285947 | 0.343464 | 1.182469 |
2000-01-02 | 0.258947 | 0.743428 | 0.317739 |
2000-01-03 | 1.006815 | 0.427918 | 0.081971 |
2000-01-04 | NaN | NaN | NaN |
2000-01-05 | NaN | NaN | NaN |
2000-01-06 | NaN | NaN | NaN |
2000-01-07 | NaN | NaN | NaN |
2000-01-08 | 0.992010 | 0.058131 | 0.508194 |
2000-01-09 | 0.926343 | 0.835812 | 0.893153 |
2000-01-10 | 1.927433 | 0.699219 | 1.873901 |
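The difference between transform() and agg() comes down to the shape of the result, which the following sketch checks on a made-up DataFrame (frame). The last part assumes that transform() rejects a reducing function with a ValueError, which is the behavior in the pandas releases I am aware of.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"A": [-1.0, 2.0, np.nan], "B": [3.0, -4.0, 5.0]},
                     index=["x", "y", "z"])

# transform keeps the original index and shape.
transformed = frame.transform(np.abs)

# agg reduces each column to a single value.
aggregated = frame.agg("sum")

# Assumption: passing a reducing function to transform raises ValueError.
try:
    frame.transform("sum")
    transform_rejected_aggregation = False
except ValueError:
    transform_rejected_aggregation = True

print(transformed)
print(aggregated)
print(transform_rejected_aggregation)
```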
Passing multiple functions
tsdf.transform([np.abs,lambda x:x+1])
| | A | A | B | B | C | C |
---|---|---|---|---|---|---|
| | absolute | <lambda> | absolute | <lambda> | absolute | <lambda> |
2000-01-01 | 0.285947 | 1.285947 | 0.343464 | 0.656536 | 1.182469 | -0.182469 |
2000-01-02 | 0.258947 | 0.741053 | 0.743428 | 0.256572 | 0.317739 | 1.317739 |
2000-01-03 | 1.006815 | -0.006815 | 0.427918 | 0.572082 | 0.081971 | 0.918029 |
2000-01-04 | NaN | NaN | NaN | NaN | NaN | NaN |
2000-01-05 | NaN | NaN | NaN | NaN | NaN | NaN |
2000-01-06 | NaN | NaN | NaN | NaN | NaN | NaN |
2000-01-07 | NaN | NaN | NaN | NaN | NaN | NaN |
2000-01-08 | 0.992010 | 1.992010 | 0.058131 | 1.058131 | 0.508194 | 1.508194 |
2000-01-09 | 0.926343 | 1.926343 | 0.835812 | 0.164188 | 0.893153 | 1.893153 |
2000-01-10 | 1.927433 | -0.927433 | 0.699219 | 0.300781 | 1.873901 | 2.873901 |
Passing a dictionary
Passing a dict of functions allows selective transformation of each column.
tsdf.transform({'A': np.abs, 'B': lambda x: x + 1})
| | A | B |
---|---|---|
2000-01-01 | 0.285947 | 0.656536 |
2000-01-02 | 0.258947 | 0.256572 |
2000-01-03 | 1.006815 | 0.572082 |
2000-01-04 | NaN | NaN |
2000-01-05 | NaN | NaN |
2000-01-06 | NaN | NaN |
2000-01-07 | NaN | NaN |
2000-01-08 | 0.992010 | 1.058131 |
2000-01-09 | 0.926343 | 0.164188 |
2000-01-10 | 1.927433 | 0.300781 |
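One detail worth noticing in the output above is that column C is missing: only the columns named in the dictionary appear in the result. A small check on made-up data (frame) illustrates this.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"A": [-1.0, 2.0], "B": [3.0, -4.0], "C": [0.0, 1.0]})

# Each key selects a column; each value is the function applied to it.
out = frame.transform({"A": np.abs, "B": lambda x: x + 1})

print(out)
```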
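Element-wise function application
Not all functions can be vectorized (that is, accept a NumPy array and return another array or value). For this reason, the applymap() method on DataFrame and, analogously, map() on Series accept any Python function that takes a single value and returns a single value. In pandas 2.1 and later, DataFrame.applymap() is deprecated in favor of DataFrame.map().
For example: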
df4 = pd.DataFrame(np.random.randn(4,3),index=['a','b','c','d'],columns=['one','two','three'])
df4.iloc[0,2]=np.nan
df4.iloc[3,0]=np.nan
df4
| | one | two | three |
---|---|---|---|
a | 0.121490 | -0.137162 | NaN |
b | -1.329558 | -1.154316 | -2.019252 |
c | -0.402782 | -2.080388 | 0.148970 |
d | NaN | -0.755811 | 0.648054 |
def f(x):
    return len(str(x))
df4['one'].map(f)
a 18
b 19
c 20
d 3
Name: one, dtype: int64
df4.applymap(f)
| | one | two | three |
---|---|---|---|
a | 18 | 20 | 3 |
b | 19 | 19 | 18 |
c | 20 | 19 | 18 |
d | 3 | 19 | 18 |
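Since applymap() has been renamed in recent pandas releases, a version-tolerant sketch might look like the following. The function text_length and the sample DataFrame are made up for this example; the hasattr check is just one possible way to support both old and new versions.

```python
import numpy as np
import pandas as pd

values = pd.DataFrame({"one": [1.5, np.nan], "two": [-2.25, 10.0]})

def text_length(x):
    """Length of the value's string representation."""
    return len(str(x))

# DataFrame.map exists from pandas 2.1; fall back to applymap on older versions.
if hasattr(pd.DataFrame, "map"):
    lengths = values.map(text_length)
else:
    lengths = values.applymap(text_length)

# Series.map has the same element-wise behavior for a single column.
one_lengths = values["one"].map(text_length)

print(lengths)
print(one_lengths)
```

Note that NaN is passed to the function like any other value, so it is counted as the three-character string "nan".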
map() has an additional feature: it can be used to easily "link" or "map" values defined by a secondary Series. This is closely related to the merging/joining functionality:
s = pd.Series(['six','seven','six','seven','six'],
index=['a','b','c','d','e'])
t = pd.Series({'six':6,'seven':7})
t
six 6
seven 7
dtype: int64
s
a six
b seven
c six
d seven
e six
dtype: object
s.map(t)
a 6
b 7
c 6
d 7
e 6
dtype: int64
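A plain dict works as the lookup too, and it is worth knowing what happens when a value has no match. In this sketch the names words and lookup are made up; unmatched values come back as NaN, which also turns the result into floats.

```python
import numpy as np
import pandas as pd

words = pd.Series(["six", "seven", "eight"])
lookup = {"six": 6, "seven": 7}

# Values found in the lookup are replaced; anything else becomes NaN.
mapped = words.map(lookup)

# A Series used as the lookup behaves the same way (matched on its index).
mapped_from_series = words.map(pd.Series(lookup))

print(mapped)
```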