【Jupyter notebook 】更好的阅读体验!
接上一部分
轴向连接
另一种数据合并运算也被称作连接(concatenation)、绑定(binding)或堆叠(stacking)。NumPy的concatenation函数可以用NumPy数组来做:
import numpy as np
import pandas as pd
arr = np. arange( 12 ) . reshape( ( 3 , 4 ) )
arr
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
np. concatenate( [ arr, arr] , axis= 1 )
array([[ 0, 1, 2, 3, 0, 1, 2, 3],
[ 4, 5, 6, 7, 4, 5, 6, 7],
[ 8, 9, 10, 11, 8, 9, 10, 11]])
np. concatenate( [ arr, arr] , axis= 0 )
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
pandas 的concat()函数
s1 = pd. Series( [ 0 , 1 ] , index= [ 'a' , 'b' ] )
s2 = pd. Series( [ 2 , 3 , 4 ] , index= [ 'c' , 'd' , 'e' ] )
s3 = pd. Series( [ 5 , 6 ] , index= [ 'f' , 'g' ] )
pd. concat( [ s1, s2, s3] )
a 0
b 1
c 2
d 3
e 4
f 5
g 6
dtype: int64
默认情况下是axis=0,也可以传入参数在列上合并,这样结果就会变成一个DataFrame
pd. concat( [ s1, s2, s3] , axis= 1 , sort= False )
0 1 2 a 0.0 NaN NaN b 1.0 NaN NaN c NaN 2.0 NaN d NaN 3.0 NaN e NaN 4.0 NaN f NaN NaN 5.0 g NaN NaN 6.0
s4 = pd. concat( [ s1, s3] )
s4
a 0
b 1
f 5
g 6
dtype: int64
pd. concat( [ s1, s4] , axis= 1 , sort= False )
pd. concat( [ s1, s4] , axis= 1 , join= 'inner' )
通过join_axes指定要在其它轴上使用的索引:
pd. concat( [ s1, s4] , axis= 1 , join_axes= [ [ 'a' , 'c' , 'b' , 'e' ] ] )
0 1 a 0.0 0.0 c NaN NaN b 1.0 1.0 e NaN NaN
result = pd. concat( [ s1, s2, s3] , keys = [ 'one' , 'two' , 'three' ] )
result
one a 0
b 1
two c 2
d 3
e 4
three f 5
g 6
dtype: int64
result. unstack( )
a b c d e f g one 0.0 1.0 NaN NaN NaN NaN NaN two NaN NaN 2.0 3.0 4.0 NaN NaN three NaN NaN NaN NaN NaN 5.0 6.0
如果沿着axis=1 进行合并,则keys就会成为列头
pd. concat( [ s1, s2, s3] , axis= 1 , keys= [ 'one' , 'two' , 'three' ] , sort= False )
one two three a 0.0 NaN NaN b 1.0 NaN NaN c NaN 2.0 NaN d NaN 3.0 NaN e NaN 4.0 NaN f NaN NaN 5.0 g NaN NaN 6.0
df1 = pd. DataFrame( np. arange( 6 ) . reshape( ( 3 , 2 ) ) ,
index = [ 'a' , 'b' , 'c' ] ,
columns= [ 'one' , 'two' ] )
df2 = pd. DataFrame( 5 + np. arange( 4 ) . reshape( ( 2 , 2 ) ) ,
index = [ 'a' , 'c' ] ,
columns = [ 'three' , 'four' ] )
df1
df2
pd. concat( [ df1, df2] , axis= 1 , keys= [ 'level1' , 'level2' ] , sort= False )
level1 level2 one two three four a 0 1 5.0 6.0 b 2 3 NaN NaN c 4 5 7.0 8.0
如果传入的不是一个列表,而是一个字典,那么字典的键就会被当作keys的选项
pd. concat( { 'level1' : df1, 'level2' : df2} , axis= 1 , sort= False )
level1 level2 one two three four a 0 1 5.0 6.0 b 2 3 NaN NaN c 4 5 7.0 8.0
pd. concat( [ df1, df2] , axis= 1 , keys= [ 'level1' , 'level2' ] , names= [ 'upper' , 'lower' ] , sort= False )
upper level1 level2 lower one two three four a 0 1 5.0 6.0 b 2 3 NaN NaN c 4 5 7.0 8.0
concat的参数如下:
合并重叠数据集
a = pd. Series( [ np. nan, 2.5 , np. nan, 3.5 , 4.5 , np. nan] , index= [ 'f' , 'e' , 'd' , 'c' , 'b' , 'a' ] )
b = pd. Series( np. arange( len ( a) , dtype= np. float64) , index= a. index)
a
f NaN
e 2.5
d NaN
c 3.5
b 4.5
a NaN
dtype: float64
b[ - 1 ] = np. nan
b
f 0.0
e 1.0
d 2.0
c 3.0
b 4.0
a NaN
dtype: float64
np. where( pd. isnull( a) , b, a)
array([0. , 2.5, 2. , 3.5, 4.5, nan])
b[ : - 2 ] . combine_first( a[ 1 : ] )
a NaN
b 4.5
c 3.0
d 2.0
e 1.0
f 0.0
dtype: float64
注: np.where(condition, x, y)满足condition则输出x,否则输出y.上面的代码:如果a的元素是nan则输出b对应的元素
对于DataFrame,combine_first会做同样的事情
df1 = pd. DataFrame( { 'a' : [ 1 . , np. nan, 5 . , np. nan] ,
'b' : [ np. nan, 2 . , np. nan, 6 . ] ,
'c' : range ( 2 , 18 , 4 ) } )
df2 = pd. DataFrame( { 'a' : [ 5 . , 4 . , np. nan, 3 . , 7 . ] ,
'b' : [ np. nan, 3 . , 4 . , 6 . , 8 . ] } )
df1
a b c 0 1.0 NaN 2 1 NaN 2.0 6 2 5.0 NaN 10 3 NaN 6.0 14
df2
a b 0 5.0 NaN 1 4.0 3.0 2 NaN 4.0 3 3.0 6.0 4 7.0 8.0
df1. combine_first( df2)
a b c 0 1.0 NaN 2.0 1 4.0 2.0 6.0 2 5.0 4.0 10.0 3 3.0 6.0 14.0 4 7.0 8.0 NaN
8.3重塑和轴向旋转
重塑层次化索引
主要方法有:
stack():将列旋转为行 unstack():将行旋转为列
data = pd. DataFrame( np. arange( 6 ) . reshape( ( 2 , 3 ) ) ,
index= pd. Index( [ 'Ohio' , 'Colordo' ] , name= 'state' ) ,
columns= pd. Index( [ 'one' , 'two' , 'three' ] , name= 'number' ) )
data
number one two three state Ohio 0 1 2 Colordo 3 4 5
result = data. stack( )
result
state number
Ohio one 0
two 1
three 2
Colordo one 3
two 4
three 5
dtype: int32
result. unstack( )
number one two three state Ohio 0 1 2 Colordo 3 4 5
默认情况下,unstack操作的是最内层(stack也是如此)。传入分层级别的编号或名称即可对其它级别进行unstack操作:
result. unstack( 0 )
state Ohio Colordo number one 0 3 two 1 4 three 2 5
result. unstack( 'state' )
state Ohio Colordo number one 0 3 two 1 4 three 2 5
s1 = pd. Series( [ 0 , 1 , 2 , 3 ] , index= [ 'a' , 'b' , 'c' , 'd' ] )
s2 = pd. Series( [ 4 , 5 , 6 ] , index= [ 'c' , 'd' , 'e' ] )
data = pd. concat( [ s1, s2] , keys= [ 'one' , 'two' ] )
data
one a 0
b 1
c 2
d 3
two c 4
d 5
e 6
dtype: int64
data. unstack( )
a b c d e one 0.0 1.0 2.0 3.0 NaN two NaN NaN 4.0 5.0 6.0
data. unstack( ) . stack( )
one a 0.0
b 1.0
c 2.0
d 3.0
two c 4.0
d 5.0
e 6.0
dtype: float64
data. unstack( ) . stack( dropna= False )
one a 0.0
b 1.0
c 2.0
d 3.0
e NaN
two a NaN
b NaN
c 4.0
d 5.0
e 6.0
dtype: float64
DataFrame进行unstack操作时,作为旋转轴的级别将会成为结果中的最低级别:
df = pd. DataFrame( { 'left' : result, 'right' : result + 5 } ,
columns= pd. Index( [ 'left' , 'right' ] , name= 'side' ) )
df
side left right state number Ohio one 0 5 two 1 6 three 2 7 Colordo one 3 8 two 4 9 three 5 10
df. unstack( 'state' )
side left right state Ohio Colordo Ohio Colordo number one 0 3 5 8 two 1 4 6 9 three 2 5 7 10
df. unstack( 'state' ) . stack( 'side' )
state Colordo Ohio number side one left 3 0 right 8 5 two left 4 1 right 9 6 three left 5 2 right 10 7
将长格式转换为宽格式
多个时间序列数据通常是以所谓的“长格式”(long)或“堆叠格式”(stacked)存储在数据库和CSV中的。
data = pd. read_csv( 'data/examples/macrodata.csv' )
data. head( )
year quarter realgdp realcons realinv realgovt realdpi cpi m1 tbilrate unemp pop infl realint 0 1959.0 1.0 2710.349 1707.4 286.898 470.045 1886.9 28.98 139.7 2.82 5.8 177.146 0.00 0.00 1 1959.0 2.0 2778.801 1733.7 310.859 481.301 1919.7 29.15 141.7 3.08 5.1 177.830 2.34 0.74 2 1959.0 3.0 2775.488 1751.8 289.226 491.260 1916.4 29.35 140.5 3.82 5.3 178.657 2.74 1.09 3 1959.0 4.0 2785.204 1753.7 299.356 484.052 1931.3 29.37 140.0 4.33 5.6 179.386 0.27 4.06 4 1960.0 1.0 2847.699 1770.5 331.722 462.199 1955.5 29.54 139.6 3.50 5.2 180.007 2.31 1.19
periods = pd. PeriodIndex( year= data. year, quarter= data. quarter, name= 'data' )
columns = pd. Index( [ 'realgdp' , 'infl' , 'unemp' ] , name= 'item' )
data = data. reindex( columns= columns)
data. index = periods. to_timestamp( 'D' , 'end' )
ldata = data. stack( ) . reset_index( ) . rename( columns= { 0 : 'value' } )
ldata[ : 5 ]
data item value 0 1959-03-31 realgdp 2710.349 1 1959-03-31 infl 0.000 2 1959-03-31 unemp 5.800 3 1959-06-30 realgdp 2778.801 4 1959-06-30 infl 2.340
pivoted = ldata. pivot( 'data' , 'item' , 'value' )
pivoted[ : 5 ]
item infl realgdp unemp data 1959-03-31 0.00 2710.349 5.8 1959-06-30 2.34 2778.801 5.1 1959-09-30 2.74 2775.488 5.3 1959-12-31 0.27 2785.204 5.6 1960-03-31 2.31 2847.699 5.2
将宽格式旋转为长格式
df = pd. DataFrame( { 'key' : [ 'foo' , 'bar' , 'baz' ] ,
'A' : [ 1 , 2 , 3 ] ,
'B' : [ 4 , 5 , 6 ] ,
'C' : [ 7 , 8 , 9 ] } )
df
key A B C 0 foo 1 4 7 1 bar 2 5 8 2 baz 3 6 9
melted = pd. melt( df, [ 'key' ] )
melted
key variable value 0 foo A 1 1 bar A 2 2 baz A 3 3 foo B 4 4 bar B 5 5 baz B 6 6 foo C 7 7 bar C 8 8 baz C 9
reshaped = melted. pivot( 'key' , 'variable' , 'value' )
reshaped
variable A B C key bar 2 5 8 baz 3 6 9 foo 1 4 7
reshaped. reset_index( )
variable key A B C 0 bar 2 5 8 1 baz 3 6 9 2 foo 1 4 7