第七章 数据清洗与准备
7.1 处理缺失值
pandas对象的所有描述性统计信息默认排除缺失值。对于数值型数据,pandas使用浮点值NaN(Not a Number)表示缺失值。pandas沿用了R语言的编程惯例,将缺失值称为NA,意思是not available(不可用)。
import numpy as np
import pandas as pd

string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data
0 aardvark
1 artichoke
2 NaN
3 avocado
dtype: object
string_data.isnull()
0 False
1 False
2 True
3 False
dtype: bool
string_data[0] = None
string_data.isnull()
0 True
1 False
2 True
3 False
dtype: bool
NA处理方法:
- dropna:根据每个标签的值是否是缺失数据来筛选轴标签,并根据允许丢失的数据量来确定阈值
- fillna:用某些值填充缺失的数据,或使用插值方法(如'ffill'或'bfill')
- isnull:返回表明哪些值是缺失值的布尔值
- notnull:isnull的反函数
7.1.1 过滤缺失值
from numpy import nan as NA
data = pd.Series([1, NA, 3.5, NA, 7])
data
0 1.0
1 NaN
2 3.5
3 NaN
4 7.0
dtype: float64
data.dropna()
0 1.0
2 3.5
4 7.0
dtype: float64
data[data.notnull()]
0 1.0
2 3.5
4 7.0
dtype: float64
data = pd.DataFrame([[1, 6.5, 3], [1, NA, NA], [NA, NA, NA], [NA, 6.5, 3]])
data
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
data.dropna(how='all')
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0
data[4] = NA
data
     0    1    2   4
0  1.0  6.5  3.0 NaN
1  1.0  NaN  NaN NaN
2  NaN  NaN  NaN NaN
3  NaN  6.5  3.0 NaN
data.dropna(axis=1, how='all')
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
df = pd.DataFrame(np.random.randn(7, 3))
df
          0         1         2
0 -0.100288  0.117081  0.629897
1  0.145224  0.827820 -0.197561
2 -1.372610 -0.521075  0.783224
3 -0.679339  0.355698 -1.283404
4 -1.587708  0.254616  0.149215
5 -0.323276 -0.393636 -1.828212
6 -0.639610 -1.677821  1.618943
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA
df
          0         1         2
0 -0.100288       NaN       NaN
1  0.145224       NaN       NaN
2 -1.372610       NaN  0.783224
3 -0.679339       NaN -1.283404
4 -1.587708  0.254616  0.149215
5 -0.323276 -0.393636 -1.828212
6 -0.639610 -1.677821  1.618943
df.dropna()
          0         1         2
4 -1.587708  0.254616  0.149215
5 -0.323276 -0.393636 -1.828212
6 -0.639610 -1.677821  1.618943
df.dropna(thresh=2)
          0         1         2
2 -1.372610       NaN  0.783224
3 -0.679339       NaN -1.283404
4 -1.587708  0.254616  0.149215
5 -0.323276 -0.393636 -1.828212
6 -0.639610 -1.677821  1.618943
7.1.2 补全缺失值
主要使用fillna方法来补全缺失值。调用fillna时,可以传入一个常数来替代缺失值。fillna函数参数如下:
- value:标量值或字典型对象,用于填充缺失值
- method:插值方法,如果没有其他参数,默认是'ffill'
- axis:需要填充的轴,默认axis=0
- inplace:修改被调用的对象,而不是生成一个备份
- limit:用于前向或后向填充时最大的填充范围
df.fillna(0)
          0         1         2
0 -0.100288  0.000000  0.000000
1  0.145224  0.000000  0.000000
2 -1.372610  0.000000  0.783224
3 -0.679339  0.000000 -1.283404
4 -1.587708  0.254616  0.149215
5 -0.323276 -0.393636 -1.828212
6 -0.639610 -1.677821  1.618943
df.fillna({1: 0.5, 2: 0})
          0         1         2
0 -0.100288  0.500000  0.000000
1  0.145224  0.500000  0.000000
2 -1.372610  0.500000  0.783224
3 -0.679339  0.500000 -1.283404
4 -1.587708  0.254616  0.149215
5 -0.323276 -0.393636 -1.828212
6 -0.639610 -1.677821  1.618943
_ = df.fillna(0, inplace=True)
df
          0         1         2
0 -0.100288  0.000000  0.000000
1  0.145224  0.000000  0.000000
2 -1.372610  0.000000  0.783224
3 -0.679339  0.000000 -1.283404
4 -1.587708  0.254616  0.149215
5 -0.323276 -0.393636 -1.828212
6 -0.639610 -1.677821  1.618943
df = pd.DataFrame(np.random.randn(6, 3))
df
          0         1         2
0 -0.428405  0.199383  0.354342
1  0.019782  0.921389  0.534736
2 -0.583158  0.390681 -2.386976
3 -0.076475 -0.034995  1.635065
4  0.528814  0.711717  0.696243
5 -0.193577  0.162206 -0.520191
df.iloc[2:, 1] = NA
df
          0         1         2
0 -0.428405  0.199383  0.354342
1  0.019782  0.921389  0.534736
2 -0.583158       NaN -2.386976
3 -0.076475       NaN  1.635065
4  0.528814       NaN  0.696243
5 -0.193577       NaN -0.520191
df.iloc[4:, 2] = NA
df
          0         1         2
0 -0.428405  0.199383  0.354342
1  0.019782  0.921389  0.534736
2 -0.583158       NaN -2.386976
3 -0.076475       NaN  1.635065
4  0.528814       NaN       NaN
5 -0.193577       NaN       NaN
df.fillna(method='ffill')
          0         1         2
0 -0.428405  0.199383  0.354342
1  0.019782  0.921389  0.534736
2 -0.583158  0.921389 -2.386976
3 -0.076475  0.921389  1.635065
4  0.528814  0.921389  1.635065
5 -0.193577  0.921389  1.635065
df.fillna(method='backfill')
          0         1         2
0 -0.428405  0.199383  0.354342
1  0.019782  0.921389  0.534736
2 -0.583158       NaN -2.386976
3 -0.076475       NaN  1.635065
4  0.528814       NaN       NaN
5 -0.193577       NaN       NaN
df.fillna(method='ffill', limit=2)
          0         1         2
0 -0.428405  0.199383  0.354342
1  0.019782  0.921389  0.534736
2 -0.583158  0.921389 -2.386976
3 -0.076475  0.921389  1.635065
4  0.528814       NaN  1.635065
5 -0.193577       NaN  1.635065
data = pd.Series([5, NA, 3, NA, 7])
data.fillna(data.mean())
0 5.0
1 5.0
2 3.0
3 5.0
4 7.0
dtype: float64
data.mean()
5.0
7.2 数据转换
7.2.1 删除重复值
data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})
data
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
6  two   4
data.duplicated()
0 False
1 False
2 False
3 False
4 False
5 False
6 True
dtype: bool
data.drop_duplicates()
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
data['v1'] = range(7)
data
    k1  k2  v1
0  one   1   0
1  two   1   1
2  one   2   2
3  two   3   3
4  one   3   4
5  two   4   5
6  two   4   6
data.drop_duplicates(['k1'])
data.drop_duplicates(['k1', 'k2'], keep='last')
    k1  k2  v1
0  one   1   0
1  two   1   1
2  one   2   2
3  two   3   3
4  one   3   4
6  two   4   6
7.2.2 使用函数或映射进行数据转换
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 'pastrami',
                              'corned beef', 'bacon', 'pastrami', 'honey ham',
                              'nova lox'],
                     'ounces': [4.0, 3.0, 12.0, 6.0, 7.5, 8.0, 3.0, 5.0, 6.0]})
data
          food  ounces
0        bacon     4.0
1  pulled pork     3.0
2        bacon    12.0
3     pastrami     6.0
4  corned beef     7.5
5        bacon     8.0
6     pastrami     3.0
7    honey ham     5.0
8     nova lox     6.0
meat_to_animal = {
    'bacon': 'pig',
    'pulled pork': 'pig',
    'pastrami': 'cow',
    'corned beef': 'cow',
    'honey ham': 'pig',
    'nova lox': 'salmon',
}
lowercased = data['food'].str.lower()
lowercased
0 bacon
1 pulled pork
2 bacon
3 pastrami
4 corned beef
5 bacon
6 pastrami
7 honey ham
8 nova lox
Name: food, dtype: object
data['animal'] = lowercased.map(meat_to_animal)
data
          food  ounces  animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     pastrami     6.0     cow
4  corned beef     7.5     cow
5        bacon     8.0     pig
6     pastrami     3.0     cow
7    honey ham     5.0     pig
8     nova lox     6.0  salmon
data['food'].map(lambda x: meat_to_animal[x.lower()])
0 pig
1 pig
2 pig
3 cow
4 cow
5 pig
6 cow
7 pig
8 salmon
Name: food, dtype: object
7.2.3 替代值
data.replace方法与data.str.replace方法是不同的:前者按整个值进行匹配和替换,后者对每个字符串元素内部做子串替换。
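下面用一个小例子演示二者的差别(示例数据为笔者虚构,仅作示意):

s = pd.Series(['a-1', 'b-2', 'c-3'])
# Series.replace按"整个值"匹配:只有完全等于'a-1'的元素才会被替换成'x'
s.replace('a-1', 'x')
# Series.str.replace对每个元素内部做子串替换:所有'-'都被替换为':'
s.str.replace('-', ':')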
data = pd.Series([1, -999, 2, -999, -1000, 3])
data
0 1
1 -999
2 2
3 -999
4 -1000
5 3
dtype: int64
data.replace(-999, np.nan)
0 1.0
1 NaN
2 2.0
3 NaN
4 -1000.0
5 3.0
dtype: float64
data.replace([-999, -1000], np.nan)
0 1.0
1 NaN
2 2.0
3 NaN
4 NaN
5 3.0
dtype: float64
data.replace([-999, -1000], [np.nan, 0])
0 1.0
1 NaN
2 2.0
3 NaN
4 0.0
5 3.0
dtype: float64
data.replace({-999: np.nan, -1000: 0})
0 1.0
1 NaN
2 2.0
3 NaN
4 0.0
5 3.0
dtype: float64
7.2.4 重命名轴索引
data = pd.DataFrame(np.arange(12).reshape(3, 4),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
New York    8    9     10    11
transform = lambda x: x[:4].upper()
data.index.map(transform)
Index(['OHIO', 'COLO', 'NEW '], dtype='object')
data.index = data.index.map(transform)
data
      one  two  three  four
OHIO    0    1      2     3
COLO    4    5      6     7
NEW     8    9     10    11
data.rename(index=str.title, columns=str.upper)
      ONE  TWO  THREE  FOUR
Ohio    0    1      2     3
Colo    4    5      6     7
New     8    9     10    11
data.rename(index={'OHIO': 'INDIANA'},
            columns={'three': 'peekaboo'})
         one  two  peekaboo  four
INDIANA    0    1         2     3
COLO       4    5         6     7
NEW        8    9        10    11
7.2.5 离散化和分箱
ages = [20, 22, 24, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
cats.codes
array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)
cats.categories
IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]],
              closed='right',
              dtype='interval[int64]')
pd.value_counts(cats)
(18, 25] 5
(25, 35] 3
(35, 60] 3
(60, 100] 1
dtype: int64
pd.cut(ages, [18, 26, 36, 61, 100], right=False)
[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]
group_names = ['youth', 'youngadult', 'middleaged', 'senior']
a = pd.cut(ages, bins, labels=group_names)
pd.value_counts(a)
youth 5
youngadult 3
middleaged 3
senior 1
dtype: int64
data = np.random.rand(20)
pd.cut(data, 4, precision=2)
[(0.51, 0.74], (0.29, 0.51], (0.74, 0.97], (0.29, 0.51], (0.06, 0.29], ..., (0.06, 0.29], (0.29, 0.51], (0.74, 0.97], (0.51, 0.74], (0.74, 0.97]]
Length: 20
Categories (4, interval[float64]): [(0.06, 0.29] < (0.29, 0.51] < (0.51, 0.74] < (0.74, 0.97]]
data = np.random.randn(1000)
cats = pd.qcut(data, 4)
cats
[(-0.00707, 0.65], (-0.00707, 0.65], (-2.936, -0.626], (-0.626, -0.00707], (-2.936, -0.626], ..., (-0.626, -0.00707], (-0.626, -0.00707], (-0.626, -0.00707], (-0.626, -0.00707], (-0.00707, 0.65]]
Length: 1000
Categories (4, interval[float64]): [(-2.936, -0.626] < (-0.626, -0.00707] < (-0.00707, 0.65] < (0.65, 3.139]]
pd.value_counts(cats)
(-2.936, -0.626] 250
(-0.626, -0.00707] 250
(-0.00707, 0.65] 250
(0.65, 3.139] 250
dtype: int64
pd.cut(data, [0, 0.1, 0.5, 0.9, 1])
[(0.5, 0.9], (0.1, 0.5], NaN, NaN, NaN, ..., NaN, NaN, NaN, NaN, (0.1, 0.5]]
Length: 1000
Categories (4, interval[float64]): [(0.0, 0.1] < (0.1, 0.5] < (0.5, 0.9] < (0.9, 1.0]]
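除了传入箱数,qcut还可以传入自定义的分位数(0到1之间、包含端点的数值),下面是一个简单示意(沿用上文的data):

# 按10%、40%、40%、10%的分位数切分,各箱的样本量由分位数决定
pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.])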
7.2.6 检测和过滤异常值
data = pd.DataFrame(np.random.randn(1000, 4))
data.describe()
                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean      0.013343     0.030142     0.020312     0.042330
std       1.012528     0.984443     0.999869     0.982124
min      -2.942920    -3.799121    -3.412855    -2.632107
25%      -0.668303    -0.629645    -0.654843    -0.643005
50%       0.010349     0.040064     0.026197     0.028003
75%       0.701525     0.679371     0.706170     0.714993
max       3.274496     3.998493     3.264216     2.907744
col = data[2]
col[np.abs(col) > 3]
91 -3.044972
711 3.264216
858 -3.412855
Name: 2, dtype: float64
data[(np.abs(data) > 3).any(1)]
            0         1         2         3
91  -0.341046 -0.555910 -3.044972  0.474512
325  2.233400 -3.027404  0.845704  1.441757
332 -0.460361 -3.799121 -0.312931  0.478548
457  0.011004  3.998493  0.977419  0.577620
711 -0.603762 -1.650901  3.264216 -0.803395
746  1.455624 -3.178085 -0.387140  0.859193
858 -2.127923  0.163924 -3.412855 -0.073186
946  3.274496 -0.699596 -1.016879  0.358252
data[np.abs(data) > 3] = np.sign(data) * 3
data.describe()
                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean      0.013069     0.030148     0.020506     0.042330
std       1.011680     0.977459     0.997573     0.982124
min      -2.942920    -3.000000    -3.000000    -2.632107
25%      -0.668303    -0.629645    -0.654843    -0.643005
50%       0.010349     0.040064     0.026197     0.028003
75%       0.701525     0.679371     0.706170     0.714993
max       3.000000     3.000000     3.000000     2.907744
data
            0         1         2         3
0    0.997285  0.352539 -0.158277 -0.069519
1   -1.144523 -0.173312 -0.651227  0.686972
2    0.650131  0.271325 -0.304344 -0.281217
3    0.527442 -2.023765  0.827982 -1.855424
4   -0.578451 -0.949705 -0.582701 -1.725697
..        ...       ...       ...       ...
995  0.494311  0.528862 -0.191097  0.118121
996 -0.582154  1.251247 -1.622055 -0.436563
997  0.687732 -1.670059 -0.272708 -0.369290
998 -0.443230  0.984728 -0.283506 -1.473420
999 -0.276277 -0.597256  1.269391 -0.704337

1000 rows × 4 columns
np.sign(data).head()
     0    1    2    3
0  1.0  1.0 -1.0 -1.0
1 -1.0 -1.0 -1.0  1.0
2  1.0  1.0 -1.0 -1.0
3  1.0 -1.0  1.0 -1.0
4 -1.0 -1.0 -1.0 -1.0
7.2.7 置换和随机抽样
使用numpy.random.permutation对DataFrame中的Series或行进行置换(随机重排序)是非常方便的。
sampler = np.random.permutation(5)
sampler
array([3, 2, 0, 4, 1])
df = pd.DataFrame(np.arange(5 * 4).reshape(5, 4))
df
    0   1   2   3
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
4  16  17  18  19
df.take(sampler)
    0   1   2   3
3  12  13  14  15
2   8   9  10  11
0   0   1   2   3
4  16  17  18  19
1   4   5   6   7
df.sample(n=3)
    0   1   2   3
0   0   1   2   3
4  16  17  18  19
3  12  13  14  15
choices = pd.Series([5, 6, -1, 6, 4])
draws = choices.sample(n=10, replace=True)
draws
2 -1
0 5
2 -1
3 6
0 5
1 6
1 6
4 4
3 6
1 6
dtype: int64
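take配合axis=1也可以对列做同样的随机重排,下面是一个基于前文df(5行4列)的简单示意:

col_sampler = np.random.permutation(4)  # 对4个列位置做随机排列
df.take(col_sampler, axis=1)            # 按该顺序重排各列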
7.2.8 计算指标/虚拟变量
将分类变量转换为"虚拟"或"指标"矩阵是另一种用于统计建模或机器学习的转换操作。pandas的get_dummies函数用于实现该功能。
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})
df
  key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   b      5
pd.get_dummies(df['key'])
   a  b  c
0  0  1  0
1  0  1  0
2  1  0  0
3  0  0  1
4  1  0  0
5  0  1  0
dummies = pd.get_dummies(df['key'], prefix='key')
df_with_dummy = df[['data1']].join(dummies)
df_with_dummy
   data1  key_a  key_b  key_c
0      0      0      1      0
1      1      0      1      0
2      2      1      0      0
3      3      0      0      1
4      4      1      0      0
5      5      0      1      0
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table(r'D:\PythonFlie\python\利用python进行数据分析(书籍笔记)\pydata-book-2nd-edition\datasets\movielens\movies.dat',
                       sep='::', header=None, names=mnames)
<ipython-input-188-960ac40c2eea>:2: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
movies[::10]
      movie_id                                              title                       genres
0            1                                   Toy Story (1995)  Animation|Children's|Comedy
10          11                     American President, The (1995)         Comedy|Drama|Romance
20          21                                  Get Shorty (1995)          Action|Comedy|Drama
30          31                             Dangerous Minds (1995)                        Drama
40          41                                 Richard III (1995)                    Drama|War
...        ...                                                ...                          ...
3840      3910                          Dancer in the Dark (2000)                Drama|Musical
3850      3920  Faraway, So Close (In Weiter Ferne, So Nah!) (...                Drama|Fantasy
3860      3930         Creature From the Black Lagoon, The (1954)                       Horror
3870      3940            Slumber Party Massacre III, The (1990)                        Horror
3880      3950                                   Tigerland (2000)                        Drama
389 rows × 3 columns
all_genres = []
for x in movies.genres:
    all_genres.extend(x.split('|'))
genres = pd.unique(all_genres)
genres
array(['Animation', "Children's", 'Comedy', 'Adventure', 'Fantasy',
'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',
'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir',
'Western'], dtype=object)
zero_matrix = np.zeros((len(movies), len(genres)))
dummies = pd.DataFrame(zero_matrix, columns=genres)
zero_matrix
array([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]])
dummies
      Animation  Children's  Comedy  ...  Mystery  Film-Noir  Western
0           0.0         0.0     0.0  ...      0.0        0.0      0.0
1           0.0         0.0     0.0  ...      0.0        0.0      0.0
2           0.0         0.0     0.0  ...      0.0        0.0      0.0
3           0.0         0.0     0.0  ...      0.0        0.0      0.0
4           0.0         0.0     0.0  ...      0.0        0.0      0.0
...         ...         ...     ...  ...      ...        ...      ...
3878        0.0         0.0     0.0  ...      0.0        0.0      0.0
3879        0.0         0.0     0.0  ...      0.0        0.0      0.0
3880        0.0         0.0     0.0  ...      0.0        0.0      0.0
3881        0.0         0.0     0.0  ...      0.0        0.0      0.0
3882        0.0         0.0     0.0  ...      0.0        0.0      0.0
3883 rows × 18 columns
gen = movies.genres[0]
gen.split('|')
['Animation', "Children's", 'Comedy']
dummies.columns.get_indexer(gen.split('|'))
array([0, 1, 2], dtype=int64)
for i, gen in enumerate(movies.genres):
    indices = dummies.columns.get_indexer(gen.split('|'))
    dummies.iloc[i, indices] = 1

movies_windic = movies.join(dummies.add_prefix('Genre_'))
movies_windic.iloc[0]
movie_id 1
title Toy Story (1995)
genres Animation|Children's|Comedy
Genre_Animation 1.0
Genre_Children's 1.0
Genre_Comedy 1.0
Genre_Adventure 0.0
Genre_Fantasy 0.0
Genre_Romance 0.0
Genre_Drama 0.0
Genre_Action 0.0
Genre_Crime 0.0
Genre_Thriller 0.0
Genre_Horror 0.0
Genre_Sci-Fi 0.0
Genre_Documentary 0.0
Genre_War 0.0
Genre_Musical 0.0
Genre_Mystery 0.0
Genre_Film-Noir 0.0
Genre_Western 0.0
Name: 0, dtype: object
np.random.seed(12345)
values = np.random.rand(10)
values
array([0.92961609, 0.31637555, 0.18391881, 0.20456028, 0.56772503,
0.5955447 , 0.96451452, 0.6531771 , 0.74890664, 0.65356987])
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]
pd.get_dummies(pd.cut(values, bins))
   (0.0, 0.2]  (0.2, 0.4]  (0.4, 0.6]  (0.6, 0.8]  (0.8, 1.0]
0           0           0           0           0           1
1           0           1           0           0           0
2           1           0           0           0           0
3           0           1           0           0           0
4           0           0           1           0           0
5           0           0           1           0           0
6           0           0           0           0           1
7           0           0           0           1           0
8           0           0           0           1           0
9           0           0           0           1           0
7.3 字符串操作
7.3.1 字符串对象方法
val = 'a,b, guido'
val.split(',')
['a', 'b', ' guido']
pieces = [x.strip() for x in val.split(',')]
pieces
['a', 'b', 'guido']
first, second, third = pieces
first + '::' + second + '::' + third
'a::b::guido'
'::'.join(pieces)
'a::b::guido'
'guido' in val
True
val.index('guido')
5
val.find('guido')
5
val.count('guido')
1
val.replace(',', '::')
'a::b:: guido'
val.replace(',', '')
'ab guido'
字符串内建方法:
- count:返回子字符串在字符串中的非重叠出现次数
- endswith:如果字符串以后缀结尾则返回True
- startswith:如果字符串以前缀开始则返回True
- join:使用该字符串作为间隔符,粘合其他字符串的序列
- index:如果在字符串中找到子字符串,则返回其第一个字符的位置;如果找不到则引发ValueError
- find:返回子字符串在字符串中第一次出现时第一个字符的位置;类似index,但如果没有找到则返回-1
- rfind:返回子字符串在字符串中最后一次出现时第一个字符的位置;如果没有找到则返回-1
- replace:使用一个字符串替代另一个字符串
- strip、rstrip、lstrip:修剪空白(包括换行符),相当于对每个元素进行x.strip()(以及rstrip、lstrip)
- split:使用分隔符将字符串拆分为子字符串的列表
- lower:将大写字母转换为小写字母
- upper:将小写字母转换为大写字母
- casefold:将字符转换为小写,并将任何特定于区域的可变字符组合转换为常见的可比较形式
- ljust、rjust:左对齐或右对齐;用空格(或其他字符)填充字符串的相反侧,返回具有最小宽度的字符串
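表中的casefold、ljust/rjust等方法上文没有出现,这里补充一个简单示意(示例字符串为笔者虚构):

'Straße'.casefold()        # 'strasse',比lower()更激进的小写化
'py'.ljust(6, '-')         # 'py----'
'py'.rjust(6, '-')         # '----py'
'python'.startswith('py')  # True
'python'.endswith('on')    # True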
7.3.2 正则表达式
re模块主要有三个主题:模式匹配、替代、拆分。描述一个或多个空白字符的正则表达式是\s+。如果需要将相同的表达式应用到多个字符串上,推荐使用re.compile创建一个正则表达式对象,这样可以节约CPU周期。为了避免正则表达式中转义符\的影响,可以使用原生字符串语法,比如r'C:\x'(等价于'C:\\x')。正则表达式方法如下。
- findall:将字符串中所有的非重叠匹配模式以列表形式返回
- finditer:与findall类似,但返回的是迭代器
- match:在字符串起始位置匹配模式,也可以将模式组件匹配到分组中;如果模式匹配上了,返回一个匹配对象,否则返回None
- search:扫描字符串以匹配模式;与match不同的是,search的匹配可以出现在字符串的任意位置,而不仅仅是起始位置
- split:根据模式,将字符串拆分为多个部分
- sub、subn:用替换表达式替换字符串中所有的匹配(sub)或前n个出现的匹配(subn);可以使用符号\1、\2…来引用替换字符串中的匹配组元素
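表中的finditer和subn在下文没有用到,这里补充一个简单示意:

import re

# finditer返回迭代器,每个元素都是一个匹配对象
for m in re.finditer(r'\s+', 'foo bar\t baz \tqux'):
    print(m.span(), repr(m.group()))
# subn返回(替换后的字符串, 替换次数)的元组
re.subn(r'\s+', ' ', 'foo bar\t baz \tqux')  # ('foo bar baz qux', 3)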
import re

text = 'foo bar\t baz \tqux'
re.split(r'\s+', text)
['foo', 'bar', 'baz', 'qux']
regex = re.compile(r'\s+')
regex.split(text)
['foo', 'bar', 'baz', 'qux']
regex.findall(text)
[' ', '\t ', ' \t']
[' ', '\t ', ' \t']
text = """
Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""
pattern = r'[A-Z0-9.%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
regex = re.compile(pattern, flags=re.IGNORECASE)
regex.findall(text)
['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']
m = regex.search(text)
m
<re.Match object; span=(6, 21), match='dave@google.com'>
text[m.start():m.end()]
'dave@google.com'
print(regex.match(text))
None
print(regex.sub('ABC', text))
Dave ABC
Steve ABC
Rob ABC
Ryan ABC
pattern = r'([A-Z0-9.%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)
m = regex.match('wesm@bright.net')
m.groups()
('wesm', 'bright', 'net')
regex.findall(text)
[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]
print(regex.sub(r'Username:\1,Domain:\2,Suffix:\3', text))
Dave Username:dave,Domain:google,Suffix:com
Steve Username:steve,Domain:gmail,Suffix:com
Rob Username:rob,Domain:gmail,Suffix:com
Ryan Username:ryan,Domain:yahoo,Suffix:com
7.3.3 pandas中的向量化字符串函数
pandas向量化字符串方法(部分):
- cat:根据可选的分隔符按元素黏合字符串
- contains:返回是否含有某个模式/正则表达式的布尔值数组
- count:模式出现次数的计数
- extract:使用带分组的正则表达式从字符串Series中分组抽取一个或多个字符串;返回的结果是每个分组形成一列的DataFrame
- endswith:等价于对每个元素使用x.endswith(模式)
data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
        'Rob': 'rob@gmail.com', 'Ryan': np.nan}
data = pd.Series(data)
data
Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan NaN
dtype: object
data.isnull()
Dave False
Steve False
Rob False
Ryan True
dtype: bool
data.str.contains('gmail')
Dave False
Steve True
Rob True
Ryan NaN
dtype: object
data.str.findall(pattern, flags=re.IGNORECASE)
Dave [(dave, google, com)]
Steve [(steve, gmail, com)]
Rob [(rob, gmail, com)]
Ryan NaN
dtype: object
matches = data.str.match(pattern, flags=re.IGNORECASE)
matches
Dave True
Steve True
Rob True
Ryan NaN
dtype: object
matches.str.get(1)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-245-8d76f9329d2a> in <module>
1 #要访问嵌入式列表中的元素,我们可以将索引传递给这些函数中的任意一个
----> 2 matches.str.get(1)
D:\Anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
5459 or name in self._accessors
5460 ):
-> 5461 return object.__getattribute__(self, name)
5462 else:
5463 if self._info_axis._can_hold_identifiers_and_holds_name(name):
D:\Anaconda3\lib\site-packages\pandas\core\accessor.py in __get__(self, obj, cls)
178 # we're accessing the attribute of the class, i.e., Dataset.geo
179 return self._accessor
--> 180 accessor_obj = self._accessor(obj)
181 # Replace the property with the accessor object. Inspired by:
182 # https://www.pydanny.com/cached-property.html
D:\Anaconda3\lib\site-packages\pandas\core\strings\accessor.py in __init__(self, data)
152 from pandas.core.arrays.string_ import StringDtype
153
--> 154 self._inferred_dtype = self._validate(data)
155 self._is_categorical = is_categorical_dtype(data.dtype)
156 self._is_string = isinstance(data.dtype, StringDtype)
D:\Anaconda3\lib\site-packages\pandas\core\strings\accessor.py in _validate(data)
215
216 if inferred_dtype not in allowed_types:
--> 217 raise AttributeError("Can only use .str accessor with string values!")
218 return inferred_dtype
219
AttributeError: Can only use .str accessor with string values!
matches.str[0]
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-246-10bdd22fd8b2> in <module>
----> 1 matches.str[0]
D:\Anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
5459 or name in self._accessors
5460 ):
-> 5461 return object.__getattribute__(self, name)
5462 else:
5463 if self._info_axis._can_hold_identifiers_and_holds_name(name):
D:\Anaconda3\lib\site-packages\pandas\core\accessor.py in __get__(self, obj, cls)
178 # we're accessing the attribute of the class, i.e., Dataset.geo
179 return self._accessor
--> 180 accessor_obj = self._accessor(obj)
181 # Replace the property with the accessor object. Inspired by:
182 # https://www.pydanny.com/cached-property.html
D:\Anaconda3\lib\site-packages\pandas\core\strings\accessor.py in __init__(self, data)
152 from pandas.core.arrays.string_ import StringDtype
153
--> 154 self._inferred_dtype = self._validate(data)
155 self._is_categorical = is_categorical_dtype(data.dtype)
156 self._is_string = isinstance(data.dtype, StringDtype)
D:\Anaconda3\lib\site-packages\pandas\core\strings\accessor.py in _validate(data)
215
216 if inferred_dtype not in allowed_types:
--> 217 raise AttributeError("Can only use .str accessor with string values!")
218 return inferred_dtype
219
AttributeError: Can only use .str accessor with string values!
data.str[:5]
Dave dave@
Steve steve
Rob rob@g
Ryan NaN
dtype: object
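上面表格中的extract方法没有演示:它用带分组的正则表达式从Series的每个字符串中抽取子串,每个分组形成结果DataFrame的一列。下面沿用上文带三个分组的pattern做一个示意:

data.str.extract(pattern, flags=re.IGNORECASE)
# 结果大致为三列:用户名、域名、后缀,Ryan一行为NaN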
第八章 数据规整:连接、联合与重塑
8.1 分层索引
data = pd.Series(np.random.randn(9),
                 index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 3, 1, 2, 2, 3]])
data
a 1 1.007189
2 -1.296221
3 0.274992
b 1 0.228913
3 1.352917
c 1 0.886429
2 -2.001637
d 2 -0.371843
3 1.669025
dtype: float64
data.index
MultiIndex([('a', 1),
('a', 2),
('a', 3),
('b', 1),
('b', 3),
('c', 1),
('c', 2),
('d', 2),
('d', 3)],
)
data['b']
1 0.228913
3 1.352917
dtype: float64
data['b':'c']
b 1 0.228913
3 1.352917
c 1 0.886429
2 -2.001637
dtype: float64
data.loc[['b', 'c']]
b 1 0.228913
3 1.352917
c 1 0.886429
2 -2.001637
dtype: float64
data.loc[:, 3]
a 0.274992
b 1.352917
d 1.669025
dtype: float64
data.unstack()
          1         2         3
a  1.007189 -1.296221  0.274992
b  0.228913       NaN  1.352917
c  0.886429 -2.001637       NaN
d       NaN -0.371843  1.669025
data.unstack().stack()
a 1 1.007189
2 -1.296221
3 0.274992
b 1 0.228913
3 1.352917
c 1 0.886429
2 -2.001637
d 2 -0.371843
3 1.669025
dtype: float64
frame = pd.DataFrame(np.arange(12).reshape(4, 3),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['ohio', 'ohio', 'colorado'],
                              ['green', 'red', 'green']])
frame
      ohio     colorado
     green red    green
a 1      0   1        2
  2      3   4        5
b 1      6   7        8
  2      9  10       11
frame.index.names = ['key1', 'key2']
frame
           ohio     colorado
          green red    green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11
frame.columns.names = ['state', 'color']
frame
state      ohio     colorado
color     green red    green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11
frame['ohio']
color      green  red
key1 key2
a    1         0    1
     2         3    4
b    1         6    7
     2         9   10
pd.MultiIndex.from_arrays([['ohio', 'ohio', 'colorado'],
                           ['green', 'red', 'green']],
                          names=['state', 'color'])
MultiIndex([( 'ohio', 'green'),
( 'ohio', 'red'),
('colorado', 'green')],
names=['state', 'color'])
8.1.1 重排序和层级排序
frame.swaplevel('key1', 'key2')
state      ohio     colorado
color     green red    green
key2 key1
1    a        0   1        2
2    a        3   4        5
1    b        6   7        8
2    b        9  10       11
frame.sort_index(level=1)
state      ohio     colorado
color     green red    green
key1 key2
a    1        0   1        2
b    1        6   7        8
a    2        3   4        5
b    2        9  10       11
frame.sort_index(level=0)
state      ohio     colorado
color     green red    green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11
frame.swaplevel(0, 1).sort_index(level=0)
state      ohio     colorado
color     green red    green
key2 key1
1    a        0   1        2
     b        6   7        8
2    a        3   4        5
     b        9  10       11
8.1.2 按层级进行汇总统计
DataFrame和Series中很多描述性和汇总性统计有一个level选项,通过level选项你可以指定你想要在某个特定的轴上进行聚合
frame.sum(level='key2')
state  ohio     colorado
color green red    green
key2
1         6   8       10
2        12  14       16
frame.sum(level='color', axis=1)
color      green  red
key1 key2
a    1         2    1
     2         8    4
b    1        14    7
     2        20   10
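需要注意,较新版本的pandas已不再推荐在sum等聚合方法中直接使用level参数,等价的写法是先按层级groupby再聚合(以下仅作示意,请以所用版本的API为准):

frame.groupby(level='key2').sum()           # 等价于frame.sum(level='key2')
frame.groupby(level='color', axis=1).sum()  # 等价于frame.sum(level='color', axis=1)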
8.1.3 使用DataFrame的列进行索引
我们经常需要将DataFrame中的一个或多个列作为行索引来使用;反过来,也可能想把行索引移动到DataFrame的列中。how参数的不同连接类型如下:
- 'inner':只对两张表都有的键的交集进行联合
- 'left':对所有左表的键进行联合
- 'right':对所有右表的键进行联合
- 'outer':对两张表都有的键的并集进行联合
frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                      'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                      'd': [0, 1, 2, 0, 1, 2, 3]})
frame
   a  b    c  d
0  0  7  one  0
1  1  6  one  1
2  2  5  one  2
3  3  4  two  0
4  4  3  two  1
5  5  2  two  2
6  6  1  two  3
frame2 = frame.set_index(['c', 'd'])
frame2
       a  b
c   d
one 0  0  7
    1  1  6
    2  2  5
two 0  3  4
    1  4  3
    2  5  2
    3  6  1
frame.set_index(['c', 'd'], drop=False)
       a  b    c  d
c   d
one 0  0  7  one  0
    1  1  6  one  1
    2  2  5  one  2
two 0  3  4  two  0
    1  4  3  two  1
    2  5  2  two  2
    3  6  1  two  3
frame2.reset_index()
     c  d  a  b
0  one  0  0  7
1  one  1  1  6
2  one  2  2  5
3  two  0  3  4
4  two  1  4  3
5  two  2  5  2
6  two  3  6  1
8.2 联合与合并数据集
pandas.merge根据一个或多个键将行进行连接。对于SQL或其他关系型数据库的用户来说,这种方式比较熟悉,它实现的是数据库的连接操作。 pandas.concat使对象在轴向上进行黏合或“堆叠”。 combine_first实例方法允许将重叠的数据拼接在一起,以使用一个对象中的值填充另一个对象中的缺失值。
8.2.1 数据库风格的DataFrame连接
合并或连接操作通过一个或多个键连接行来联合数据集。merge函数参数如下:
- left:合并时操作中左边的DataFrame
- right:合并时操作中右边的DataFrame
- how:'inner'、'outer'、'left'、'right'之一,默认是'inner'
- on:需要连接的列名;必须是两边DataFrame对象都有的列名,并以left和right中列名的交集作为连接键
- left_on:left DataFrame中用作连接键的列
- right_on:right DataFrame中用作连接键的列
- left_index:使用left的行索引作为它的连接键(如果是MultiIndex,则是多个键)
- right_index:使用right的行索引作为它的连接键(如果是MultiIndex,则是多个键)
- sort:通过连接键按字母顺序对合并的数据进行排序,默认为True(在大数据集上某些情况下禁用该功能可以获得更好的性能)
- suffixes:在列名重叠时,添加到列名后的字符串元组,默认是('_x', '_y')(例如如果待合并的DataFrame中都含有'data'列,那么结果中会出现'data_x'、'data_y')
- copy:如果为False,则在某些特殊情况下避免将数据复制到结果数据结构中;默认情况下总是复制
- indicator:添加一个特殊的列_merge,指示每一行的来源;值将根据每行中连接数据的来源分别为'left_only'、'right_only'或'both'
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df1
  key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   a      5
6   b      6
df2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                    'data2': range(3)})
df2
pd.merge(df1, df2)
  key  data1  data2
0   b      0      1
1   b      1      1
2   b      6      1
3   a      2      0
4   a      4      0
5   a      5      0
pd.merge(df1, df2, on='key')
  key  data1  data2
0   b      0      1
1   b      1      1
2   b      6      1
3   a      2      0
4   a      4      0
5   a      5      0
df3 = pd.DataFrame({'Lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df3
  Lkey  data1
0    b      0
1    b      1
2    a      2
3    c      3
4    a      4
5    a      5
6    b      6
df4 = pd.DataFrame({'Rkey': ['a', 'b', 'd'],
                    'data2': range(3)})
df4
pd.merge(df3, df4, left_on='Lkey', right_on='Rkey')
  Lkey  data1 Rkey  data2
0    b      0    b      1
1    b      1    b      1
2    b      6    b      1
3    a      2    a      0
4    a      4    a      0
5    a      5    a      0
pd.merge(df1, df2, how='outer')
  key  data1  data2
0   b    0.0    1.0
1   b    1.0    1.0
2   b    6.0    1.0
3   a    2.0    0.0
4   a    4.0    0.0
5   a    5.0    0.0
6   c    3.0    NaN
7   d    NaN    2.0
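上文参数表中的indicator在示例里没有用到,这里补充一个简单示意:它会额外生成_merge列,标记每一行来自左表、右表还是两者:

pd.merge(df1, df2, how='outer', indicator=True)
# _merge列的取值为'left_only'、'right_only'或'both'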
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                    'data1': range(6)})
df1
  key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   b      5
df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'],
                    'data2': range(5)})
df2
pd.merge(df1, df2, on='key', how='left')
   key  data1  data2
0    b      0    1.0
1    b      0    3.0
2    b      1    1.0
3    b      1    3.0
4    a      2    0.0
5    a      2    2.0
6    c      3    NaN
7    a      4    0.0
8    a      4    2.0
9    b      5    1.0
10   b      5    3.0
pd.merge(df1, df2, on='key', how='inner')
  key  data1  data2
0   b      0      1
1   b      0      3
2   b      1      1
3   b      1      3
4   b      5      1
5   b      5      3
6   a      2      0
7   a      2      2
8   a      4      0
9   a      4      2
left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],
                     'key2': ['one', 'two', 'one'],
                     'lval': [1, 2, 3]})
left
  key1 key2  lval
0  foo  one     1
1  foo  two     2
2  bar  one     3
right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
                      'key2': ['one', 'one', 'one', 'two'],
                      'rval': [4, 5, 6, 7]})
right
  key1 key2  rval
0  foo  one     4
1  foo  one     5
2  bar  one     6
3  bar  two     7
pd.merge(left, right, on=['key1', 'key2'], how='outer')
  key1 key2  lval  rval
0  foo  one   1.0   4.0
1  foo  one   1.0   5.0
2  foo  two   2.0   NaN
3  bar  one   3.0   6.0
4  bar  two   NaN   7.0
pd.merge(left, right, on=['key1'])
  key1 key2_x  lval key2_y  rval
0  foo    one     1    one     4
1  foo    one     1    one     5
2  foo    two     2    one     4
3  foo    two     2    one     5
4  bar    one     3    one     6
5  bar    one     3    two     7
pd.merge(left, right, on=['key1'], suffixes=('_left', '_right'))
  key1 key2_left  lval key2_right  rval
0  foo       one     1        one     4
1  foo       one     1        one     5
2  foo       two     2        one     4
3  foo       two     2        one     5
4  bar       one     3        one     6
5  bar       one     3        two     7
8.2.2 根据索引合并
在某些情况下,DataFrame中用于合并的键是它的索引。在这种情况下,你可以传递left_index=True或right_index=True(或者都传)来表示索引需要用来作为合并的键
left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'],
                      'value': range(6)})
left1
  key  value
0   a      0
1   b      1
2   a      2
3   a      3
4   b      4
5   c      5
right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])
right1
pd.merge(left1, right1, left_on='key', right_index=True)
  key  value  group_val
0   a      0        3.5
2   a      2        3.5
3   a      3        3.5
1   b      1        7.0
4   b      4        7.0
pd.merge(left1, right1, left_on='key', right_index=True, how='outer')
  key  value  group_val
0   a      0        3.5
2   a      2        3.5
3   a      3        3.5
1   b      1        7.0
4   b      4        7.0
5   c      5        NaN
lefth = pd.DataFrame({'key1': ['ohio', 'ohio', 'ohio', 'Nevada', 'Nevada'],
                      'key2': [2000, 2001, 2002, 2001, 2002],
                      'data': np.arange(5.)})
lefth
     key1  key2  data
0    ohio  2000   0.0
1    ohio  2001   1.0
2    ohio  2002   2.0
3  Nevada  2001   3.0
4  Nevada  2002   4.0
righth = pd.DataFrame(np.arange(12).reshape(6, 2),
                      index=[['nevada', 'nevada', 'ohio', 'ohio', 'ohio', 'ohio'],
                             [2001, 2000, 2000, 2000, 2001, 2002]],
                      columns=['event1', 'event2'])
righth
             event1  event2
nevada 2001       0       1
       2000       2       3
ohio   2000       4       5
       2000       6       7
       2001       8       9
       2002      10      11
pd.merge(lefth, righth, left_on=['key1', 'key2'], right_index=True, how='outer')
     key1  key2  data  event1  event2
0    ohio  2000   0.0     4.0     5.0
0    ohio  2000   0.0     6.0     7.0
1    ohio  2001   1.0     8.0     9.0
2    ohio  2002   2.0    10.0    11.0
3  Nevada  2001   3.0     NaN     NaN
4  Nevada  2002   4.0     NaN     NaN
4  nevada  2001   NaN     0.0     1.0
4  nevada  2000   NaN     2.0     3.0
left2 = pd.DataFrame([[1, 2], [3, 4], [5, 6]],
                     index=['a', 'c', 'e'],
                     columns=['ohio', 'nevada'])
left2
right2 = pd.DataFrame([[7, 8], [9, 10], [11, 12], [13, 14]],
                      index=['b', 'c', 'd', 'e'],
                      columns=['missouri', 'alabama'])
right2
   missouri  alabama
b         7        8
c         9       10
d        11       12
e        13       14
pd.merge(left2, right2, how='outer', left_index=True, right_index=True)
   ohio  nevada  missouri  alabama
a   1.0     2.0       NaN      NaN
b   NaN     NaN       7.0      8.0
c   3.0     4.0       9.0     10.0
d   NaN     NaN      11.0     12.0
e   5.0     6.0      13.0     14.0
left2.join(right2, how='outer')
   ohio  nevada  missouri  alabama
a   1.0     2.0       NaN      NaN
b   NaN     NaN       7.0      8.0
c   3.0     4.0       9.0     10.0
d   NaN     NaN      11.0     12.0
e   5.0     6.0      13.0     14.0
left1.join(right1, on='key')
  key  value  group_val
0   a      0        3.5
1   b      1        7.0
2   a      2        3.5
3   a      3        3.5
4   b      4        7.0
5   c      5        NaN
another = pd.DataFrame([[7, 8], [9, 10], [11, 12], [16, 17]],
                       index=['a', 'c', 'e', 'f'],
                       columns=['new york', 'oregon'])
another
   new york  oregon
a         7       8
c         9      10
e        11      12
f        16      17
left2.join([right2, another])
   ohio  nevada  missouri  alabama  new york  oregon
a   1.0     2.0       NaN      NaN       7.0     8.0
c   3.0     4.0       9.0     10.0       9.0    10.0
e   5.0     6.0      13.0     14.0      11.0    12.0
left2.join([right2, another], how='outer')
   ohio  nevada  missouri  alabama  new york  oregon
a   1.0     2.0       NaN      NaN       7.0     8.0
c   3.0     4.0       9.0     10.0       9.0    10.0
e   5.0     6.0      13.0     14.0      11.0    12.0
b   NaN     NaN       7.0      8.0       NaN     NaN
d   NaN     NaN      11.0     12.0       NaN     NaN
f   NaN     NaN       NaN      NaN      16.0    17.0
8.2.3 沿轴向连接
另一种数据组合操作可互换地称为拼接、绑定或堆叠,NumPy的concatenate函数可以在NumPy数组上实现该功能。concat函数的参数如下:
- objs:需要连接的pandas对象列表或字典,这是必选参数
- axis:连接的轴向,默认是0(沿着行方向)
- join:可以是'inner'或'outer'(默认是'outer'),用于指定连接方式是内连接(inner)还是外连接(outer)
- join_axes:用于指定其他n-1轴的特定索引,可以替代内/外连接的逻辑
- keys:与要连接的对象关联的值,沿着连接轴形成分层索引;可以是任意值的列表或数组,也可以是元组的数组,也可以是数组的列表(如果向levels参数传入多层数组)
- levels:在传入keys时,该参数用于指定多层索引的层级
arr = np.arange(12).reshape(3, 4)
arr
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
np.concatenate([arr, arr], axis=1)
array([[ 0, 1, 2, 3, 0, 1, 2, 3],
[ 4, 5, 6, 7, 4, 5, 6, 7],
[ 8, 9, 10, 11, 8, 9, 10, 11]])
s1 = pd.Series([0, 1], index=['a', 'b'])
s1
a 0
b 1
dtype: int64
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
s2
c 2
d 3
e 4
dtype: int64
s3 = pd.Series([5, 6], index=['f', 'g'])
s3
f 5
g 6
dtype: int64
pd.concat([s1, s2, s3])
a 0
b 1
c 2
d 3
e 4
f 5
g 6
dtype: int64
pd.concat([s1, s2, s3], axis=1)
     0    1    2
a  0.0  NaN  NaN
b  1.0  NaN  NaN
c  NaN  2.0  NaN
d  NaN  3.0  NaN
e  NaN  4.0  NaN
f  NaN  NaN  5.0
g  NaN  NaN  6.0
s4 = pd.concat([s1, s3])
s4
a 0
b 1
f 5
g 6
dtype: int64
pd.concat([s1, s4], axis=1)
pd.concat([s1, s4], axis=1, join='inner')
pd.concat([s1, s4], axis=1, join_axes=[['a', 'c', 'b', 'e']])
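需要注意,join_axes参数在较新版本的pandas中已被移除;如果想指定结果的行索引,可以先concat再reindex,效果相同:

pd.concat([s1, s4], axis=1).reindex(['a', 'c', 'b', 'e'])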
result = pd.concat([s1, s2, s3], keys=['one', 'two', 'three'])
result
one a 0
b 1
two c 2
d 3
e 4
three f 5
g 6
dtype: int64
result.unstack()
         a    b    c    d    e    f    g
one    0.0  1.0  NaN  NaN  NaN  NaN  NaN
two    NaN  NaN  2.0  3.0  4.0  NaN  NaN
three  NaN  NaN  NaN  NaN  NaN  5.0  6.0
pd.concat([s1, s2, s3], axis=1, keys=['one', 'two', 'three'])
   one  two  three
a  0.0  NaN    NaN
b  1.0  NaN    NaN
c  NaN  2.0    NaN
d  NaN  3.0    NaN
e  NaN  4.0    NaN
f  NaN  NaN    5.0
g  NaN  NaN    6.0
df1 = pd.DataFrame(np.arange(6).reshape(3, 2),
                   index=['a', 'b', 'c'],
                   columns=['one', 'two'])
df1
df2 = pd.DataFrame(np.arange(4).reshape(2, 2) + 5,
                   index=['a', 'c'],
                   columns=['three', 'four'])
df2
pd.concat([df1, df2], axis=1, keys=['level1', 'level2'])
  level1     level2
     one two  three four
a      0   1    5.0  6.0
b      2   3    NaN  NaN
c      4   5    7.0  8.0
pd.concat({'level1': df1, 'level2': df2}, axis=1)
  level1     level2
     one two  three four
a      0   1    5.0  6.0
b      2   3    NaN  NaN
c      4   5    7.0  8.0
pd.concat([df1, df2], axis=1, keys=['level1', 'level2'], names=['upper', 'lower'])
upper level1     level2
lower    one two  three four
a          0   1    5.0  6.0
b          2   3    NaN  NaN
c          4   5    7.0  8.0
df1 = pd.DataFrame(np.random.randn(3, 4), columns=['a', 'b', 'c', 'd'])
df1
          a         b         c         d
0 -1.119593  1.953114 -1.514807 -1.054782
1  0.543393  1.172903  0.945829  0.656643
2  1.012695  1.481920 -0.413033 -1.280521
df2 = pd.DataFrame(np.random.randn(2, 3), columns=['b', 'd', 'a'])
df2
          b         d         a
0  1.638046 -0.850112  1.895532
1 -1.175952  1.370474 -0.992356
pd.concat([df1, df2], ignore_index=True)
          a         b         c         d
0 -1.119593  1.953114 -1.514807 -1.054782
1  0.543393  1.172903  0.945829  0.656643
2  1.012695  1.481920 -0.413033 -1.280521
3  1.895532  1.638046       NaN -0.850112
4 -0.992356 -1.175952       NaN  1.370474
8.2.4 联合重叠数据
a = pd.Series([np.nan, 2.5, 0, 3.5, 4.5, np.nan],
              index=['f', 'e', 'd', 'c', 'b', 'a'])
a
f NaN
e 2.5
d 0.0
c 3.5
b 4.5
a NaN
dtype: float64
b = pd.Series([0, np.nan, 2, np.nan, np.nan, 5],
              index=['a', 'b', 'c', 'd', 'e', 'f'])
b
a 0.0
b NaN
c 2.0
d NaN
e NaN
f 5.0
dtype: float64
np.where(pd.isnull(a), b, a)
array([0. , 2.5, 0. , 3.5, 4.5, 5. ])
b.combine_first(a)
a 0.0
b 4.5
c 2.0
d 0.0
e 2.5
f 5.0
dtype: float64
df1 = pd.DataFrame({'a': [1, np.nan, 5, np.nan],
                    'b': [np.nan, 2, np.nan, 6],
                    'c': range(2, 18, 4)})
df1
     a    b   c
0  1.0  NaN   2
1  NaN  2.0   6
2  5.0  NaN  10
3  NaN  6.0  14
df2 = pd.DataFrame({'a': [5, 4, np.nan, 3, 7],
                    'b': [np.nan, 3, 4, 6, 8]})
df2
     a    b
0  5.0  NaN
1  4.0  3.0
2  NaN  4.0
3  3.0  6.0
4  7.0  8.0
df1.combine_first(df2)
     a    b     c
0  1.0  NaN   2.0
1  4.0  2.0   6.0
2  5.0  4.0  10.0
3  3.0  6.0  14.0
4  7.0  8.0   NaN
8.3 重塑和透视
重排列表格型数据有多种基础操作。这些操作被称为重塑或透视。
8.3.1 使用多层索引进行重塑
stack(堆叠):该操作会"旋转"或将列中的数据透视到行。unstack(拆堆):该操作会将行中的数据透视到列。
data = pd.DataFrame(np.arange(6).reshape(2, 3),
                    index=pd.Index(['ohio', 'colorado'], name='state'),
                    columns=pd.Index(['one', 'two', 'three'], name='number'))
data
number    one  two  three
state
ohio        0    1      2
colorado    3    4      5
result = data.stack()
result
state number
ohio one 0
two 1
three 2
colorado one 3
two 4
three 5
dtype: int32
result.unstack()
number    one  two  three
state
ohio        0    1      2
colorado    3    4      5
result.unstack(0)
state   ohio  colorado
number
one        0         3
two        1         4
three      2         5
result.unstack('state')
state   ohio  colorado
number
one        0         3
two        1         4
three      2         5
s1 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])
s1
a 0
b 1
c 2
d 3
dtype: int64
s2 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])
s2
c 4
d 5
e 6
dtype: int64
data = pd.concat([s1, s2], keys=['one', 'two'])
data
one a 0
b 1
c 2
d 3
two c 4
d 5
e 6
dtype: int64
data.unstack()
       a    b    c    d    e
one  0.0  1.0  2.0  3.0  NaN
two  NaN  NaN  4.0  5.0  6.0
data.unstack().stack()
one a 0.0
b 1.0
c 2.0
d 3.0
two c 4.0
d 5.0
e 6.0
dtype: float64
data.unstack().stack(dropna=False)
one a 0.0
b 1.0
c 2.0
d 3.0
e NaN
two a NaN
b NaN
c 4.0
d 5.0
e 6.0
dtype: float64
df = pd.DataFrame({'left': result, 'right': result + 5},
                  columns=pd.Index(['left', 'right'], name='side'))
df
side             left  right
state    number
ohio     one        0      5
         two        1      6
         three      2      7
colorado one        3      8
         two        4      9
         three      5     10
df.unstack('state')
side   left          right
state  ohio colorado  ohio colorado
number
one       0        3     5        8
two       1        4     6        9
three     2        5     7       10
df.unstack('state').stack('side')
state         colorado  ohio
number side
one    left          3     0
       right         8     5
two    left          4     1
       right         9     6
three  left          5     2
       right        10     7
df.unstack('state').stack()
side            left  right
number state
one    ohio        0      5
       colorado    3      8
two    ohio        1      6
       colorado    4      9
three  ohio        2      7
       colorado    5     10
8.3.2 将“长”透视为“宽”
data = pd.read_csv('examples/macrodata.csv')
data.head()
     year  quarter   realgdp  realcons  realinv  realgovt  realdpi    cpi     m1  tbilrate  unemp      pop  infl  realint
0  1959.0      1.0  2710.349    1707.4  286.898   470.045   1886.9  28.98  139.7      2.82    5.8  177.146  0.00     0.00
1  1959.0      2.0  2778.801    1733.7  310.859   481.301   1919.7  29.15  141.7      3.08    5.1  177.830  2.34     0.74
2  1959.0      3.0  2775.488    1751.8  289.226   491.260   1916.4  29.35  140.5      3.82    5.3  178.657  2.74     1.09
3  1959.0      4.0  2785.204    1753.7  299.356   484.052   1931.3  29.37  140.0      4.33    5.6  179.386  0.27     4.06
4  1960.0      1.0  2847.699    1770.5  331.722   462.199   1955.5  29.54  139.6      3.50    5.2  180.007  2.31     1.19
periods = pd.PeriodIndex(year=data.year, quarter=data.quarter, name='date')
periods
PeriodIndex(['1959Q1', '1959Q2', '1959Q3', '1959Q4', '1960Q1', '1960Q2',
'1960Q3', '1960Q4', '1961Q1', '1961Q2',
...
'2007Q2', '2007Q3', '2007Q4', '2008Q1', '2008Q2', '2008Q3',
'2008Q4', '2009Q1', '2009Q2', '2009Q3'],
dtype='period[Q-DEC]', name='date', length=203, freq='Q-DEC')
columns = pd.Index(['realgdp', 'infl', 'unemp'], name='item')
columns
Index(['realgdp', 'infl', 'unemp'], dtype='object', name='item')
data = data.reindex(columns=columns)
data.head()
item   realgdp  infl  unemp
0     2710.349   NaN    5.8
1     2778.801   NaN    5.1
2     2775.488   NaN    5.3
3     2785.204   NaN    5.6
4     2847.699   NaN    5.2
data.index = periods.to_timestamp('D', 'end')
data.head()
item                            realgdp  infl  unemp
date
1959-03-31 23:59:59.999999999  2710.349   NaN    5.8
1959-06-30 23:59:59.999999999  2778.801   NaN    5.1
1959-09-30 23:59:59.999999999  2775.488   NaN    5.3
1959-12-31 23:59:59.999999999  2785.204   NaN    5.6
1960-03-31 23:59:59.999999999  2847.699   NaN    5.2
ldata = data.stack().reset_index().rename(columns={0: 'value'})
ldata[:10]
                           date     item     value
0 1959-03-31 23:59:59.999999999  realgdp  2710.349
1 1959-03-31 23:59:59.999999999    unemp     5.800
2 1959-06-30 23:59:59.999999999  realgdp  2778.801
3 1959-06-30 23:59:59.999999999    unemp     5.100
4 1959-09-30 23:59:59.999999999  realgdp  2775.488
5 1959-09-30 23:59:59.999999999    unemp     5.300
6 1959-12-31 23:59:59.999999999  realgdp  2785.204
7 1959-12-31 23:59:59.999999999    unemp     5.600
8 1960-03-31 23:59:59.999999999  realgdp  2847.699
9 1960-03-31 23:59:59.999999999    unemp     5.200
pivoted = ldata.pivot('date', 'item', 'value')
pivoted
item                             realgdp  unemp
date
1959-03-31 23:59:59.999999999   2710.349    5.8
1959-06-30 23:59:59.999999999   2778.801    5.1
1959-09-30 23:59:59.999999999   2775.488    5.3
1959-12-31 23:59:59.999999999   2785.204    5.6
1960-03-31 23:59:59.999999999   2847.699    5.2
...                                  ...    ...
2008-09-30 23:59:59.999999999  13324.600    6.0
2008-12-31 23:59:59.999999999  13141.920    6.9
2009-03-31 23:59:59.999999999  12925.410    8.1
2009-06-30 23:59:59.999999999  12901.504    9.2
2009-09-30 23:59:59.999999999  12990.341    9.6
203 rows × 2 columns
8.3.3 将“宽”透视为“长”
在DataFrame中,pivot方法的反操作是pandas.melt
df = pd.DataFrame({'key': ['foo', 'bar', 'baz'],
                   'A': [1, 2, 3],
                   'B': [4, 5, 6],
                   'C': [7, 8, 9]})
df
   key  A  B  C
0  foo  1  4  7
1  bar  2  5  8
2  baz  3  6  9
melted = pd.melt(df)
melted
   variable value
0       key   foo
1       key   bar
2       key   baz
3         A     1
4         A     2
5         A     3
6         B     4
7         B     5
8         B     6
9         C     7
10        C     8
11        C     9
melted = pd.melt(df, ['key'])
melted
   key variable  value
0  foo        A      1
1  bar        A      2
2  baz        A      3
3  foo        B      4
4  bar        B      5
5  baz        B      6
6  foo        C      7
7  bar        C      8
8  baz        C      9
reshaped = melted.pivot('key', 'variable', 'value')
reshaped
variable  A  B  C
key
bar       2  5  8
baz       3  6  9
foo       1  4  7
reshaped.reset_index()
variable  key  A  B  C
0         bar  2  5  8
1         baz  3  6  9
2         foo  1  4  7
pd.melt(df, id_vars=['key'], value_vars=['A', 'B'])
   key variable  value
0  foo        A      1
1  bar        A      2
2  baz        A      3
3  foo        B      4
4  bar        B      5
5  baz        B      6
pd.melt(df, value_vars=['A', 'B', 'C'])
  variable  value
0        A      1
1        A      2
2        A      3
3        B      4
4        B      5
5        B      6
6        C      7
7        C      8
8        C      9
pd.melt(df, value_vars=['A', 'B', 'key'])
  variable value
0        A     1
1        A     2
2        A     3
3        B     4
4        B     5
5        B     6
6      key   foo
7      key   bar
8      key   baz
第九章 绘图与可视化
%matplotlib notebook
9.1 简明matplotlib API入门
使用Jupyter notebook时有个细节需要注意,在每个单元格运行后,图表被重置,因此对于更复杂的图表,你必须将所有的绘图命令放在单个的notebook单元格中
import matplotlib.pyplot as plt
import numpy as np

data = np.arange(10)
data
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
plt.plot(data)
9.1.1 图片与子图
matplotlib所绘制的图位于图片(Figure)对象中。pyplot.subplots的选项如下:
- nrows:子图的行数
- ncols:子图的列数
- sharex:所有子图使用相同的x轴刻度(调整xlim会影响所有子图)
- sharey:所有子图使用相同的y轴刻度(调整ylim会影响所有子图)
- subplot_kw:传入add_subplot的关键字参数字典,用于生成子图
- **fig_kw:在生成图片时使用的额外关键字参数,例如plt.subplots(2, 2, figsize=(8, 6))
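作为对照,下面用plt.subplots一次性创建2×2的子图数组并共享坐标轴(一个最小示意):

fig, axes = plt.subplots(2, 2, sharex=True, sharey=True, figsize=(8, 6))
axes[0, 1].hist(np.random.randn(100), bins=20, color='k', alpha=0.5)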
fig = plt.figure()
ax1 = fig.add_subplot(2, 2, 1)
ax2 = fig.add_subplot(2, 2, 2)
ax3 = fig.add_subplot(2, 2, 3)