2.3 汇总和计算描述统计
import numpy as np
import pandas as pd
df = pd. DataFrame( [ [ 1.4 , np. nan] ,
[ 7.1 , - 4.5 ] ,
[ np. nan, np. nan] ,
[ 0.75 , - 1.3 ] ] ,
index= list ( 'abcd' ) ,
columns= [ 'one' , 'two' ] )
df
one two a 1.40 NaN b 7.10 -4.5 c NaN NaN d 0.75 -1.3
描述和汇总统计的一些方法
sum()
df. sum ( )
one 9.25
two -5.80
dtype: float64
df. sum ( axis= 'columns' )
a 1.40
b 2.60
c 0.00
d -0.55
dtype: float64
NA值会⾃动被排除,除⾮整个切⽚(这⾥指的是⾏或列)都是NA,可以通过skipna禁用(skip NA)
mean
df. mean( axis= 1 )
a 1.400
b 1.300
c NaN
d -0.275
dtype: float64
df. mean( axis= 1 , skipna= False )
a NaN
b 1.300
c NaN
d -0.275
dtype: float64
约简型:idxmin/idxmax
df. idxmax( )
one b
two d
dtype: object
df. idxmin( )
one d
two b
dtype: object
累计型函数
df. cumsum( )
one two a 1.40 NaN b 8.50 -4.5 c NaN NaN d 9.25 -5.8
df. cummax( )
one two a 1.4 NaN b 7.1 -4.5 c NaN NaN d 7.1 -1.3
df. cummin( )
one two a 1.40 NaN b 1.40 -4.5 c NaN NaN d 0.75 -4.5
df. cumprod( )
one two a 1.400 NaN b 9.940 -4.50 c NaN NaN d 7.455 5.85
汇总统计:describe
df. describe( )
one two count 3.000000 2.000000 mean 3.083333 -2.900000 std 3.493685 2.262742 min 0.750000 -4.500000 25% 1.075000 -3.700000 50% 1.400000 -2.900000 75% 4.250000 -2.100000 max 7.100000 -1.300000
obj = pd. Series( [ 'a' , 'a' , 'b' , 'c' ] * 4 )
obj
0 a
1 a
2 b
3 c
4 a
5 a
6 b
7 c
8 a
9 a
10 b
11 c
12 a
13 a
14 b
15 c
dtype: object
obj. describe( )
count 16
unique 3
top a
freq 8
dtype: object
总结:描述和汇总统计
方法 说明 count 非NA值的总数 describe 针对Series和DataFrame列计算汇总统计 min,max 最小值和最大值 argmin, argmax 索引位置 idxmin, idxmax 索引值 quantile 计算样本的分位数(0到1) sum 值的总和 mean 值的平均数 median 值的算数中位数 mad 根据平均值计算平均绝对离差 var 样本值的方差 std 样本值的标准差 skew 样本值的偏度(三阶矩) kurt 样本值的峰度(四阶矩) cumsum 累计和 cumprod 累计积 cummin,cummax 样本值的累计最大值和累计最小值 diff 计算一阶差分(对时间序列很有用) pct_change 计算百分数变化
相关系数与协方差
import pandas_datareader. data as web
all_data = { ticker: web. get_data_yahoo( ticker) for ticker in [ 'AAPL' , 'IBM' , 'MSFT' , 'GOOG' ] }
all_data
{'AAPL': High Low Open Close Volume \
Date
2015-10-26 29.532499 28.730000 29.520000 28.820000 265335200.0
2015-10-27 29.135000 28.497499 28.850000 28.637501 279537600.0
2015-10-28 29.825001 29.014999 29.232500 29.817499 342205600.0
2015-10-29 30.172501 29.567499 29.674999 30.132500 204909200.0
2015-10-30 30.305000 29.862499 30.247499 29.875000 197461200.0
... ... ... ... ... ...
2020-10-19 120.419998 115.660004 119.959999 115.980003 120639300.0
2020-10-20 118.980003 115.629997 116.199997 117.510002 124423700.0
2020-10-21 118.709999 116.449997 116.669998 116.870003 89946000.0
2020-10-22 118.040001 114.589996 117.449997 115.750000 101988000.0
2020-10-23 116.550003 114.279999 116.389999 115.040001 82396600.0
Adj Close
Date
2015-10-26 26.632969
2015-10-27 26.464319
2015-10-28 27.554773
2015-10-29 27.845867
2015-10-30 27.607912
... ...
2020-10-19 115.980003
2020-10-20 117.510002
2020-10-21 116.870003
2020-10-22 115.750000
2020-10-23 115.040001
[1259 rows x 6 columns],
'IBM': High Low Open Close Volume \
Date
2015-10-26 145.000000 143.210007 144.750000 143.660004 3489400.0
2015-10-27 143.830002 137.330002 143.509995 137.860001 15008500.0
2015-10-28 141.229996 137.500000 137.919998 140.830002 8511400.0
2015-10-29 140.820007 139.020004 140.360001 140.550003 3710400.0
2015-10-30 141.330002 139.899994 140.440002 140.080002 3984000.0
... ... ... ... ... ...
2020-10-19 127.349998 125.080002 126.800003 125.519997 7478700.0
2020-10-20 120.150002 116.839996 119.800003 117.370003 21501100.0
2020-10-21 117.690002 114.790001 116.660004 115.059998 9755300.0
2020-10-22 116.059998 112.980003 115.000000 115.760002 7855800.0
2020-10-23 116.620003 115.529999 116.500000 116.000000 3890800.0
Adj Close
Date
2015-10-26 115.042992
2015-10-27 110.398384
2015-10-28 112.776749
2015-10-29 112.552536
2015-10-30 112.176132
... ...
2020-10-19 125.519997
2020-10-20 117.370003
2020-10-21 115.059998
2020-10-22 115.760002
2020-10-23 116.000000
[1259 rows x 6 columns],
'MSFT': High Low Open Close Volume \
Date
2015-10-26 54.320000 52.500000 52.529999 54.250000 64633300.0
2015-10-27 54.369999 53.580002 53.990002 53.689999 50999900.0
2015-10-28 53.980000 52.860001 53.540001 53.980000 47000800.0
2015-10-29 53.830002 53.220001 53.540001 53.360001 30202100.0
2015-10-30 53.990002 52.619999 53.320000 52.639999 46619800.0
... ... ... ... ... ...
2020-10-19 222.300003 213.720001 220.419998 214.220001 27625800.0
2020-10-20 217.369995 213.089996 215.800003 214.649994 22753500.0
2020-10-21 216.919998 213.119995 213.119995 214.800003 22724900.0
2020-10-22 216.059998 211.699997 213.929993 214.889999 22351500.0
2020-10-23 216.279999 213.160004 215.029999 216.229996 18841600.0
Adj Close
Date
2015-10-26 49.302139
2015-10-27 48.793213
2015-10-28 49.056763
2015-10-29 48.493317
2015-10-30 47.838978
... ...
2020-10-19 214.220001
2020-10-20 214.649994
2020-10-21 214.800003
2020-10-22 214.889999
2020-10-23 216.229996
[1259 rows x 6 columns],
'GOOG': High Low Open Close Volume \
Date
2015-10-26 719.150024 701.260010 701.549988 712.780029 2716600
2015-10-27 713.619995 704.549988 707.380005 708.489990 2245800
2015-10-28 712.979980 703.080017 707.330017 712.950012 2178900
2015-10-29 718.260010 710.010010 710.500000 716.919983 1456000
2015-10-30 718.000000 710.049988 715.729980 710.809998 1908800
... ... ... ... ... ...
2020-10-19 1588.150024 1528.000000 1580.459961 1534.609985 1607100
2020-10-20 1577.500000 1525.670044 1527.050049 1555.930054 2241700
2020-10-21 1618.729980 1571.630005 1573.329956 1593.310059 2568300
2020-10-22 1621.989990 1585.000000 1593.050049 1615.329956 1433600
2020-10-23 1642.359985 1620.510010 1626.069946 1641.000000 1375200
Adj Close
Date
2015-10-26 712.780029
2015-10-27 708.489990
2015-10-28 712.950012
2015-10-29 716.919983
2015-10-30 710.809998
... ...
2020-10-19 1534.609985
2020-10-20 1555.930054
2020-10-21 1593.310059
2020-10-22 1615.329956
2020-10-23 1641.000000
[1259 rows x 6 columns]}
price = pd. DataFrame( { ticker: data[ 'Adj Close' ] for ticker, data in all_data. items( ) } )
volumn = pd. DataFrame( { ticker: data[ 'Volume' ] for ticker, data in all_data. items( ) } )
price
AAPL IBM MSFT GOOG Date 2015-10-26 26.632969 115.042992 49.302139 712.780029 2015-10-27 26.464319 110.398384 48.793213 708.489990 2015-10-28 27.554773 112.776749 49.056763 712.950012 2015-10-29 27.845867 112.552536 48.493317 716.919983 2015-10-30 27.607912 112.176132 47.838978 710.809998 ... ... ... ... ... 2020-10-19 115.980003 125.519997 214.220001 1534.609985 2020-10-20 117.510002 117.370003 214.649994 1555.930054 2020-10-21 116.870003 115.059998 214.800003 1593.310059 2020-10-22 115.750000 115.760002 214.889999 1615.329956 2020-10-23 115.040001 116.000000 216.229996 1641.000000
1259 rows × 4 columns
volumn
AAPL IBM MSFT GOOG Date 2015-10-26 265335200.0 3489400.0 64633300.0 2716600 2015-10-27 279537600.0 15008500.0 50999900.0 2245800 2015-10-28 342205600.0 8511400.0 47000800.0 2178900 2015-10-29 204909200.0 3710400.0 30202100.0 1456000 2015-10-30 197461200.0 3984000.0 46619800.0 1908800 ... ... ... ... ... 2020-10-19 120639300.0 7478700.0 27625800.0 1607100 2020-10-20 124423700.0 21501100.0 22753500.0 2241700 2020-10-21 89946000.0 9755300.0 22724900.0 2568300 2020-10-22 101988000.0 7855800.0 22351500.0 1433600 2020-10-23 82396600.0 3890800.0 18841600.0 1375200
1259 rows × 4 columns
计算价格百分数变化
returns = price. pct_change( )
returns. tail( )
AAPL IBM MSFT GOOG Date 2020-10-19 -0.025542 -0.003256 -0.024766 -0.024412 2020-10-20 0.013192 -0.064930 0.002007 0.013893 2020-10-21 -0.005446 -0.019681 0.000699 0.024024 2020-10-22 -0.009583 0.006084 0.000419 0.013820 2020-10-23 -0.006134 0.002073 0.006236 0.015892
corr(相关系数)和cov(协方差)
returns. MSFT. corr( returns. IBM)
0.5721816449554378
returns. corr( )
AAPL IBM MSFT GOOG AAPL 1.000000 0.490404 0.716144 0.660741 IBM 0.490404 1.000000 0.572182 0.525325 MSFT 0.716144 0.572182 1.000000 0.780281 GOOG 0.660741 0.525325 0.780281 1.000000
returns. cov( )
AAPL IBM MSFT GOOG AAPL 0.000355 0.000149 0.000234 0.000205 IBM 0.000149 0.000260 0.000160 0.000139 MSFT 0.000234 0.000160 0.000301 0.000222 GOOG 0.000205 0.000139 0.000222 0.000270
returns. corrwith( returns. IBM)
AAPL 0.490404
IBM 1.000000
MSFT 0.572182
GOOG 0.525325
dtype: float64
returns. corrwith( returns. IBM, axis= 1 )
Date
2015-10-26 NaN
2015-10-27 NaN
2015-10-28 NaN
2015-10-29 NaN
2015-10-30 NaN
..
2020-10-19 NaN
2020-10-20 NaN
2020-10-21 NaN
2020-10-22 NaN
2020-10-23 NaN
Length: 1259, dtype: float64
唯一值、值计数以及成员资格
obj = pd. Series( [ 'c' , 'a' , 'd' , 'a' , 'a' , 'b' , 'b' , 'c' , 'c' ] )
unique()
uniques = obj. unique( )
uniques
array(['c', 'a', 'd', 'b'], dtype=object)
value_counts()
obj. value_counts( )
a 3
c 3
b 2
d 1
dtype: int64
pd. value_counts( obj. values, sort= False )
b 2
c 3
a 3
d 1
dtype: int64
isin
如果元素在内,则该位置标记为True,否则为False 可用来筛选出我们要的子集
obj
0 c
1 a
2 d
3 a
4 a
5 b
6 b
7 c
8 c
dtype: object
mask = obj. isin( [ 'b' , 'c' ] )
mask
0 True
1 False
2 False
3 False
4 False
5 True
6 True
7 True
8 True
dtype: bool
obj[ mask]
0 c
5 b
6 b
7 c
8 c
dtype: object
Index.get_indexer
original = pd. Series( [ 'c' , 'a' , 'b' , 'b' , 'c' , 'a' ] )
uniques = pd. Series( [ 'c' , 'a' , 'b' ] )
pd. Index( uniques) . get_indexer( original)
array([0, 1, 2, 2, 0, 1], dtype=int64)
data = pd. DataFrame( { 'Qu1' : [ 1 , 3 , 4 , 3 , 4 ] ,
'Qu2' : [ 2 , 3 , 1 , 2 , 3 ] ,
'Qu3' : [ 1 , 5 , 2 , 4 , 4 ] } )
data
Qu1 Qu2 Qu3 0 1 2 1 1 3 3 5 2 4 1 2 3 3 2 4 4 4 3 4
data. apply ( pd. value_counts) . fillna( 0 )
Qu1 Qu2 Qu3 1 1.0 1.0 1.0 2 0.0 2.0 1.0 3 2.0 2.0 0.0 4 2.0 0.0 2.0 5 0.0 0.0 1.0