2.3 pandas:汇总和计算描述统计(统计上的方法)

本文介绍了如何使用Python进行数据的描述统计,包括sum、mean、median等方法,以及df.describe()的全面汇总,涵盖了非数值型数据的处理。还展示了计算价格百分比变化、相关系数与协方差的应用。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >

2.3 汇总和计算描述统计

import numpy as np
import pandas as pd
df = pd.DataFrame([[1.4, np.nan],
                   [7.1,-4.5],
                   [np.nan, np.nan],
                   [0.75,-1.3]],
                 index=list('abcd'),
                 columns=['one','two'])
                  
df
onetwo
a1.40NaN
b7.10-4.5
cNaNNaN
d0.75-1.3

描述和汇总统计的一些方法

sum()
df.sum() #默认求列
one    9.25
two   -5.80
dtype: float64
df.sum(axis='columns')
a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

NA值会⾃动被排除,除⾮整个切⽚(这⾥指的是⾏或列)都是NA,可以通过skipna禁用(skip NA)

mean
df.mean(axis=1) #a行排除了NA,所以是1.40/1=1.40
a    1.400
b    1.300
c      NaN
d   -0.275
dtype: float64
df.mean(axis=1,skipna=False)
a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64
约简型:idxmin/idxmax
  • 返回的是索引
df.idxmax()
one    b
two    d
dtype: object
df.idxmin()
one    d
two    b
dtype: object
累计型函数
df.cumsum() #默认是列
onetwo
a1.40NaN
b8.50-4.5
cNaNNaN
d9.25-5.8
df.cummax() #默认是列,不断更新到当前行的最小值
onetwo
a1.4NaN
b7.1-4.5
cNaNNaN
d7.1-1.3
df.cummin() #默认是列,不断更新到当前行的最大值
onetwo
a1.40NaN
b1.40-4.5
cNaNNaN
d0.75-4.5
df.cumprod() #默认是列,累乘
onetwo
a1.400NaN
b9.940-4.50
cNaNNaN
d7.4555.85
汇总统计:describe
#对于数值型数据
df.describe()
onetwo
count3.0000002.000000
mean3.083333-2.900000
std3.4936852.262742
min0.750000-4.500000
25%1.075000-3.700000
50%1.400000-2.900000
75%4.250000-2.100000
max7.100000-1.300000
obj = pd.Series(['a','a','b','c']*4)
obj
0     a
1     a
2     b
3     c
4     a
5     a
6     b
7     c
8     a
9     a
10    b
11    c
12    a
13    a
14    b
15    c
dtype: object
#非数值型数据
obj.describe()
count     16
unique     3
top        a
freq       8
dtype: object
总结:描述和汇总统计
方法说明
count非NA值的总数
describe针对Series和DataFrame列计算汇总统计
min,max最小值和最大值
argmin, argmax索引位置
idxmin, idxmax索引值
quantile计算样本的分位数(0到1)
sum值的总和
mean值的平均数
median值的算数中位数
mad根据平均值计算平均绝对离差
var样本值的方差
std样本值的标准差
skew样本值的偏度(三阶矩)
kurt样本值的峰度(四阶矩)
cumsum累计和
cumprod累计积
cummin,cummax样本值的累计最大值和累计最小值
diff计算一阶差分(对时间序列很有用)
pct_change计算百分数变化

相关系数与协方差

import pandas_datareader.data as web #直接pip install pandas-datareader 
all_data = {ticker: web.get_data_yahoo(ticker) for ticker in ['AAPL','IBM','MSFT','GOOG']}
all_data
{'AAPL':                   High         Low        Open       Close       Volume  \
 Date                                                                      
 2015-10-26   29.532499   28.730000   29.520000   28.820000  265335200.0   
 2015-10-27   29.135000   28.497499   28.850000   28.637501  279537600.0   
 2015-10-28   29.825001   29.014999   29.232500   29.817499  342205600.0   
 2015-10-29   30.172501   29.567499   29.674999   30.132500  204909200.0   
 2015-10-30   30.305000   29.862499   30.247499   29.875000  197461200.0   
 ...                ...         ...         ...         ...          ...   
 2020-10-19  120.419998  115.660004  119.959999  115.980003  120639300.0   
 2020-10-20  118.980003  115.629997  116.199997  117.510002  124423700.0   
 2020-10-21  118.709999  116.449997  116.669998  116.870003   89946000.0   
 2020-10-22  118.040001  114.589996  117.449997  115.750000  101988000.0   
 2020-10-23  116.550003  114.279999  116.389999  115.040001   82396600.0   
 
              Adj Close  
 Date                    
 2015-10-26   26.632969  
 2015-10-27   26.464319  
 2015-10-28   27.554773  
 2015-10-29   27.845867  
 2015-10-30   27.607912  
 ...                ...  
 2020-10-19  115.980003  
 2020-10-20  117.510002  
 2020-10-21  116.870003  
 2020-10-22  115.750000  
 2020-10-23  115.040001  
 
 [1259 rows x 6 columns],
 'IBM':                   High         Low        Open       Close      Volume  \
 Date                                                                     
 2015-10-26  145.000000  143.210007  144.750000  143.660004   3489400.0   
 2015-10-27  143.830002  137.330002  143.509995  137.860001  15008500.0   
 2015-10-28  141.229996  137.500000  137.919998  140.830002   8511400.0   
 2015-10-29  140.820007  139.020004  140.360001  140.550003   3710400.0   
 2015-10-30  141.330002  139.899994  140.440002  140.080002   3984000.0   
 ...                ...         ...         ...         ...         ...   
 2020-10-19  127.349998  125.080002  126.800003  125.519997   7478700.0   
 2020-10-20  120.150002  116.839996  119.800003  117.370003  21501100.0   
 2020-10-21  117.690002  114.790001  116.660004  115.059998   9755300.0   
 2020-10-22  116.059998  112.980003  115.000000  115.760002   7855800.0   
 2020-10-23  116.620003  115.529999  116.500000  116.000000   3890800.0   
 
              Adj Close  
 Date                    
 2015-10-26  115.042992  
 2015-10-27  110.398384  
 2015-10-28  112.776749  
 2015-10-29  112.552536  
 2015-10-30  112.176132  
 ...                ...  
 2020-10-19  125.519997  
 2020-10-20  117.370003  
 2020-10-21  115.059998  
 2020-10-22  115.760002  
 2020-10-23  116.000000  
 
 [1259 rows x 6 columns],
 'MSFT':                   High         Low        Open       Close      Volume  \
 Date                                                                     
 2015-10-26   54.320000   52.500000   52.529999   54.250000  64633300.0   
 2015-10-27   54.369999   53.580002   53.990002   53.689999  50999900.0   
 2015-10-28   53.980000   52.860001   53.540001   53.980000  47000800.0   
 2015-10-29   53.830002   53.220001   53.540001   53.360001  30202100.0   
 2015-10-30   53.990002   52.619999   53.320000   52.639999  46619800.0   
 ...                ...         ...         ...         ...         ...   
 2020-10-19  222.300003  213.720001  220.419998  214.220001  27625800.0   
 2020-10-20  217.369995  213.089996  215.800003  214.649994  22753500.0   
 2020-10-21  216.919998  213.119995  213.119995  214.800003  22724900.0   
 2020-10-22  216.059998  211.699997  213.929993  214.889999  22351500.0   
 2020-10-23  216.279999  213.160004  215.029999  216.229996  18841600.0   
 
              Adj Close  
 Date                    
 2015-10-26   49.302139  
 2015-10-27   48.793213  
 2015-10-28   49.056763  
 2015-10-29   48.493317  
 2015-10-30   47.838978  
 ...                ...  
 2020-10-19  214.220001  
 2020-10-20  214.649994  
 2020-10-21  214.800003  
 2020-10-22  214.889999  
 2020-10-23  216.229996  
 
 [1259 rows x 6 columns],
 'GOOG':                    High          Low         Open        Close   Volume  \
 Date                                                                      
 2015-10-26   719.150024   701.260010   701.549988   712.780029  2716600   
 2015-10-27   713.619995   704.549988   707.380005   708.489990  2245800   
 2015-10-28   712.979980   703.080017   707.330017   712.950012  2178900   
 2015-10-29   718.260010   710.010010   710.500000   716.919983  1456000   
 2015-10-30   718.000000   710.049988   715.729980   710.809998  1908800   
 ...                 ...          ...          ...          ...      ...   
 2020-10-19  1588.150024  1528.000000  1580.459961  1534.609985  1607100   
 2020-10-20  1577.500000  1525.670044  1527.050049  1555.930054  2241700   
 2020-10-21  1618.729980  1571.630005  1573.329956  1593.310059  2568300   
 2020-10-22  1621.989990  1585.000000  1593.050049  1615.329956  1433600   
 2020-10-23  1642.359985  1620.510010  1626.069946  1641.000000  1375200   
 
               Adj Close  
 Date                     
 2015-10-26   712.780029  
 2015-10-27   708.489990  
 2015-10-28   712.950012  
 2015-10-29   716.919983  
 2015-10-30   710.809998  
 ...                 ...  
 2020-10-19  1534.609985  
 2020-10-20  1555.930054  
 2020-10-21  1593.310059  
 2020-10-22  1615.329956  
 2020-10-23  1641.000000  
 
 [1259 rows x 6 columns]}
price = pd.DataFrame({ticker:data['Adj Close'] for ticker, data in all_data.items()})
volumn = pd.DataFrame({ticker: data['Volume'] for ticker, data in all_data.items()})
price
AAPLIBMMSFTGOOG
Date
2015-10-2626.632969115.04299249.302139712.780029
2015-10-2726.464319110.39838448.793213708.489990
2015-10-2827.554773112.77674949.056763712.950012
2015-10-2927.845867112.55253648.493317716.919983
2015-10-3027.607912112.17613247.838978710.809998
...............
2020-10-19115.980003125.519997214.2200011534.609985
2020-10-20117.510002117.370003214.6499941555.930054
2020-10-21116.870003115.059998214.8000031593.310059
2020-10-22115.750000115.760002214.8899991615.329956
2020-10-23115.040001116.000000216.2299961641.000000

1259 rows × 4 columns

volumn
AAPLIBMMSFTGOOG
Date
2015-10-26265335200.03489400.064633300.02716600
2015-10-27279537600.015008500.050999900.02245800
2015-10-28342205600.08511400.047000800.02178900
2015-10-29204909200.03710400.030202100.01456000
2015-10-30197461200.03984000.046619800.01908800
...............
2020-10-19120639300.07478700.027625800.01607100
2020-10-20124423700.021501100.022753500.02241700
2020-10-2189946000.09755300.022724900.02568300
2020-10-22101988000.07855800.022351500.01433600
2020-10-2382396600.03890800.018841600.01375200

1259 rows × 4 columns

计算价格百分数变化

returns = price.pct_change()
returns.tail()
AAPLIBMMSFTGOOG
Date
2020-10-19-0.025542-0.003256-0.024766-0.024412
2020-10-200.013192-0.0649300.0020070.013893
2020-10-21-0.005446-0.0196810.0006990.024024
2020-10-22-0.0095830.0060840.0004190.013820
2020-10-23-0.0061340.0020730.0062360.015892

corr(相关系数)和cov(协方差)

returns.MSFT.corr(returns.IBM)
0.5721816449554378
returns.corr() 
AAPLIBMMSFTGOOG
AAPL1.0000000.4904040.7161440.660741
IBM0.4904041.0000000.5721820.525325
MSFT0.7161440.5721821.0000000.780281
GOOG0.6607410.5253250.7802811.000000
returns.cov()
AAPLIBMMSFTGOOG
AAPL0.0003550.0001490.0002340.000205
IBM0.0001490.0002600.0001600.000139
MSFT0.0002340.0001600.0003010.000222
GOOG0.0002050.0001390.0002220.000270
returns.corrwith(returns.IBM)
AAPL    0.490404
IBM     1.000000
MSFT    0.572182
GOOG    0.525325
dtype: float64
returns.corrwith(returns.IBM, axis=1) #axis=1可以按行计算
Date
2015-10-26   NaN
2015-10-27   NaN
2015-10-28   NaN
2015-10-29   NaN
2015-10-30   NaN
              ..
2020-10-19   NaN
2020-10-20   NaN
2020-10-21   NaN
2020-10-22   NaN
2020-10-23   NaN
Length: 1259, dtype: float64

唯一值、值计数以及成员资格

obj = pd.Series(['c','a','d','a','a','b','b','c','c'])
unique()
uniques = obj.unique() #记得之前有提到过了,提取单一元素出来
uniques
array(['c', 'a', 'd', 'b'], dtype=object)
value_counts()
  • 按value的出现次数多少降序排序
obj.value_counts()
a    3
c    3
b    2
d    1
dtype: int64
pd.value_counts(obj.values,sort=False)
b    2
c    3
a    3
d    1
dtype: int64
isin
  • 如果元素在内,则该位置标记为True,否则为False
  • 可用来筛选出我们要的子集
obj
0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object
mask = obj.isin(['b','c'])
mask
0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool
obj[mask]
0    c
5    b
6    b
7    c
8    c
dtype: object
Index.get_indexer
original = pd.Series(['c','a','b','b','c','a'])
uniques = pd.Series(['c','a','b'])
pd.Index(uniques).get_indexer(original)
array([0, 1, 2, 2, 0, 1], dtype=int64)
data = pd.DataFrame({'Qu1':[1,3,4,3,4],
                     'Qu2':[2,3,1,2,3],
                     'Qu3':[1,5,2,4,4]})
data
Qu1Qu2Qu3
0121
1335
2412
3324
4434
data.apply(pd.value_counts).fillna(0)
Qu1Qu2Qu3
11.01.01.0
20.02.01.0
32.02.00.0
42.00.02.0
50.00.01.0
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值