Python Data Analysis: Ch11 Descriptive Statistics in Python

Article link: https://blog.youkuaiyun.com/s1164548515/article/details/139619285


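1. Ch11: Descriptive statistics
2. Measures of central tendency
3. Measures of dispersion
4. Kurtosis, skewness, and normality tests
5. Identifying and handling outliers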

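1. Ch11: Descriptive statistics

Descriptive statistics summarize and organize data, providing measures of its central tendency, spread, and shape.

Use case: getting a quick picture of a dataset's basic characteristics.
Function: computing the mean, median, variance, standard deviation, and so on.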

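Descriptive statistics tools in Python
Pandas is a powerful Python library for data analysis that provides a range of functions for analyzing and manipulating data, including commonly used statistical methods. [Table not preserved: commonly used Pandas statistical methods, with function names and purposes.]

NumPy and SciPy also provide commonly used statistical methods. [Table not preserved: commonly used NumPy and SciPy statistical methods.]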
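As a rough illustration of the kind of methods these libraries offer, the sketch below runs a few well-known calls on a small made-up series. The data and variable names are assumptions chosen for the example, and this is not a reconstruction of the missing tables.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical sample data, chosen only for illustration
s = pd.Series([1, 2, 2, 3, 4, 5, 5, 7])

summary = s.describe()           # count, mean, std, min, quartiles, max
pd_mean = s.mean()               # arithmetic mean
pd_median = s.median()           # median
pd_std = s.std()                 # pandas defaults to the sample std (ddof=1)

np_std = np.std(s.to_numpy())    # NumPy defaults to the population std (ddof=0)
skewness = stats.skew(s.to_numpy())  # SciPy adds shape measures such as skewness

print(summary)
print("pandas std:", pd_std, "| NumPy std:", np_std, "| skewness:", skewness)
```

Note the different defaults: pandas divides by n - 1 when computing the standard deviation, while NumPy divides by n, so the two numbers differ on the same data.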

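2. Measures of central tendency

Measures of central tendency include the mean, the median, and the mode; they describe where the center of the data lies.

Use case: analyzing the central point of the data.
Function: computing the mean, median, mode, and so on.

The formulas for these measures (the original table is not preserved) use the following notation:

x_i is the i-th observation in the dataset.
n is the total number of observations in the dataset.
w_i is the weight attached to observation x_i.
f(x) is the frequency with which each value appears in the dataset.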
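Since the table of formulas is not available, the standard textbook definitions are written out below using the notation just listed. This is a reference sketch rather than the article's original table; the symbol names \bar{x}, \bar{x}_w, G, H, and M_o are assumptions introduced here.

```latex
\begin{align*}
\text{Arithmetic mean:} \quad & \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \\
\text{Weighted arithmetic mean:} \quad & \bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \\
\text{Geometric mean:} \quad & G = \Bigl(\prod_{i=1}^{n} x_i\Bigr)^{1/n}, \quad x_i > 0 \\
\text{Harmonic mean:} \quad & H = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}, \quad x_i > 0 \\
\text{Mode:} \quad & M_o = \arg\max_{x} f(x)
\end{align*}
```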

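Note that the formulas above explain the basic principle behind each measure; library functions may be implemented differently. For example, NumPy's mean() computes the arithmetic mean directly, while the weighted arithmetic mean can be implemented by hand or computed with NumPy's average() function, which accepts a weights argument.

NumPy does not provide dedicated functions for the geometric and harmonic means, but SciPy does (scipy.stats.gmean and scipy.stats.hmean), and both can also be computed by hand from their formulas. In practice, choose the measure according to the data's characteristics (for example, whether it contains zeros or negative values) and the goal of the analysis. The geometric mean is commonly used to average growth rates across periods, while the harmonic mean suits reciprocal-type data such as rates or densities.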
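A minimal sketch of a weighted arithmetic mean, computed once with np.average and once by hand so the two can be compared. The scores and weights are hypothetical values chosen for the example.

```python
import numpy as np

values = np.array([80.0, 90.0, 70.0])   # hypothetical scores
weights = np.array([0.5, 0.3, 0.2])     # hypothetical weights

# np.average accepts a weights argument; np.mean does not
weighted_mean = np.average(values, weights=weights)

# The same quantity from the definition: sum(w_i * x_i) / sum(w_i)
manual = np.sum(values * weights) / np.sum(weights)

print("np.average:", weighted_mean)
print("manual:", manual)
```

The weights do not have to sum to 1, because the result is divided by their sum.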
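To show how these two means follow from their definitions, the sketch below computes them with plain NumPy. It assumes all values are strictly positive; the array reuses the small example list that appears later in this section.

```python
import numpy as np

x = np.array([1, 2, 2, 3, 4, 5, 5, 7], dtype=float)

# Geometric mean: exp of the mean of the logs (requires every value > 0)
geometric_mean = np.exp(np.mean(np.log(x)))

# Harmonic mean: n divided by the sum of reciprocals (requires every value > 0)
harmonic_mean = len(x) / np.sum(1.0 / x)

print("geometric mean:", geometric_mean)
print("harmonic mean:", harmonic_mean)
```

Working in log space avoids overflow from multiplying many values together, which is one reason to prefer it over taking the n-th root of the raw product.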

# Two commonly used statistics packages
import scipy.stats as stats
import numpy as np
# Two example datasets; x2 is x1 plus one extreme value
x1 = [1, 2, 2, 3, 4, 5, 5, 7]
x2 = x1 + [100]

2.1 Mean

np.mean(x1)

output
3.625

np.mean(x2)

output
14.333333333333334

print('Mean of x1:', sum(x1), '/', len(x1), '=', np.mean(x1))
print('Mean of x2:', sum(x2), '/', len(x2), '=', np.mean(x2))

output
Mean of x1: 29 / 8 = 3.625
Mean of x2: 129 / 9 = 14.333333333333334

2.2 Median

print('Median of x1:', np.median(x1))
print('Median of x2:', np.median(x2))

output
Median of x1: 3.5
Median of x2: 4.0

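2.3 Mode

SciPy has a built-in mode function, but it returns only one value: when several values are tied for the highest count, it returns only the smallest of them.

print('Mode of x1:', stats.mode(x1))

output
Mode of x1: ModeResult(mode=2, count=2)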

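Try it

A custom mode function

def mode(x):
    # Count how many times each element appears in the list
    counts = {}
    for e in x:
        if e in counts:
            counts[e] += 1
        else:
            counts[e] = 1

    # Collect every element that reaches the highest count
    maxcount = 0
    modes = set()
    for (key, value) in counts.items():
        if value > maxcount:
            maxcount = value
            modes = {key}
        elif value == maxcount:
            modes.add(key)

    # Report a mode only if some value repeats (or the list has a single element)
    if maxcount > 1 or len(x) == 1:
        return sorted(modes)
    return 'No mode'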

print('Mode of x1:', mode(x1))

output
Mode of x1: [2, 5]
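For comparison, the Python standard library (3.8 and later) offers statistics.multimode, which also returns every tied value. The sketch below is an alternative to the hand-written function, not the article's method.

```python
from statistics import multimode

x1 = [1, 2, 2, 3, 4, 5, 5, 7]

# multimode returns all values tied for the highest count,
# in the order they first appear in the data
all_modes = multimode(x1)
print(all_modes)

# Unlike the custom function above, it has no 'No mode' case:
# when nothing repeats, every value is returned
no_repeats = multimode([1, 2, 3])
print(no_repeats)
```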

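With return data, it is possible that no value appears more than once. What should we do then?
We can group the values into bins, just as when building a histogram, and report the bin that contains the most data points.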

import scipy.stats as stats
import numpy as np
import pandas as pd
# Load the return data and compute its mode
# start = '2024-01-01'
# end = '2024-05-01'
s_601318 = pd.read_csv('./data/ch11_1.csv')
s_601318.head()

output

	ts_code	trade_date	open	high	low	close	pre_close	change	pct_chg	vol	amount
0	601318.SH	20230331	45.60	46.30	45.38	45.60	45.48	0.12	0.2639	524539.93	2401305.125
1	601318.SH	20230330	45.51	45.70	44.94	45.48	45.42	0.06	0.1321	450030.47	2035988.419
2	601318.SH	20230329	46.33	46.58	45.40	45.42	45.88	-0.46	-1.0026	409020.05	1872693.109
3	601318.SH	20230328	46.09	46.38	45.68	45.88	45.99	-0.11	-0.2392	283324.34	1300822.902
4	601318.SH	20230327	46.26	46.33	45.50	45.99	46.20	-0.21	-0.4545	362172.67	1658727.910

s_601318.tail()

output

	ts_code	trade_date	open	high	low	close	pre_close	change	pct_chg	vol	amount
2001	601318.SH	20150109	71.20	78.18	70.72	72.84	71.08	1.76	2.48	3118734.02	2.316418e+07
2002	601318.SH	20150108	74.50	74.92	70.80	71.08	73.41	-2.33	-3.17	1788809.15	1.288382e+07
2003	601318.SH	20150107	73.30	75.50	72.50	73.41	73.73	-0.32	-0.43	1703868.84	1.256595e+07
2004	601318.SH	20150106	74.38	76.77	72.01	73.73	76.16	-2.43	-3.19	2342279.69	1.743863e+07
2005	601318.SH	20150105	77.80	78.80	75.25	76.16	74.71	1.45	1.94	2435717.73	1.875204e+07


returns = s_601318[['pct_chg']]

print('Mode of returns:', stats.mode(returns))

# Exact return values rarely repeat (here the most frequent value, 0, appears only 18 times),
# so we use the frequency distribution as a stand-in for the mode
hist, bins = np.histogram(returns, 20) # split the data into 20 bins
maxfreq = max(hist)
# Find the bin that holds the most data points and treat it as the mode
print('Modal bin:', [(bins[i], bins[i+1]) for i, j in enumerate(hist) if j == maxfreq])

output
Mode of returns: ModeResult(mode=array([0.]), count=array([18.]))
Modal bin: [(-0.9910000000000014, 0.00999999999999801)]
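The binning idea can be sketched on synthetic data as follows. The returns here are randomly generated (an assumption, since the CSV is not available), and the midpoint of the fullest bin is used as a single-number summary of the modal bin.

```python
import numpy as np

# Synthetic returns in percent, generated only for illustration
rng = np.random.default_rng(0)
returns = rng.normal(loc=0.05, scale=1.5, size=2000)

# Count how many observations fall into each of 20 equal-width bins
hist, bin_edges = np.histogram(returns, bins=20)

# np.argmax returns the index of the first bin with the highest count
k = np.argmax(hist)
modal_bin = (bin_edges[k], bin_edges[k + 1])
modal_midpoint = 0.5 * (bin_edges[k] + bin_edges[k + 1])

print("modal bin:", modal_bin)
print("midpoint of modal bin:", modal_midpoint)
```

The result depends on the number of bins, so it is worth checking that the modal bin is stable across a few reasonable choices.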

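2.4 Geometric mean

Use the gmean function from SciPy to compute the geometric mean.

print('Geometric mean of x1:', stats.gmean(x1))
print('Geometric mean of x2:', stats.gmean(x2))

output
Geometric mean of x1: 3.0941040249774403
Geometric mean of x2: 4.552534587620071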

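What if some observations are negative when computing the geometric mean? For returns, the usual approach is to convert each return r into a growth ratio 1 + r, which stays positive as long as r > -1, take the geometric mean of the ratios, and subtract 1.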

import scipy.stats as stats
import numpy as np
import pandas as pd

returns = s_601318['pct_chg']

# Compute the geometric mean
ratios = returns + np.ones(len(returns))  # add 1 to every element
r_g = stats.gmean(ratios) - 1  # geometric mean of the returns

print('Geometric mean of returns:', r_g)

pricing = s_601318['close']
T = len(pricing)

init_price = pricing.iloc[0]  # first price in the file
final_price = pricing.iloc[-1]  # last price in the file

print('Initial price:', init_price)
print('Final price:', final_price)

# Final price implied by the geometric mean return
estimated_final_price = init_price * (1 + r_g) ** T
print('Final price implied by the geometric mean return:', estimated_final_price)

output
Geometric mean of returns: nan
Initial price: 45.6
Final price: 76.16
Final price implied by the geometric mean return: nan
D:\Program Files (x86)\anaconda3\lib\site-packages\scipy\stats\_stats_py.py:197: RuntimeWarning: invalid value encountered in log
log_a = np.log(a)

# Add 1 to every element to compute the geometric mean
import scipy.stats as stats
import numpy as np
# returns = s_601318['pct_chg'].dropna() 
ratios = returns + 1
r_g = stats.gmean(ratios) - 1
print('Geometric mean of returns:', r_g)
pricing = s_601318['close']
T = len(pricing)
print(pricing[T-1])
init_price = pricing[0]
final_price = pricing[T-1]
print('Initial price:', init_price)
print('Final price:', final_price)
print('Final price implied by the geometric mean return:', init_price*(1 + r_g)**T)

output
Geometric mean of returns: nan
76.16
Initial price: 45.6
Final price: 76.16
Final price implied by the geometric mean return: nan

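count_non_positive = (returns <= 0).sum()
if count_non_positive > 0:
    print(f"{len(returns)} samples in total, {count_non_positive} of which are less than or equal to 0.")
else:
    print("No values are less than or equal to 0.")

output
2006 samples in total, 1027 of which are less than or equal to 0.

Non-positive returns alone do not cause the nan, since 1 + r stays positive whenever r > -1. The nan arises because pct_chg is expressed in percent (for example, -3.17 means -3.17%), so returns + 1 is negative on any day that lost more than 1%, and the logarithm inside gmean is undefined for negative values. Note also that the rows are sorted newest first, so the "initial" price 45.6 is the most recent close (2023-03-31) and the "final" price 76.16 is the oldest (2015-01-05).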
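One way to avoid the nan is to convert the percentages to decimal growth ratios and put the rows in chronological order before compounding. The sketch below uses a tiny made-up table with the same column names as the CSV (the values are assumptions, not the article's data). It compounds from the pre_close of the first day, so that T returns link the start price to the last close exactly.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for the article's CSV (values are made up):
# rows sorted newest first, pct_chg expressed in percent.
prices = pd.DataFrame({
    "trade_date": [20240105, 20240104, 20240103, 20240102],
    "close":      [104.0, 98.0, 103.0, 100.0],
    "pre_close":  [98.0, 103.0, 100.0, 100.0],
})
prices["pct_chg"] = (prices["close"] / prices["pre_close"] - 1) * 100

# Put the rows in chronological order (oldest first)
chron = prices.sort_values("trade_date").reset_index(drop=True)

# Convert percent to growth ratios; these stay positive unless a day loses 100%
ratios = 1 + chron["pct_chg"] / 100
r_g = stats.gmean(ratios) - 1

# Compounding r_g over T periods from the price before the first return
# reproduces the last close
T = len(ratios)
start_price = chron["pre_close"].iloc[0]
estimated_final = start_price * (1 + r_g) ** T

print("Geometric mean return:", r_g)
print("Estimated final price:", estimated_final)
print("Actual final price:", chron["close"].iloc[-1])
```

With the percentages left unscaled, the day that lost about 4.85% would give a ratio of roughly -3.85, which is the situation that produces nan above.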

2.5 Harmonic mean

print('Harmonic mean of x1:', stats.hmean(x1))
print('Harmonic mean of x2:', stats.hmean(x2))

output
Harmonic mean of x1: 2.5590251332825593
Harmonic mean of x2: 2.869723656240511
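A classic case where the harmonic mean is the right average is speed over legs of equal distance. The sketch below uses hypothetical speeds and distances, and checks stats.hmean against the average speed computed from total distance and total time.

```python
from scipy import stats

speeds = [60.0, 40.0]   # km/h on two legs of equal length (hypothetical)
leg_km = 120.0          # length of each leg in km (hypothetical)

# Average speed from first principles: total distance / total time
total_time = sum(leg_km / v for v in speeds)
true_average_speed = leg_km * len(speeds) / total_time

# The harmonic mean of the speeds gives the same answer
harmonic = stats.hmean(speeds)

print("distance / time:", true_average_speed)
print("harmonic mean:", harmonic)
print("arithmetic mean:", sum(speeds) / len(speeds))
```

The arithmetic mean overstates the average speed here, because more time is spent on the slower leg.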

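3. Measures of dispersion

Measures of dispersion include the variance, the standard deviation, and the range; they quantify how much the data varies.

Use case: understanding how widely the data fluctuates.
Function: computing the variance and standard deviation.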

import numpy as np
np.random.seed(121)
# Generate 20 random integers in the range 0 to 99
x3 = np.random.randint(100, size=20)
x3 = np.sort(x3)
print('x3: %s' %(x3))
mu = np.mean(x3)
print('Mean of x3:', mu)

output
x3: [ 3 8 34 39 46 52 52 52 54 57 60 65 66 75 83 85 88 94 95 96]
Mean of x3: 60.2

3.1 Range

np.max(x3)-np.min(x3)

output
93

print('Range of x3: %s' %(np.ptp(x3)))

output
Range of x3: 93

3.2 Mean absolute deviation (MAD)

abs_dispersion = [np.abs(mu - x) for x in x3]
MAD = np.sum(abs_dispersion)/len(abs_dispersion)
print('Mean absolute deviation of x3:', MAD)

output
Mean absolute deviation of x3: 20.520000000000003
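The same quantity can be written without the list comprehension by letting NumPy broadcast the subtraction. This is a sketch of an equivalent vectorized form; the array is copied from the printed x3 above so that the block runs on its own.

```python
import numpy as np

# Values copied from the printed x3 above
x3 = np.array([3, 8, 34, 39, 46, 52, 52, 52, 54, 57,
               60, 65, 66, 75, 83, 85, 88, 94, 95, 96])

# Mean absolute deviation: average distance of each value from the mean
mad = np.mean(np.abs(x3 - x3.mean()))
print("Mean absolute deviation of x3:", mad)
```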

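3.3 Variance and standard deviation

# np.var and np.std default to the population formulas (divide by n)
print('Variance of x3:', np.var(x3))
print('Standard deviation of x3:', np.std(x3))

output
Variance of x3: 670.16
Standard deviation of x3: 25.887448696231154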
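NumPy's defaults give the population variance (dividing by n). When the data is a sample and an unbiased estimate is wanted, pass ddof=1 to divide by n - 1 instead. A minimal sketch, with the array copied from the printed x3 above:

```python
import numpy as np

# Values copied from the printed x3 above
x3 = np.array([3, 8, 34, 39, 46, 52, 52, 52, 54, 57,
               60, 65, 66, 75, 83, 85, 88, 94, 95, 96])

population_var = np.var(x3)          # divides by n (ddof=0, the default)
sample_var = np.var(x3, ddof=1)      # divides by n - 1
sample_std = np.std(x3, ddof=1)

print("population variance:", population_var)
print("sample variance:", sample_var)
print("sample standard deviation:", sample_std)
```

The two variances differ by the factor n / (n - 1), which matters for small samples such as this one (n = 20).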

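3.4 Lower semivariance and lower semideviation

# There is no ready-made function for the lower semivariance, so we compute it by hand:
lows = [x for x in x3 if x <= mu]
semivar = np.sum( (lows - mu) ** 2 ) / len(lows)
print('Lower semivariance of x3:', semivar)
print('Lower semideviation of x3:', np.sqrt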