Pandas

License: CC 4.0 BY-SA

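This article uses Python's pandas and statsmodels libraries to analyze the four Anscombe datasets statistically: it computes the mean and variance of x and y and the correlation coefficient between them, then draws scatter plots with Seaborn. The result shows that datasets with nearly identical summary statistics can still have very different distributions.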

For each of the four datasets...

Compute the mean and variance of both x and y
Compute the correlation coefficient between x and y
Compute the linear regression line: (hint: use statsmodels and look at the Statsmodels notebook)
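
For reference, the quantities asked for above can be written as follows. This is a sketch that assumes the usual sample (n - 1) definitions, which are the defaults of pandas' var() and corr(); the symbols are notation introduced here, not part of the original exercise.

```latex
\begin{aligned}
\bar{x} &= \frac{1}{n}\sum_{i=1}^{n} x_i, &
s_x^2 &= \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2, \\
r &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, s_x s_y}, \\
\hat{y} &= \hat{\beta}_0 + \hat{\beta}_1 x, \quad
\hat{\beta}_1 = r\,\frac{s_y}{s_x}, \quad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}.
\end{aligned}
```

The mean and variance of y are defined the same way with y in place of x.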

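import pandas as pd
from statsmodels.formula.api import ols

# Assumption: anscombe.csv has columns 'dataset', 'x' and 'y',
# where 'dataset' labels the four groups (e.g. I, II, III, IV).
anascombe = pd.read_csv('anscombe.csv')

for name, group in anascombe.groupby('dataset'):
    print('Dataset', name)
    print('Mean of x:', group['x'].mean())
    print('Variance of x:', group['x'].var())  # var(), not std()
    print('Mean of y:', group['y'].mean())
    print('Variance of y:', group['y'].var())
    print('Correlation coefficient between x and y:', group['x'].corr(group['y']))

    # Regress y on x; the formula reads 'response ~ predictor'.
    model = ols('y ~ x', group).fit()
    print(model.summary())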
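
To try the calculation without the CSV file, here is a minimal self-contained sketch. It embeds the values of Anscombe's quartet as published in 1973 and collects the statistics into one table. The names `quartet`, `summarize` and `summary` are illustrative, and `np.polyfit` stands in for statsmodels to keep the dependencies small; it fits the same least-squares line but gives no regression summary.

```python
import numpy as np
import pandas as pd

# Anscombe's quartet as published (Anscombe, 1973); datasets I-III share x.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    'I':   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    'II':  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    'III': (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    'IV':  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Long format: one row per observation, labelled by dataset.
df = pd.concat(
    [pd.DataFrame({'dataset': name, 'x': x, 'y': y}) for name, (x, y) in quartet.items()],
    ignore_index=True,
)

def summarize(group):
    # np.polyfit with degree 1 returns (slope, intercept) of the least-squares line.
    slope, intercept = np.polyfit(group['x'], group['y'], 1)
    return pd.Series({
        'mean_x': group['x'].mean(),
        'var_x': group['x'].var(),   # sample variance (ddof=1)
        'mean_y': group['y'].mean(),
        'var_y': group['y'].var(),
        'corr': group['x'].corr(group['y']),
        'slope': slope,
        'intercept': intercept,
    })

# One row of statistics per dataset.
rows = {name: summarize(group) for name, group in df.groupby('dataset')}
summary = pd.DataFrame(rows).T
print(summary.round(2))
```

All four rows agree to about two decimal places: the mean of x is 9, the variance of x is 11, the mean of y is about 7.50, the variance of y is about 4.12 to 4.13, the correlation is about 0.82, and the fitted line is roughly y = 3.00 + 0.50x.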

Using Seaborn, visualize all four datasets.

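import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Assumption: anscombe.csv has a 'dataset' column identifying the four groups.
anascombe = pd.read_csv('anscombe.csv')
# One scatter panel per dataset, arranged in a 2x2 grid.
g = sns.relplot(data=anascombe, x='x', y='y', col='dataset', col_wrap=2, kind='scatter')
g.set_axis_labels('x', 'y')
plt.show()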