For each of the four datasets...
- Compute the mean and variance of both x and y
- Compute the correlation coefficient between x and y
- Compute the linear regression line: (hint: use statsmodels and look at the Statsmodels notebook)
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
anascombe = pd.read_csv('anscombe.csv')
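# Summary statistics, computed here over every row in the file
print('Mean of x:', anascombe['x'].mean())
# .var() returns the sample variance; .std() would return the standard deviation
print('Variance of x:', anascombe['x'].var())
print('Mean of y:', anascombe['y'].mean())
print('Variance of y:', anascombe['y'].var())
# Pearson correlation between the two numeric columns
print('Correlation coefficient between x and y:', anascombe['x'].corr(anascombe['y']))
# Regress y on x to get the line y = b0 + b1 * x
model = ols('y ~ x', anascombe).fit()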
print(model.summary())
Using Seaborn, visualize all four datasets.
import pandas as pd
import matplotlib.pyplot as plt
anascombe = pd.read_csv('anscombe.csv')
# Scatter plot of y against x on a single set of axes (plain matplotlib)
f, ax = plt.subplots()
ax.scatter(anascombe['x'], anascombe['y'])
ax.set_xlabel('x')
ax.set_ylabel('y')
plt.show()
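The plot above puts every row on one set of axes and uses matplotlib directly. A minimal sketch of a Seaborn version with one panel per dataset could look like the following. It assumes the CSV is in long format with a column named `dataset` that labels the four groups; that column name and the helper `plot_quartet` are assumptions for illustration, not something the code above confirms.

```python
import pandas as pd
import seaborn as sns


def plot_quartet(df: pd.DataFrame) -> sns.FacetGrid:
    """Draw one scatter panel per dataset, each with its fitted line.

    Assumes df has the columns 'dataset', 'x' and 'y' (hypothetical layout).
    """
    grid = sns.lmplot(
        data=df,
        x="x",
        y="y",
        col="dataset",  # one panel per value of the 'dataset' column
        col_wrap=2,     # wrap the panels into a 2 x 2 grid
        ci=None,        # draw the regression line without a confidence band
        height=3,
    )
    return grid
```

Under that assumption, `plot_quartet(pd.read_csv('anscombe.csv'))` followed by `plt.show()` should display the four panels side by side, which makes the differences in shape between the datasets easy to see.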
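This article uses Python's pandas and statsmodels libraries to analyze the four Anscombe datasets: it computes the mean and variance of x and y and the correlation coefficient between them, then draws scatter plots of the data, showing that datasets with nearly identical summary statistics can have very different distributions.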
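To check the "similar statistics" claim for each dataset separately, the same calculations can be grouped by dataset. The sketch below is one way to do that, again assuming a long-format table with a `dataset` label column next to `x` and `y`; the function name `summarize_by_dataset` and the output column names are made up for this example.

```python
import pandas as pd
from statsmodels.formula.api import ols


def summarize_by_dataset(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row of summary statistics per dataset.

    Assumes df has the columns 'dataset', 'x' and 'y' (hypothetical layout).
    """
    rows = {}
    for name, group in df.groupby("dataset"):
        # Least-squares fit of y on x within this dataset only
        fit = ols("y ~ x", data=group).fit()
        rows[name] = {
            "mean_x": group["x"].mean(),
            "var_x": group["x"].var(),  # sample variance (ddof=1)
            "mean_y": group["y"].mean(),
            "var_y": group["y"].var(),
            "corr": group["x"].corr(group["y"]),  # Pearson correlation
            "intercept": fit.params["Intercept"],
            "slope": fit.params["x"],
        }
    return pd.DataFrame.from_dict(rows, orient="index")
```

Calling `print(summarize_by_dataset(pd.read_csv('anscombe.csv')))` would then give a four-row table. For Anscombe's quartet, every row should come out close to a mean of 9 and variance of 11 for x, a mean of about 7.50 and variance of about 4.12 for y, a correlation near 0.816, and a fitted line near y = 3.00 + 0.500x.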