[Advanced Programming Techniques Homework - Week 14] IPython notebooks, Pandas, Statsmodels


ZacharyLei


License: CC 4.0 BY-SA

Category: Advanced Programming Techniques. Tags: Python

Article link: https://blog.youkuaiyun.com/ZacharyLei/article/details/80650830


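This post computes the mean, variance, and correlation coefficient of four datasets and fits a linear regression to each one, in order to show what makes Anscombe's quartet special: the datasets have nearly identical summary statistics, yet their scatter plots show completely different shapes.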


Anscombe's quartet

Anscombe's quartet comprises four datasets and is rather famous. Why? You'll find out in this exercise.

	dataset	x	y
0	I	10	8.04
1	I	8	6.95
2	I	13	7.58
3	I	9	8.81
4	I	11	8.33

Part 1

For each of the four datasets...

Compute the mean and variance of both x and y
Compute the correlation coefficient between x and y
Compute the linear regression line: y = β₀ + β₁x + ϵ (hint: use statsmodels and look at the Statsmodels notebook)

Part 2

Using Seaborn, visualize all four datasets.

hint: use sns.FacetGrid combined with plt.scatter

Code:

import random

import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels.api as sm
import statsmodels.formula.api as smf

sns.set_context("talk")

anascombe = pd.read_csv('anscombe.csv')
anascombe.head()

#1-1 Compute the mean and variance of both x and y 
x_mean = anascombe.groupby('dataset')['x'].mean()
print('mean of x', x_mean)
y_mean = anascombe.groupby('dataset')['y'].mean()
print('mean of y', y_mean)

x_var = anascombe.groupby('dataset')['x'].var()
print('variance of x', x_var)
y_var = anascombe.groupby('dataset')['y'].var()
print('variance of y', y_var)

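#1-2 Compute the correlation coefficient between x and y
# Series.corr aligns on the row index, so each group's x values are paired
# only with the y values from the same rows (i.e. the same dataset).
cor = anascombe.groupby("dataset")['x'].corr(anascombe['y'])
print('correlation coefficient between x and y', cor)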

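#1-3 Compute the linear regression line: y=beta_0+beta_1*x+epsilon
# Group by the dataset label instead of slicing fixed blocks of 11 rows,
# so the fit does not depend on row order or on the size of each dataset.
for name, group in anascombe.groupby('dataset'):
    X = sm.add_constant(group['x'])  # adds the intercept column 'const'
    reg_func = sm.OLS(group['y'], X).fit()
    # Look up coefficients by label; positional params[0] is deprecated in recent pandas.
    beta_0 = reg_func.params['const']
    beta_1 = reg_func.params['x']
    print('dataset ' + str(name), "y = " + str(beta_0) + " + " + str(beta_1) + "x")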

#2 Using Seaborn, visualize all four datasets
m = sns.FacetGrid(anascombe, col="dataset")    
m.map(plt.scatter, "x","y")
plt.show()
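The Part 1 results are easier to compare side by side in a single table. The block below is a sketch of one way to do that, not the code used in this post: it uses the statsmodels formula API (`smf.ols` with `"y ~ x"`) instead of `sm.OLS` with `add_constant`. The data is embedded inline so the example runs without `anscombe.csv`; the values are the standard published quartet, and the column names `dataset`, `x`, `y` are assumed to match the CSV. The names `quartet`, `df`, `rows`, and `summary` are illustrative only.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Standard Anscombe's quartet values (assumed to match anscombe.csv).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Build one long DataFrame with the same layout as the CSV: dataset, x, y.
df = pd.concat(
    [pd.DataFrame({"dataset": name, "x": xs, "y": ys}) for name, (xs, ys) in quartet.items()],
    ignore_index=True,
)

# Collect every Part 1 statistic per dataset into one row.
rows = {}
for name, group in df.groupby("dataset"):
    fit = smf.ols("y ~ x", data=group).fit()  # the formula adds the intercept itself
    rows[name] = {
        "mean_x": group["x"].mean(),
        "var_x": group["x"].var(),
        "mean_y": group["y"].mean(),
        "var_y": group["y"].var(),
        "corr": group["x"].corr(group["y"]),
        "intercept": fit.params["Intercept"],
        "slope": fit.params["x"],
    }

summary = pd.DataFrame.from_dict(rows, orient="index").round(3)
print(summary)
```

Every row of `summary` comes out nearly identical, which is the point of the exercise: only the scatter plots from Part 2 reveal how different the four datasets are.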