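This exercise comes from an advanced programming course.
Given a CSV file, complete the following two tasks:
The corresponding code is as follows: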
import random
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
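
# Load the data; the 'dataset' column labels each of the four groups.
anascombe = pd.read_csv('anscombe.csv')

# Per-dataset mean and variance of x and y.
print(anascombe.groupby('dataset')['x'].mean())
print(anascombe.groupby('dataset')['y'].mean())
print(anascombe.groupby('dataset')['x'].var())
print(anascombe.groupby('dataset')['y'].var())

# Per-dataset correlation matrix between x and y.
print(anascombe.groupby('dataset').corr())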

dataset_names = ['I', 'II', 'III', 'IV']
for i in dataset_names:
    # Rows belonging to the current dataset.
    subset = anascombe[anascombe.dataset == i]
    n = len(subset)
    # Random boolean mask: each row goes to the training set with probability 0.7.
    is_train = np.random.rand(n) < 0.7
    train = subset[is_train].reset_index(drop=True)
    test = subset[~is_train].reset_index(drop=True)
    # Ordinary least squares fit of y on x, using the training rows only.
    lin_model = smf.ols('y ~ x', train).fit()
    print(lin_model.summary())

# One scatter plot of y against x per dataset, side by side.
g = sns.FacetGrid(anascombe, col='dataset')
g.map(plt.scatter, 'x', 'y')
plt.show()
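
As a side note, the loop in the script builds a `test` split but never uses it. The block below is a minimal sketch of how the held-out rows could be scored. It is not part of the original script. The helper names `fit_line`, `random_mask` and `evaluate_split` are hypothetical, the closed-form least-squares fit stands in for `smf.ols('y ~ x', train)`, and dataset I is embedded with the standard published Anscombe values (an assumption about the contents of `anscombe.csv`) so that the sketch runs with the standard library alone.

```python
import random

# Dataset I of Anscombe's quartet (standard published values), embedded as an
# assumption so the sketch runs without anscombe.csv.
X_I = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
Y_I = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]


def fit_line(xs, ys):
    """Closed-form simple least squares; returns (intercept, slope)."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two points to fit a line")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct x values to fit a line")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def random_mask(n, train_fraction=0.7, seed=None):
    """Boolean mask of length n; each entry is True with probability train_fraction."""
    rng = random.Random(seed)
    return [rng.random() < train_fraction for _ in range(n)]


def evaluate_split(xs, ys, is_train):
    """Fit on rows where is_train is True, score on the rest.

    Returns (intercept, slope, rmse); rmse is None when no row is held out.
    """
    train = [(x, y) for x, y, keep in zip(xs, ys, is_train) if keep]
    test = [(x, y) for x, y, keep in zip(xs, ys, is_train) if not keep]
    intercept, slope = fit_line([x for x, _ in train], [y for _, y in train])
    if not test:
        return intercept, slope, None
    # Root mean squared error of the fitted line on the held-out rows.
    mse = sum((y - (intercept + slope * x)) ** 2 for x, y in test) / len(test)
    return intercept, slope, mse ** 0.5
```

Calling `evaluate_split(X_I, Y_I, random_mask(len(X_I)))` returns the fitted intercept and slope together with the test RMSE for one random split. With only 11 rows per dataset the held-out set is tiny, so the RMSE will vary noticeably from one split to the next.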

Command-line output of the program:
dataset
I 9.0
II 9.0
III 9.0
IV 9.0
Name: x, dtype: float64
dataset
I 7.500909
II 7.500909
III 7.500000
IV 7