前言
这学期选了STAT 526,课程有关genealized linear models, mixed effect models,nonoarametric regression models的介绍非常详实,课程教材使用的是
Extending the Linear Model with R: Generalized Linear, Mixed Effects and Nonparametric Regression Models 。Review也会按照教材章节的顺序进行整理,另外博主在自学时,也参考了PennState STAT 502的课程网页,相比书本的内容更加简洁明了,链接如下:https://newonlinecourses.science.psu.edu/stat502/node/1/
此文的主要目的是对课程中涉及到的统计知识以及对应的R语言的使用进行总结,也算是正式开始写我的第一篇技术博客。
统计常用的R指令
faraway 包是本书以及其他基于R的统计教材常用的数据包。这里以对gavote包进行线性回归为例,简单介绍数据观察,绘图,建立模型以及读取模型参数常用的函数以及指令。
读取faraway包中的数据gavote
data(gavote,package='faraway')
显示数据集的前十个元素
head(gavote, n=10L) # define the number of elements you want to observe
equip econ perAA rural atlanta gore bush other votes ballots
APPLING LEVER poor 0.182 rural notAtlanta 2093 3940 66 6099 6617
ATKINSON LEVER poor 0.230 rural notAtlanta 821 1228 22 2071 2149
BACON LEVER poor 0.131 rural notAtlanta 956 2010 29 2995 3347
BAKER OS-CC poor 0.476 rural notAtlanta 893 615 11 1519 1607
BALDWIN LEVER middle 0.359 rural notAtlanta 5893 6041 192 12126 12785
BANKS LEVER middle 0.024 rural notAtlanta 1220 3202 111 4533 4773
BARROW OS-CC middle 0.079 urban notAtlanta 3657 7925 520 12102 12522
BARTOW OS-PC middle 0.079 urban Atlanta 7508 14720 552 22780 23735
BEN.HILL OS-PC poor 0.282 rural notAtlanta 2234 2381 46 4661 5741
BERRIEN OS-CC poor 0.107 rural notAtlanta 1640 2718 52 4410 4475
显示数据集结构
str(gavote) # show the attributes of each variables involved in the model
output:
’data.frame’: 159 obs. of 10 variables:
$ equip : Factor w/ 5 levels "LEVER","OS-CC",..: 1 1 1 2 1 1 2 3 3 2 ...
$ econ : Factor w/ 3 levels "middle","poor",..: 2 2 2 2 1 1 1 1 2 2 ...
$ perAA : num 0.182 0.23 0.131 0.476 0.359 0.024 0.079 0.079 0.282 0.107 ...
$ rural : Factor w/ 2 levels "rural","urban": 1 1 1 1 1 1 2 2 1 1 ...
$ atlanta: Factor w/ 2 levels "Atlanta","notAtlanta": 2 2 2 2 2 2 2 1 2 2 ...
$ gore : int 2093 821 956 893 5893 1220 3657 7508 2234 1640 ...
$ bush : int 3940 1228 2010 615 6041 3202 7925 14720 2381 2718 ...
$ other : int 66 22 29 11 192 111 520 552 46 52 ...
$ votes : int 6099 2071 2995 1519 12126 4533 12102 22780 4661 4410 ...
$ ballots: int 6617 2149 3347 1607 12785 4773 12522 23735 5741 4475 ...
对数据集进行描述性统计
summary(gavote) # make a statistical discrption of the dataset
equip econ perAA rural atlanta
LEVER:74 middle:69 Min. :0.0000 rural:117 Atlanta : 15
OS-CC:44 poor :72 1st Qu.:0.1115 urban: 42 notAtlanta:144
OS-PC:22 rich :18 Median :0.2330
PAPER: 2 Mean :0.2430
PUNCH:17 3rd Qu.:0.3480
Max. :0.7650
gore bush other votes
Min. : 249 Min. : 271 Min. : 5.0 Min. : 832
1st Qu.: 1386 1st Qu.: 1804 1st Qu.: 30.0 1st Qu.: 3506
Median : 2326 Median : 3597 Median : 86.0 Median : 6299
Mean : 7020 Mean : 8929 Mean : 381.7 Mean : 16331
3rd Qu.: 4430 3rd Qu.: 7468 3rd Qu.: 210.0 3rd Qu.: 11846
Max. :154509 Max. :140494 Max. :7920.0 Max. :263211
ballots undercount
Min. : 881 Min. :0.00000
1st Qu.: 3694 1st Qu.:0.02779
Median : 6712 Median :0.03983
Mean : 16927 Mean :0.04379
3rd Qu.: 12251 3rd Qu.:0.05647
Max. :280975 Max. :0.18812
密度分布图
density函数就可以显示数据分布,类似于平滑的hist()函数。
rug函数可显示个体数据在数据横轴中的位置。
plot(density(gavote$undercount),main="Undercount")
rug(gavote$undercount)
# added a “rug” to our display that makes it possible to discern the individual data points

Pareto Chart (帕累托图)
Pareto chart, named after Vilfredo Pareto, is a type of chart where individual values are represented in descending order by bars.
barplot(sort(table(gavote$equip),decreasing=TRUE),las=2)
# las=? labels are parallel (=0) or perpendicular(=2) to axis

Boxplot(箱型图)
箱型图可显示数据的六类特征,最大值,最小值,中值,第一分位数,第三分位数以及离群点。
plot(undercount ~ equip, gavote, xlab="", las=1)

Cross tabulation(交叉制表)
xtabs(~ atlanta + rural, gavote) # two-way cross table
rural
atlanta rural urban
Atlanta 1 14
notAtlanta 116 28
提取协方差矩阵
注意:R语言中数组编号从1开始。
缺省行变量,即可得变量协方差矩阵
nix <- c(3,10,11,12)
cor(gavote[,nix])
#We select these columns from the dataframe and compute the correlation.
The syntax for selecting rows and/or columns is dataframe[rows,columns]
where rows and/or columns are vectors of indices.
perAA ballots undercount pergore
perAA 1.0000000 0.02773230 0.2296874 0.92165247
ballots 0.0277323 1.00000000 -0.1551724 0.09561688
undercount 0.2296874 -0.15517245 1.0000000 0.21876519
pergore 0.9216525 0.09561688 0.2187652 1.00000000
Linear regression(线性回归)
lmod <- lm(undercount ~ pergore+ perAA,data=gavote,)
提取模型系数
coef(lmod)
提取预测值
predict(lmod)
获取残差
residuals()函数的应用将在后续文章中继续。
residuals(lmod)
模型残差平方和(Residual sum of squares/ RSS)
deviance(lmod)
模型自由度
df.residual(lmod)
提取模型信息
lmodsum <- summary(lmod)
lmodsum$r.squared #模型R^2
cor(predict(lmod),gavote$undercount)^2 #另一种获取R^2的方法
lmodsum$adj.r.squared #模型修正R^2
sumary()相比summary()可以获得更简洁的模型信息
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.032376 0.012761 2.5370 0.01216
pergore 0.010979 0.046922 0.2340 0.81531
perAA 0.028533 0.030738 0.9283 0.35470
n = 159, p = 3, Residual SE = 0.02445, R-Squared = 0.05
Linear Model (线性回归)
线性回归的一般形式:
Y = β 0 + β 1 X 1 + β 2 X 2 + ⋯ + β p − 1 X p − 1 + ϵ Y=\beta_0+\beta_1X_1+\beta_2X_2+\cdots+\beta_{p-1}X_{p-1}+\epsilon Y=β0+β1X1+β2X2+