统计学笔记一：STAT 526 序章(1)

最新推荐文章于 2021-06-20 13:52:17 发布

Zimo_Zhang

最新推荐文章于 2021-06-20 13:52:17 发布

阅读量1.2k

点赞数

CC 4.0 BY-SA版权

分类专栏：统计 (Statistic Methodology)

本文链接：https://blog.youkuaiyun.com/Zimo_Zhang/article/details/90014326

本文是STAT 526课程笔记的第一部分，主要介绍了统计学中常用的R指令，包括线性回归、F检验、T检验和回归诊断。通过实例展示了如何在R中进行数据观察、绘图、建立模型，并解释了各种统计概念，如线性模型的一般形式、F检验的用途以及T检验的注意事项。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

前言

这学期选了STAT 526，课程有关genealized linear models, mixed effect models，nonoarametric regression models的介绍非常详实，课程教材使用的是
Extending the Linear Model with R: Generalized Linear, Mixed Effects and Nonparametric Regression Models 。Review也会按照教材章节的顺序进行整理，另外博主在自学时，也参考了PennState STAT 502的课程网页，相比书本的内容更加简洁明了，链接如下：https://newonlinecourses.science.psu.edu/stat502/node/1/

此文的主要目的是对课程中涉及到的统计知识以及对应的R语言的使用进行总结，也算是正式开始写我的第一篇技术博客。

统计常用的R指令

faraway 包是本书以及其他基于R的统计教材常用的数据包。这里以对gavote包进行线性回归为例,简单介绍数据观察，绘图，建立模型以及读取模型参数常用的函数以及指令。
读取faraway包中的数据gavote

data(gavote,package='faraway')

显示数据集的前十个元素

head(gavote, n=10L) # define the number of elements you want to observe

         equip   econ perAA rural    atlanta gore  bush other votes ballots
APPLING  LEVER   poor 0.182 rural notAtlanta 2093  3940    66  6099    6617
ATKINSON LEVER   poor 0.230 rural notAtlanta  821  1228    22  2071    2149
BACON    LEVER   poor 0.131 rural notAtlanta  956  2010    29  2995    3347
BAKER    OS-CC   poor 0.476 rural notAtlanta  893   615    11  1519    1607
BALDWIN  LEVER middle 0.359 rural notAtlanta 5893  6041   192 12126   12785
BANKS    LEVER middle 0.024 rural notAtlanta 1220  3202   111  4533    4773
BARROW   OS-CC middle 0.079 urban notAtlanta 3657  7925   520 12102   12522
BARTOW   OS-PC middle 0.079 urban    Atlanta 7508 14720   552 22780   23735
BEN.HILL OS-PC   poor 0.282 rural notAtlanta 2234  2381    46  4661    5741
BERRIEN  OS-CC   poor 0.107 rural notAtlanta 1640  2718    52  4410    4475

显示数据集结构

str(gavote) # show the attributes of each variables involved in the model

output：
’data.frame’: 159 obs. of 10 variables:
$ equip : Factor w/ 5 levels "LEVER","OS-CC",..: 1 1 1 2 1 1 2 3 3 2 ...
$ econ : Factor w/ 3 levels "middle","poor",..: 2 2 2 2 1 1 1 1 2 2 ...
$ perAA : num 0.182 0.23 0.131 0.476 0.359 0.024 0.079 0.079 0.282 0.107 ...
$ rural : Factor w/ 2 levels "rural","urban": 1 1 1 1 1 1 2 2 1 1 ...
$ atlanta: Factor w/ 2 levels "Atlanta","notAtlanta": 2 2 2 2 2 2 2 1 2 2 ...
$ gore : int 2093 821 956 893 5893 1220 3657 7508 2234 1640 ...
$ bush : int 3940 1228 2010 615 6041 3202 7925 14720 2381 2718 ...
$ other : int 66 22 29 11 192 111 520 552 46 52 ...
$ votes : int 6099 2071 2995 1519 12126 4533 12102 22780 4661 4410 ...
$ ballots: int 6617 2149 3347 1607 12785 4773 12522 23735 5741 4475 ...

对数据集进行描述性统计

summary(gavote) # make a statistical discrption of the dataset

 equip        econ        perAA          rural           atlanta   
 LEVER:74   middle:69   Min.   :0.0000   rural:117   Atlanta   : 15  
 OS-CC:44   poor  :72   1st Qu.:0.1115   urban: 42   notAtlanta:144  
 OS-PC:22   rich  :18   Median :0.2330                               
 PAPER: 2               Mean   :0.2430                               
 PUNCH:17               3rd Qu.:0.3480                               
                        Max.   :0.7650                               
      gore             bush            other            votes       
 Min.   :   249   Min.   :   271   Min.   :   5.0   Min.   :   832  
 1st Qu.:  1386   1st Qu.:  1804   1st Qu.:  30.0   1st Qu.:  3506  
 Median :  2326   Median :  3597   Median :  86.0   Median :  6299  
 Mean   :  7020   Mean   :  8929   Mean   : 381.7   Mean   : 16331  
 3rd Qu.:  4430   3rd Qu.:  7468   3rd Qu.: 210.0   3rd Qu.: 11846  
 Max.   :154509   Max.   :140494   Max.   :7920.0   Max.   :263211  
    ballots         undercount     
 Min.   :   881   Min.   :0.00000  
 1st Qu.:  3694   1st Qu.:0.02779  
 Median :  6712   Median :0.03983  
 Mean   : 16927   Mean   :0.04379  
 3rd Qu.: 12251   3rd Qu.:0.05647  
 Max.   :280975   Max.   :0.18812

密度分布图
density函数就可以显示数据分布，类似于平滑的hist()函数。
rug函数可显示个体数据在数据横轴中的位置。

plot(density(gavote$undercount),main="Undercount")
rug(gavote$undercount) 
# added a “rug” to our display that makes it possible to discern the individual data points

Pareto Chart （帕累托图）
Pareto chart, named after Vilfredo Pareto, is a type of chart where individual values are represented in descending order by bars.

barplot(sort(table(gavote$equip),decreasing=TRUE),las=2) 
#	las=?  labels are parallel (=0) or perpendicular(=2) to axis

Boxplot（箱型图）
箱型图可显示数据的六类特征，最大值，最小值，中值，第一分位数，第三分位数以及离群点。

plot(undercount ~ equip, gavote, xlab="", las=1)

Cross tabulation（交叉制表）

xtabs(~ atlanta + rural, gavote) # two-way cross table

 rural
atlanta      rural urban
  Atlanta        1    14
  notAtlanta   116    28

提取协方差矩阵
注意：R语言中数组编号从1开始。
缺省行变量，即可得变量协方差矩阵

nix <- c(3,10,11,12)
cor(gavote[,nix])
#We select these columns from the dataframe and compute the correlation. 
The syntax for selecting rows and/or columns is dataframe[rows,columns] 
where rows and/or columns are vectors of indices.

		       perAA     ballots undercount    pergore
perAA      1.0000000  0.02773230  0.2296874 0.92165247
ballots    0.0277323  1.00000000 -0.1551724 0.09561688
undercount 0.2296874 -0.15517245  1.0000000 0.21876519
pergore    0.9216525  0.09561688  0.2187652 1.00000000

Linear regression（线性回归）

lmod <- lm(undercount ~ pergore+ perAA,data=gavote,)

提取模型系数

coef(lmod)

提取预测值

predict(lmod)

获取残差
residuals()函数的应用将在后续文章中继续。

residuals(lmod)

模型残差平方和(Residual sum of squares/ RSS)

deviance(lmod)

模型自由度

df.residual(lmod)

提取模型信息

lmodsum <- summary(lmod)
lmodsum$r.squared #模型R^2
cor(predict(lmod),gavote$undercount)^2 #另一种获取R^2的方法
lmodsum$adj.r.squared #模型修正R^2

sumary()相比summary()可以获得更简洁的模型信息

            Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.032376   0.012761  2.5370  0.01216
pergore     0.010979   0.046922  0.2340  0.81531
perAA       0.028533   0.030738  0.9283  0.35470

n = 159, p = 3, Residual SE = 0.02445, R-Squared = 0.05