Regression Model Selection (in R)
Selection criteria:
- With Cp, AIC, and BIC, smaller values are better; for adjusted R², larger values are better.
- Model choice should be guided by economic theory and practical considerations, as well as by model selection criteria.
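As a minimal sketch of how these criteria are read in practice, the block below compares two candidate models. It assumes R's built-in `mtcars` data purely for illustration (it is not the Nelson-Plosser data used later), and the model names `small` and `large` are hypothetical. Cp is left out to keep the example short.

```r
# Two candidate models on the built-in mtcars data (illustrative only)
small <- lm(mpg ~ wt, data = mtcars)
large <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

criteria <- data.frame(
  model  = c("small", "large"),
  AIC    = c(AIC(small), AIC(large)),
  BIC    = c(BIC(small), BIC(large)),
  adj_R2 = c(summary(small)$adj.r.squared, summary(large)$adj.r.squared)
)
print(criteria)

# Smaller AIC and BIC are better; larger adjusted R^2 is better
best_by_aic   <- criteria$model[which.min(criteria$AIC)]
best_by_bic   <- criteria$model[which.min(criteria$BIC)]
best_by_adjr2 <- criteria$model[which.max(criteria$adj_R2)]
```

The three criteria need not agree, because BIC penalizes extra parameters more heavily than AIC once the sample has more than a handful of observations.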
Case study
This example illustrates model selection with the Nelson-Plosser data set of U.S. yearly macroeconomic variables from 1909 to 1970, a total of 62 years.
Variable descriptions:
# sp: Stock Prices, [Index; 1941-43=100],
# gnp.r: Real GNP, [Billions of 1958 Dollars],
# gnp.pc: Real Per Capita GNP, [1958 Dollars],
# ip: Industrial Production Index, [1967=100],
# cpi: Consumer Price Index, [1967=100],
# emp: Total Employment, [Thousands],
# bnd: Basic Yields 30-year Corporate Bonds, [% pa].
# response: diff(log(sp))
# regressors: diff(gnp.r), diff(gnp.pc), diff(log(ip)), diff(log(cpi)), diff(emp), diff(bnd).
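The response `diff(log(sp))` is the year-over-year log return of the stock index. The sketch below uses made-up index levels (hypothetical values, not taken from the data set) to show that a log difference is close to the percentage change when changes are small.

```r
# Hypothetical index levels, for illustration only
price <- c(100, 105, 102, 110)

log_return <- diff(log(price))               # log(p_t) - log(p_{t-1})
pct_change <- diff(price) / head(price, -1)  # (p_t - p_{t-1}) / p_{t-1}

# The two measures are close when changes are small
print(cbind(log_return, pct_change))
```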
# install.packages(c("fEcofin","MASS","car","lmtest"),repos= "http://cran.cnr.Berkeley.edu", dep=TRUE)
Install the required packages
rm(list=ls())
install.packages(c("MASS","car","leaps","faraway")) # install packages
install.packages("fEcofin", repos="http://R-Forge.R-project.org")
library(fEcofin) # load the package
??fEcofin # search the help system for fEcofin
data("nelsonplosser") # load the data set shipped with the package
str(nelsonplosser) # check the structure of the data
names(nelsonplosser)
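new_np=nelsonplosser[50:111, c(2,4,5,6,9,14,15)] # 62 yearly rows, 7 selected columns
attach(new_np) # make the columns available by name for the plots below
new_np # print the subset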
Inspect the data
par(mfrow=c(2,3)) # several plots on one page: 2 rows, 3 columns
plot(diff(gnp.r), type="b")
plot(diff(log(gnp.r)),type="b")
plot(diff(sqrt(gnp.r)),type="b")
plot(diff(ip),type="b") # funnel shape (spread grows over time); taking logs removes it
plot(diff(log(ip)),type="b")
plot(diff(sqrt(ip)),type="b")
# create a scatterplot matrix to examine the relationships between the variables
pairs(cbind(diff(log(sp)), diff(gnp.r), diff(gnp.pc), diff(log(ip)),
diff(log(cpi)), diff(emp), diff(bnd)))
The scatterplot matrix shows that var2, diff(gnp.r), and var3, diff(gnp.pc), have a strong linear relationship, and that var3, diff(gnp.pc), and var6, diff(emp), are also linearly related.
fit=lm(formula=diff(log(sp))~diff(gnp.r)+diff(gnp.pc)+diff(log(ip))+diff(log(cpi))+diff(emp)+diff(bnd), data=new_np ) # fit the linear regression
summary(fit) # display the fitted model
anova(fit)
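At the 0.1 significance level, only three coefficients are significant. Multicollinearity may be present, so look at the variance inflation factors (VIF). The VIF measures collinearity among the predictors. A common rule of thumb is that VIF > 10 indicates collinearity.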
library(faraway)
vif(fit)
Output:
   diff(gnp.r)   diff(gnp.pc)  diff(log(ip)) diff(log(cpi))      diff(emp)      diff(bnd)
     16.030836      31.777449       3.325433       1.290405      10.948953       1.481480
The results show that diff(gnp.r), diff(gnp.pc), and diff(emp) are collinear: each has a VIF above 10.
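To make the VIF numbers less of a black box, here is a minimal sketch that computes them by hand from the definition VIF_j = 1 / (1 - R_j^2), where R_j^2 is the R-squared from regressing predictor j on the other predictors. The helper `vif_by_hand` is a hypothetical name, and the built-in `mtcars` data stands in for the Nelson-Plosser data so that the block runs without extra packages.

```r
# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
# predictor j on all the other predictors.
vif_by_hand <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

# Illustrative predictors from the built-in mtcars data
v <- vif_by_hand(mtcars, c("wt", "hp", "disp", "qsec"))
print(round(v, 2))
print(names(v)[v > 10])  # predictors flagged by the VIF > 10 rule of thumb
```

A VIF of 1 means a predictor is uncorrelated with the others; the value grows without bound as R_j^2 approaches 1.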
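Next, we use the stepAIC function (from the MASS package) to find a more parsimonious model. stepAIC is a variable selection procedure in R: it starts from a model specified by the user and then adds or deletes variables one at a time. At each step, it makes the addition or deletion that improves AIC the most, that is, the one that lowers AIC the most. In this example, stepAIC starts with all six predictors.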
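The mechanics of a single backward step can be sketched with base R alone. The block below uses `drop1()` to list the AIC of the current model and of every one-term deletion, then runs `step()`, base R's counterpart to `stepAIC`, to repeat that comparison until no deletion lowers AIC. The built-in `mtcars` data and the model `full` are assumptions for illustration, not the model fitted in this post.

```r
# Start from a model with several predictors (illustrative only)
full <- lm(mpg ~ wt + hp + disp + qsec, data = mtcars)

# AIC of the current model ("<none>") and of each one-term deletion
tab <- drop1(full)
print(tab)

# The move a backward step would make: the row with the smallest AIC.
# If that row is "<none>", no deletion helps and the search stops.
best <- rownames(tab)[which.min(tab$AIC)]
print(best)

# Repeat the comparison automatically until AIC stops improving
reduced <- step(full, direction = "backward", trace = 0)
print(formula(reduced))
```

The AIC column here comes from `extractAIC()`, which for linear models differs from `AIC()` by an additive constant. That is why the values printed during stepwise selection can be negative, and it does not affect which model is chosen.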
#### AIC criterion
library(MASS)
fit_AIC=stepAIC(fit)
summary(fit_AIC) # retained predictors: diff(gnp.r), diff(gnp.pc), diff(log(ip)), diff(bnd)
##### Output #####
Start: AIC=-224.92
diff(log(sp)) ~ diff(gnp.r) + diff(gnp.pc) + diff(log(ip)) +
diff(log(cpi