Regression Analysis, Week 2

This post analyzes the diamond data set in R, showing how to fit a linear regression model to predict diamond prices, how to interpret the regression coefficients, and how to compute confidence intervals for them.


2.1 Regression





Example

diamond data set from UsingR

Data is diamond prices (Singapore dollars) and diamond weight in carats (standard measure of diamond mass, 0.2 $g$). To get the data use library(UsingR); data(diamond)

Plotting the fitted regression line and data

data(diamond)
plot(diamond$carat, diamond$price,  
     xlab = "Mass (carats)", 
     ylab = "Price (SIN $)", 
     bg = "lightblue", 
     col = "black", cex = 1.1, pch = 21,frame = FALSE)
abline(lm(price ~ carat, data = diamond), lwd = 2)

Fitting the linear regression model

fit <- lm(price ~ carat, data = diamond)
coef(fit)
(Intercept)       carat 
     -259.6      3721.0 
  • We estimate an expected 3721.02 (SIN) dollar increase in price for every carat increase in mass of diamond.
  • The intercept -259.63 is the expected price of a 0 carat diamond.

Getting a more interpretable intercept

fit2 <- lm(price ~ I(carat - mean(carat)), data = diamond)
coef(fit2)
           (Intercept) I(carat - mean(carat)) 
                 500.1                 3721.0 

Thus $500.1 is the expected price for the average sized diamond of the data (0.2042 carats).
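As a quick check (not part of the original notes), the centered intercept should agree with the fitted price at the mean carat weight:

mean(diamond$carat)                                               # about 0.2042
predict(fit, newdata = data.frame(carat = mean(diamond$carat)))   # fitted price at the mean
coef(fit2)[1]                                                     # centered intercept, same value (~500.1)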

Changing scale

  • A one carat increase in a diamond is pretty big; what about changing units to 1/10th of a carat?
  • We can do this by simply dividing the coefficient by 10.
    • We expect a 372.102 (SIN) dollar change in price for every 1/10th of a carat increase in mass of diamond.
  • Showing that it's the same if we rescale the Xs and refit
fit3 <- lm(price ~ I(carat * 10), data = diamond)
coef(fit3)
  (Intercept) I(carat * 10) 
       -259.6         372.1 

Predicting the price of a diamond

newx <- c(0.16, 0.27, 0.34)
coef(fit)[1] + coef(fit)[2] * newx
[1]  335.7  745.1 1005.5
predict(fit, newdata = data.frame(carat = newx))
     1      2      3 
 335.7  745.1 1005.5 

2.2 Residuals




Code

data(diamond)
y <- diamond$price; x <- diamond$carat; n <- length(y)
fit <- lm(y ~ x)
e <- resid(fit)
yhat <- predict(fit)
max(abs(e -(y - yhat)))
[1] 9.486e-13
max(abs(e - (y - coef(fit)[1] - coef(fit)[2] * x)))
[1] 9.486e-13


 
Residuals are the signed length of the red lines

plot(diamond$carat, diamond$price,
     xlab = "Mass (carats)",
     ylab = "Price (SIN $)",
     bg = "lightblue",
     col = "black", cex = 1.1, pch = 21, frame = FALSE)
abline(fit, lwd = 2)
for (i in 1 : n)
  lines(c(x[i], x[i]), c(y[i], yhat[i]), col = "red", lwd = 2)





Diamond example

y <- diamond$price; x <- diamond$carat; n <- length(y)
fit <- lm(y ~ x)
summary(fit)$sigma
[1] 31.84
sqrt(sum(resid(fit)^2) / (n - 2))
[1] 31.84




Some facts about $R^2$

  • $R^2$ is the percentage of variation explained by the regression model.
  • $0 \leq R^2 \leq 1$
  • $R^2$ is the sample correlation squared (see the check after this list).
  • $R^2$ can be a misleading summary of model fit.
    • Deleting data can inflate $R^2$.
    • (For later.) Adding terms to a regression model always increases $R^2$.
  • Do example(anscombe) to see Anscombe's data sets, which have:
    • Basically same mean and variance of X and Y.
    • Identical correlations (hence same $R^2$ ).
    • Same linear regression relationship.
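
A quick check on the diamond fit (assuming fit from above is still in the workspace): the $R^2$ reported by summary() equals the squared sample correlation.

summary(fit)$r.squared                # R^2 reported by the model summary
cor(diamond$price, diamond$carat)^2   # squared sample correlation, same value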

2.3 Inference





Example diamond data set

library(UsingR); data(diamond)
y <- diamond$price; x <- diamond$carat; n <- length(y)
beta1 <- cor(y, x) * sd(y) / sd(x)
beta0 <- mean(y) - beta1 * mean(x)
e <- y - beta0 - beta1 * x
sigma <- sqrt(sum(e^2) / (n-2)) 
ssx <- sum((x - mean(x))^2)
seBeta0 <- (1 / n + mean(x) ^ 2 / ssx) ^ .5 * sigma 
seBeta1 <- sigma / sqrt(ssx)
tBeta0 <- beta0 / seBeta0; tBeta1 <- beta1 / seBeta1
pBeta0 <- 2 * pt(abs(tBeta0), df = n - 2, lower.tail = FALSE)
pBeta1 <- 2 * pt(abs(tBeta1), df = n - 2, lower.tail = FALSE)
coefTable <- rbind(c(beta0, seBeta0, tBeta0, pBeta0), c(beta1, seBeta1, tBeta1, pBeta1))
colnames(coefTable) <- c("Estimate", "Std. Error", "t value", "P(>|t|)")
rownames(coefTable) <- c("(Intercept)", "x")

Example continued

coefTable
            Estimate Std. Error t value   P(>|t|)
(Intercept)   -259.6      17.32  -14.99 2.523e-19
x             3721.0      81.79   45.50 6.751e-40
fit <- lm(y ~ x); 
summary(fit)$coefficients
            Estimate Std. Error t value  Pr(>|t|)
(Intercept)   -259.6      17.32  -14.99 2.523e-19
x             3721.0      81.79   45.50 6.751e-40

Getting a confidence interval

sumCoef <- summary(fit)$coefficients
sumCoef[1,1] + c(-1, 1) * qt(.975, df = fit$df) * sumCoef[1, 2]
[1] -294.5 -224.8
sumCoef[2,1] + c(-1, 1) * qt(.975, df = fit$df) * sumCoef[2, 2]
[1] 3556 3886

With 95% confidence, we estimate that a 0.1 carat increase in diamond size results in a 355.6 to 388.6 increase in price in (Singapore) dollars.
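
Equivalently (a small addition to the notes), dividing the slope interval by 10 gives the change per 1/10th of a carat directly:

(sumCoef[2,1] + c(-1, 1) * qt(.975, df = fit$df) * sumCoef[2, 2]) / 10
[1] 355.6 388.6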

Plotting the prediction intervals

plot(x, y, frame=FALSE,xlab="Carat",ylab="Dollars",pch=21,col="black", bg="lightblue", cex=2)
abline(fit, lwd = 2)
xVals <- seq(min(x), max(x), by = .01)
yVals <- beta0 + beta1 * xVals
se1 <- sigma * sqrt(1 / n + (xVals - mean(x))^2/ssx)
se2 <- sigma * sqrt(1 + 1 / n + (xVals - mean(x))^2/ssx)
lines(xVals, yVals + 2 * se1)
lines(xVals, yVals - 2 * se1)
lines(xVals, yVals + 2 * se2)
lines(xVals, yVals - 2 * se2)

Discussion

  • Both intervals have varying widths.
    • Least width at the mean of the Xs.
  • We are quite confident in the regression line, so that interval is very narrow.
    • If we knew $\beta_0$ and $\beta_1$ this interval would have zero width.
  • The prediction interval must incorporate the variability in the data around the line.
    • Even if we knew $\beta_0$ and $\beta_1$ this interval would still have width.

In R

newdata <- data.frame(x = xVals)
p1 <- predict(fit, newdata, interval = ("confidence"))
p2 <- predict(fit, newdata, interval = ("prediction"))
plot(x, y, frame=FALSE,xlab="Carat",ylab="Dollars",pch=21,col="black", bg="lightblue", cex=2)
abline(fit, lwd = 2)
lines(xVals, p1[,2]); lines(xVals, p1[,3])
lines(xVals, p2[,2]); lines(xVals, p2[,3])



2.4 Multivariable regression

Multivariable regression analyses

  • If I were to present evidence of a relationship between breath mint usage (mints per day, X) and pulmonary function (measured in FEV), you would be skeptical.
    • Likely, you would say, 'Smokers tend to use more breath mints than non-smokers, and smoking is related to a loss in pulmonary function. That's probably the culprit.'
    • If asked what would convince you, you would likely say, 'If non-smoking breath mint users had lower lung function than non-smoking non-breath mint users and, similarly, if smoking breath mint users had lower lung function than smoking non-breath mint users, I'd be more inclined to believe you'.
  • In other words, to even consider my results, I would have to demonstrate that they hold while holding smoking status fixed.

Multivariable regression analyses

  • An insurance company is interested in how last year's claims can predict a person's time in the hospital this year.
    • They want to use an enormous amount of data contained in claims to predict a single number. Simple linear regression is not equipped to handle more than one predictor.
  • How can one generalize SLR to incorporate lots of regressors for the purpose of prediction?
  • What are the consequences of adding lots of regressors?
    • Surely there must be consequences to throwing variables in that aren't related to Y?
    • Surely there must be consequences to omitting variables that are? (A small simulation sketch follows this list.)
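
A minimal simulation sketch (not from the original notes; the variables are made up) hints at both consequences: omitting a regressor that matters biases the remaining coefficient, while adding an unrelated one mostly just costs precision.

## x1 affects y; x2 is correlated with x1 and also affects y; junk is unrelated to y
n <- 1000; x1 <- rnorm(n); x2 <- x1 + rnorm(n); junk <- rnorm(n)
y <- x1 + x2 + rnorm(n)
coef(lm(y ~ x1))              # x2 omitted: coefficient on x1 is biased (near 2, not 1)
coef(lm(y ~ x1 + x2))         # both regressors included: coefficients near 1
coef(lm(y ~ x1 + x2 + junk))  # unrelated regressor added: estimates barely change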













Demonstration, using a simulated example, that regressing residuals on residuals recovers the multivariable coefficient

Linear model with three regressors and no intercept

n <- 100; x <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y <- x + x2 + x3 + rnorm(n, sd = .1)
## e(a, b) returns the residual of a after regression through the origin on b
e <- function(a, b) a - sum(a * b) / sum(b ^ 2) * b
## residualize y and x for x2, then for the x2-adjusted x3
ey <- e(e(y, x2), e(x3, x2))
ex <- e(e(x, x2), e(x3, x2))
## the slope of the residuals matches the multivariable coefficient for x
sum(ey * ex) / sum(ex ^ 2)
[1] 1.004
coef(lm(y ~ x + x2 + x3 - 1)) #the -1 removes the intercept term
     x     x2     x3 
1.0040 0.9899 1.0078 

Showing that order doesn't matter

ey <- e(e(y, x3), e(x2, x3))
ex <- e(e(x, x3), e(x2, x3))
sum(ey * ex) / sum(ex ^ 2)
[1] 1.004
coef(lm(y ~ x + x2 + x3 - 1)) #the -1 removes the intercept term
     x     x2     x3 
1.0040 0.9899 1.0078 

Residuals again

ey <- resid(lm(y ~ x2 + x3 - 1))
ex <- resid(lm(x ~ x2 + x3 - 1))
sum(ey * ex) / sum(ex ^ 2)
[1] 1.004
coef(lm(y ~ x + x2 + x3 - 1)) #the -1 removes the intercept term
     x     x2     x3 
1.0040 0.9899 1.0078 



Linear models

  • Linear models are the single most important applied statistical and machine learning technique, by far.
  • Some amazing things that you can accomplish with linear models (a small sketch follows this list):
    • Decompose a signal into its harmonics.
    • Flexibly fit complicated functions.
    • Fit factor variables as predictors.
    • Uncover complex multivariate relationships with the response.
    • Build accurate prediction models.
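
As a tiny illustration of the first point (simulated data, not from the notes), lm can recover the coefficients of known harmonics from a noisy signal:

## Recover the amplitudes of two harmonic terms from a noisy signal
tt <- seq(0, 1, length.out = 200)
signal <- 2 * sin(2 * pi * tt) + 0.5 * cos(2 * pi * tt) + rnorm(200, sd = .2)
coef(lm(signal ~ sin(2 * pi * tt) + cos(2 * pi * tt)))  # intercept near 0, slopes near 2 and 0.5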
