2.1 Regression
Example: the diamond data set from UsingR
The data are diamond prices (in Singapore dollars) and diamond weights in carats (the standard measure of diamond mass; 1 carat = 0.2 g). To get the data, use library(UsingR); data(diamond).
Plotting the fitted regression line and data
library(UsingR)
data(diamond)
plot(diamond$carat, diamond$price,
xlab = "Mass (carats)",
ylab = "Price (SIN $)",
bg = "lightblue",
col = "black", cex = 1.1, pch = 21,frame = FALSE)
abline(lm(price ~ carat, data = diamond), lwd = 2)
Fitting the linear regression model
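The coefficients printed below come from fitting the model and extracting its coefficients; a minimal sketch of that call (the object name fit is our choice, and it is reused in the predictions later in this section):

```r
# Fit price as a linear function of mass and print the coefficients
fit <- lm(price ~ carat, data = diamond)
coef(fit)
```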
(Intercept) carat
-259.6 3721.0
- We estimate an expected 3721.02 (SIN) dollar increase in price for every one carat increase in the mass of a diamond.
- The intercept -259.63 is the expected price of a 0 carat diamond, which is of no practical interest on its own.
Getting a more interpretable intercept
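The output header shows the model refit with mass centered at its mean, so that the intercept becomes the expected price at the average mass; a sketch of that refit (fit2 is our name):

```r
# Centering carat at its mean makes the intercept the price of the average diamond
fit2 <- lm(price ~ I(carat - mean(carat)), data = diamond)
coef(fit2)
```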
(Intercept) I(carat - mean(carat))
500.1 3721.0
Thus $500.1 is the expected price for an average-sized diamond in the data (0.2042 carats).
Changing scale
- A one carat increase in a diamond is pretty big; what about changing units to 1/10 of a carat?
- We can do this by simply dividing the coefficient by 10.
- We expect a 372.102 (SIN) dollar change in price for every 1/10th of a carat increase in the mass of a diamond.
- Showing that it's the same if we rescale the Xs and refit
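A sketch of that refit, matching the output header below (fit3 is our name):

```r
# Multiplying carat by 10 divides the slope by 10; the intercept is unchanged
fit3 <- lm(price ~ I(carat * 10), data = diamond)
coef(fit3)
```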
(Intercept) I(carat * 10)
-259.6 372.1
Predicting the price of a diamond
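The two outputs below are the same predictions, first computed by hand from the coefficients and then with predict. The three new masses, 0.16, 0.27 and 0.34 carats, are inferred from the printed values; a sketch:

```r
# Predict the price at three new diamond masses (in carats)
newx <- c(0.16, 0.27, 0.34)
coef(fit)[1] + coef(fit)[2] * newx                # by hand: beta0 + beta1 * newx
predict(fit, newdata = data.frame(carat = newx))  # the same numbers via predict()
```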
[1] 335.7 745.1 1005.5
1 2 3
335.7 745.1 1005.5
2.2 Residuals
Code
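The two (numerically zero) values below verify that the residuals reported by lm equal the residuals computed by hand; a sketch of that check, defining the objects (y, x, n, fit, e, yhat) that the later chunks in this section assume:

```r
library(UsingR); data(diamond)
y <- diamond$price; x <- diamond$carat; n <- length(y)
fit <- lm(y ~ x)
e <- resid(fit)       # residuals reported by lm
yhat <- predict(fit)  # fitted values
max(abs(e - (y - yhat)))                             # e equals y minus the fitted values
max(abs(e - (y - coef(fit)[1] - coef(fit)[2] * x)))  # ... and y minus (beta0 + beta1 * x)
```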
[1] 9.486e-13
[1] 9.486e-13
Residuals are the signed length of the red lines
```{r, echo = FALSE, fig.height=5, fig.width=5}
plot(diamond$carat, diamond$price,
     xlab = "Mass (carats)",
     ylab = "Price (SIN $)",
     bg = "lightblue",
     col = "black", cex = 1.1, pch = 21, frame = FALSE)
abline(fit, lwd = 2)
for (i in 1 : n)
  lines(c(x[i], x[i]), c(y[i], yhat[i]), col = "red", lwd = 2)
```
Diamond example
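Both values below are the residual standard deviation, once from summary and once by hand with n - 2 degrees of freedom; presumably computed as:

```r
# Two equivalent estimates of the residual standard deviation
summary(fit)$sigma
sqrt(sum(resid(fit)^2) / (n - 2))
```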
[1] 31.84
[1] 31.84
Some facts about $R^2$
- $R^2$ is the percentage of variation explained by the regression model.
- $0 \leq R^2 \leq 1$
- $R^2$ is the sample correlation squared (checked in the sketch after this list).
- $R^2$ can be a misleading summary of model fit.
- Deleting data can inflate $R^2$.
- (For later.) Adding terms to a regression model always increases $R^2$.
- Do example(anscombe) to see the following data.
- Basically the same mean and variance of X and Y.
- Identical correlations (hence the same $R^2$).
- Same linear regression relationship.
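As a quick check of the correlation fact on the diamond data (a sketch; fit is the simple regression from section 2.1):

```r
# For simple linear regression, R^2 equals the squared sample correlation
fit <- lm(price ~ carat, data = diamond)
summary(fit)$r.squared
cor(diamond$price, diamond$carat)^2  # identical
```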
2.3 Inference
Example diamond data set
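The first coefficient table on the next slide is computed by hand from the standard formulas for the slope, the intercept, and their standard errors. A sketch of that computation, which also defines the quantities (beta0, beta1, sigma, ssx, n) that the interval-plotting code below assumes:

```r
library(UsingR); data(diamond)
y <- diamond$price; x <- diamond$carat; n <- length(y)
beta1 <- cor(y, x) * sd(y) / sd(x)   # slope estimate
beta0 <- mean(y) - beta1 * mean(x)   # intercept estimate
e <- y - beta0 - beta1 * x           # residuals
sigma <- sqrt(sum(e^2) / (n - 2))    # residual standard deviation
ssx <- sum((x - mean(x))^2)
seBeta0 <- sqrt(1 / n + mean(x)^2 / ssx) * sigma
seBeta1 <- sigma / sqrt(ssx)
tBeta0 <- beta0 / seBeta0; tBeta1 <- beta1 / seBeta1  # t statistics
pBeta0 <- 2 * pt(abs(tBeta0), df = n - 2, lower.tail = FALSE)
pBeta1 <- 2 * pt(abs(tBeta1), df = n - 2, lower.tail = FALSE)
coefTable <- rbind(c(beta0, seBeta0, tBeta0, pBeta0),
                   c(beta1, seBeta1, tBeta1, pBeta1))
colnames(coefTable) <- c("Estimate", "Std. Error", "t value", "P(>|t|)")
rownames(coefTable) <- c("(Intercept)", "x")
```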
Example continued
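Of the two tables below, the first is the hand-computed coefTable and the second comes from lm, showing that they agree; a sketch (fit is reused by the intervals that follow):

```r
fit <- lm(y ~ x)
coefTable                  # computed by hand above
summary(fit)$coefficients  # computed by lm
```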
            Estimate Std. Error t value   P(>|t|)
(Intercept)   -259.6      17.32  -14.99 2.523e-19
x             3721.0      81.79   45.50 6.751e-40
            Estimate Std. Error t value  Pr(>|t|)
(Intercept)   -259.6      17.32  -14.99 2.523e-19
x             3721.0      81.79   45.50 6.751e-40
Getting a confidence interval
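The two intervals below are 95% t confidence intervals for the intercept and the slope (estimate plus or minus the t quantile times the standard error); a sketch:

```r
# 95% confidence intervals from the coefficient table
sumCoef <- summary(fit)$coefficients
sumCoef[1, 1] + c(-1, 1) * qt(.975, df = fit$df.residual) * sumCoef[1, 2]  # intercept
sumCoef[2, 1] + c(-1, 1) * qt(.975, df = fit$df.residual) * sumCoef[2, 2]  # slope
```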
[1] -294.5 -224.8
[1] 3556 3886
With 95% confidence, we estimate that a 0.1 carat increase in diamond size results in a 355.6 to 388.6 increase in price in (Singapore) dollars (one tenth of the slope interval above).
Plotting the prediction intervals
plot(x, y, frame = FALSE,
     xlab = "Carat", ylab = "Dollars",
     pch = 21, col = "black", bg = "lightblue", cex = 2)
abline(fit, lwd = 2)
xVals <- seq(min(x), max(x), by = .01)
yVals <- beta0 + beta1 * xVals
# Standard error of the estimated line (confidence interval)
se1 <- sigma * sqrt(1 / n + (xVals - mean(x))^2 / ssx)
# Standard error for a new observation (prediction interval); note the extra 1
se2 <- sigma * sqrt(1 + 1 / n + (xVals - mean(x))^2 / ssx)
lines(xVals, yVals + 2 * se1)
lines(xVals, yVals - 2 * se1)
lines(xVals, yVals + 2 * se2)
lines(xVals, yVals - 2 * se2)
Discussion
- Both intervals have varying widths.
- Least width at the mean of the Xs.
- We are quite confident in the regression line, so that interval is very narrow.
- If we knew $\beta_0$ and $\beta_1$ this interval would have zero width.
- The prediction interval must incorporate the variability in the data around the line.
- Even if we knew $\beta_0$ and $\beta_1$ this interval would still have width.
In R
newdata <- data.frame(x = xVals)
p1 <- predict(fit, newdata, interval = "confidence")
p2 <- predict(fit, newdata, interval = "prediction")
plot(x, y, frame = FALSE,
     xlab = "Carat", ylab = "Dollars",
     pch = 21, col = "black", bg = "lightblue", cex = 2)
abline(fit, lwd = 2)
# Columns 2 and 3 hold the lower and upper bounds of each interval
lines(xVals, p1[, 2]); lines(xVals, p1[, 3])
lines(xVals, p2[, 2]); lines(xVals, p2[, 3])
2.4 Multivariable regression
Multivariable regression analyses
- If I were to present evidence of a relationship between breath mint usage (mints per day, X) and pulmonary function (measured in FEV), you would be skeptical.
- Likely, you would say, 'Smokers tend to use more breath mints than non-smokers, and smoking is related to a loss in pulmonary function. That's probably the culprit.'
- If asked what would convince you, you would likely say, 'If non-smoking breath mint users had lower lung function than non-smoking non-breath mint users and, similarly, if smoking breath mint users had lower lung function than smoking non-breath mint users, I'd be more inclined to believe you'.
- In other words, to even consider my results, I would have to demonstrate that they hold while holding smoking status fixed.
Multivariable regression analyses
- An insurance company is interested in how last year's claims can predict a person's time in the hospital this year.
- They want to use an enormous amount of data contained in claims to predict a single number. Simple linear regression is not equipped to handle more than one predictor.
- How can one generalize SLR to incorporate lots of regressors for the purpose of prediction?
- What are the consequences of adding lots of regressors?
- Surely there must be consequences to throwing variables in that aren't related to Y?
- Surely there must be consequences to omitting variables that are?
Demonstration that it works using an example
Linear model with two variables and an intercept
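The paired outputs below come from a simulation: the coefficient of x in the full fit equals the regression-through-the-origin slope of the residuals of y and of x after each has had the other regressors removed. The data are random and the seed is not shown, so the exact numbers will not reproduce; a sketch consistent with their shape (the [-1] drops the intercept to match the printed coefficients):

```r
# Simulate a response built from three regressors plus a little noise
n <- 100; x <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y <- 1 + x + x2 + x3 + rnorm(n, sd = .1)
# Remove the effect of x2 and x3 from both y and x
ey <- resid(lm(y ~ x2 + x3))
ex <- resid(lm(x ~ x2 + x3))
sum(ey * ex) / sum(ex^2)       # regression through the origin of ey on ex
coef(lm(y ~ x + x2 + x3))[-1]  # its coefficient of x is the same number
```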
[1] 1.004
x x2 x3
1.0040 0.9899 1.0078
Showing that order doesn't matter
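Residualizing against x3 first and then x2 leaves the answer unchanged, since the residuals depend only on the set of regressors removed, not their order; a sketch:

```r
# The same adjustment with the order of x2 and x3 swapped
ey <- resid(lm(y ~ x3 + x2))
ex <- resid(lm(x ~ x3 + x2))
sum(ey * ex) / sum(ex^2)
coef(lm(y ~ x + x2 + x3))[-1]
```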
[1] 1.004
x x2 x3
1.0040 0.9899 1.0078
Residuals again
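The same slope can also be obtained by explicitly regressing the residuals through the origin with lm; a sketch (unname strips the coefficient's label so it prints like the ratio above):

```r
# Regression through the origin of the residualized y on the residualized x
ey <- resid(lm(y ~ x2 + x3))
ex <- resid(lm(x ~ x2 + x3))
unname(coef(lm(ey ~ ex - 1)))
coef(lm(y ~ x + x2 + x3))[-1]
```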
[1] 1.004
x x2 x3
1.0040 0.9899 1.0078
Linear models
- Linear models are the single most important applied statistical and machine learning technique, by far.
- Some amazing things that you can accomplish with linear models:
- Decompose a signal into its harmonics.
- Flexibly fit complicated functions.
- Fit factor variables as predictors.
- Uncover complex multivariate relationships with the response.
- Build accurate prediction models.