2.1 Regression
Example: the diamond data set from UsingR
The data are diamond prices (in Singapore dollars) and diamond weights in carats (the standard measure of diamond mass; 1 carat = 0.2 g). To get the data, use library(UsingR); data(diamond).
Plotting the fitted regression line and data
library(UsingR)
data(diamond)
plot(diamond$carat, diamond$price,
xlab = "Mass (carats)",
ylab = "Price (SIN $)",
bg = "lightblue",
col = "black", cex = 1.1, pch = 21,frame = FALSE)
abline(lm(price ~ carat, data = diamond), lwd = 2)
Fitting the linear regression model
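The coefficients printed below come from fitting the model and extracting its coefficients; a minimal sketch of that call (the object name fit is our choice, and it is reused in the predictions later in this section):

```r
# Fit price as a linear function of mass and print the coefficients
fit <- lm(price ~ carat, data = diamond)
coef(fit)
```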
(Intercept) carat
-259.6 3721.0
- We estimate an expected 3721.02 (SIN) dollar increase in price for every one carat increase in the mass of a diamond.
- The intercept -259.63 is the expected price of a 0 carat diamond, which is of no practical interest on its own.
Getting a more interpretable intercept
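The output header shows the model refit with mass centered at its mean, so that the intercept becomes the expected price at the average mass; a sketch of that refit (fit2 is our name):

```r
# Centering carat at its mean makes the intercept the price of the average diamond
fit2 <- lm(price ~ I(carat - mean(carat)), data = diamond)
coef(fit2)
```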
(Intercept) I(carat - mean(carat))
500.1 3721.0
Thus $500.1 is the expected price for an average-sized diamond in the data (0.2042 carats).
Changing scale
- A one carat increase in a diamond is pretty big; what about changing units to 1/10 of a carat?
- We can do this by simply dividing the coefficient by 10.
- We expect a 372.102 (SIN) dollar change in price for every 1/10th of a carat increase in the mass of a diamond.
- Showing that it's the same if we rescale the Xs and refit
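A sketch of that refit, matching the output header below (fit3 is our name):

```r
# Multiplying carat by 10 divides the slope by 10; the intercept is unchanged
fit3 <- lm(price ~ I(carat * 10), data = diamond)
coef(fit3)
```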
(Intercept) I(carat * 10)
-259.6 372.1
Predicting the price of a diamond
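The two outputs below are the same predictions, first computed by hand from the coefficients and then with predict. The three new masses, 0.16, 0.27 and 0.34 carats, are inferred from the printed values; a sketch:

```r
# Predict the price at three new diamond masses (in carats)
newx <- c(0.16, 0.27, 0.34)
coef(fit)[1] + coef(fit)[2] * newx                # by hand: beta0 + beta1 * newx
predict(fit, newdata = data.frame(carat = newx))  # the same numbers via predict()
```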
[1] 335.7 745.1 1005.5
1 2 3
335.7 745.1 1005.5
2.2 Residuals
Code
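The two (numerically zero) values below verify that the residuals reported by lm equal the residuals computed by hand; a sketch of that check, defining the objects (y, x, n, fit, e, yhat) that the later chunks in this section assume:

```r
library(UsingR); data(diamond)
y <- diamond$price; x <- diamond$carat; n <- length(y)
fit <- lm(y ~ x)
e <- resid(fit)       # residuals reported by lm
yhat <- predict(fit)  # fitted values
max(abs(e - (y - yhat)))                             # e equals y minus the fitted values
max(abs(e - (y - coef(fit)[1] - coef(fit)[2] * x)))  # ... and y minus (beta0 + beta1 * x)
```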
[1] 9.486e-13
[1] 9.486e-13
Residuals are the signed length of the red lines
```{r, echo = FALSE, fig.height=5, fig.width=5}
plot(diamond$carat, diamond$price,
     xlab = "Mass (carats)",
     ylab = "Price (SIN $)",
     bg = "lightblue",
     col = "black", cex = 1.1, pch = 21, frame = FALSE)
abline(fit, lwd = 2)
for (i in 1 : n)
  lines(c(x[i], x[i]), c(y[i], yhat[i]), col = "red", lwd = 2)
```
Diamond example
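Both values below are the residual standard deviation, once from summary and once by hand with n - 2 degrees of freedom; presumably computed as:

```r
# Two equivalent estimates of the residual standard deviation
summary(fit)$sigma
sqrt(sum(resid(fit)^2) / (n - 2))
```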
[1] 31.84
[1] 31.84
Some facts about $R^2$
- $R^2$ is the percentage of variation explained by the regression model.
- $0 \leq R^2 \leq 1$
- $R^2$ is the sample correlation squared (checked in the sketch after this list).
- $R^2$ can be a misleading summary of model fit.
- Deleting data can inflate $R^2$.
- (For later.) Adding terms to a regression model always increases $R^2$.
- Do example(anscombe) to see the following data.
- Basically the same mean and variance of X and Y.
- Identical correlations (hence the same $R^2$).
- Same linear regression relationship.
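As a quick check of the correlation fact on the diamond data (a sketch; fit is the simple regression from section 2.1):

```r
# For simple linear regression, R^2 equals the squared sample correlation
fit <- lm(price ~ carat, data = diamond)
summary(fit)$r.squared
cor(diamond$price, diamond$carat)^2  # identical
```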
2.3 Inference
Example diamond data set
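The first coefficient table on the next slide is computed by hand from the standard formulas for the slope, the intercept, and their standard errors. A sketch of that computation, which also defines the quantities (beta0, beta1, sigma, ssx, n) that the interval-plotting code below assumes:

```r
library(UsingR); data(diamond)
y <- diamond$price; x <- diamond$carat; n <- length(y)
beta1 <- cor(y, x) * sd(y) / sd(x)   # slope estimate
beta0 <- mean(y) - beta1 * mean(x)   # intercept estimate
e <- y - beta0 - beta1 * x           # residuals
sigma <- sqrt(sum(e^2) / (n - 2))    # residual standard deviation
ssx <- sum((x - mean(x))^2)
seBeta0 <- sqrt(1 / n + mean(x)^2 / ssx) * sigma
seBeta1 <- sigma / sqrt(ssx)
tBeta0 <- beta0 / seBeta0; tBeta1 <- beta1 / seBeta1  # t statistics
pBeta0 <- 2 * pt(abs(tBeta0), df = n - 2, lower.tail = FALSE)
pBeta1 <- 2 * pt(abs(tBeta1), df = n - 2, lower.tail = FALSE)
coefTable <- rbind(c(beta0, seBeta0, tBeta0, pBeta0),
                   c(beta1, seBeta1, tBeta1, pBeta1))
colnames(coefTable) <- c("Estimate", "Std. Error", "t value", "P(>|t|)")
rownames(coefTable) <- c("(Intercept)", "x")
```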
Example continued
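Of the two tables below, the first is the hand-computed coefTable and the second comes from lm, showing that they agree; a sketch (fit is reused by the intervals that follow):

```r
fit <- lm(y ~ x)
coefTable                  # computed by hand above
summary(fit)$coefficients  # computed by lm
```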
            Estimate Std. Error t value   P(>|t|)
(Intercept)   -259.6      17.32  -14.99 2.523e-19
x             3721.0      81.79   45.50 6.751e-40
            Estimate Std. Error t value  Pr(>|t|)
(Intercept)   -259.6      17.32  -14.99 2.523e-19
x             3721.0      81.79   45.50 6.751e-40
Getting a confidence interval
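The two intervals below are 95% t confidence intervals for the intercept and the slope (estimate plus or minus the t quantile times the standard error); a sketch:

```r
# 95% confidence intervals from the coefficient table
sumCoef <- summary(fit)$coefficients
sumCoef[1, 1] + c(-1, 1) * qt(.975, df = fit$df.residual) * sumCoef[1, 2]  # intercept
sumCoef[2, 1] + c(-1, 1) * qt(.975, df = fit$df.residual) * sumCoef[2, 2]  # slope
```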
[1] -294.5 -224.8
[1] 3556 3886
With 95% confidence, we estimate that a 0.1 carat increase in diamond size results in a 355.6 to 388.6 increase in price in (Singapore) dollars (one tenth of the slope interval above).
Plotting the prediction intervals
plot(x, y, frame = FALSE,
     xlab = "Carat", ylab = "Dollars",
     pch = 21, col = "black", bg = "lightblue", cex = 2)
abline(fit, lwd = 2)
xVals <- seq(min(x), max(x), by = .01)
yVals <- beta0 + beta1 * xVals
# Standard error of the estimated line (confidence interval)
se1 <- sigma * sqrt(1 / n + (xVals - mean(x))^2 / ssx)
# Standard error for a new observation (prediction interval); note the extra 1
se2 <- sigma * sqrt(1 + 1 / n + (xVals - mean(x))^2 / ssx)
lines(xVals, yVals + 2 * se1)
lines(xVals, yVals - 2 * se1)
lines(xVals, yVals + 2 * se2)
lines(xVals, yVals - 2 * se2)
Discussion
- Both intervals have varying widths.
- Least width at the mean of the Xs.
- We are quite confident in the regression line, so that interval is very narrow.
- If we knew $\beta_0$ and $\beta_1$ this interval would have zero width.
- The prediction interval must incorporate the variability in the data around the line.
- Even if we knew $\beta_0$ and $\beta_1$ this interval would still have width.
In R
newdata <- data.frame(x = xVals)
p1 <- predict(fit, newdata, interval = "confidence")
p2 <- predict(fit, newdata, interval = "prediction")
plot(x, y, frame = FALSE,
     xlab = "Carat", ylab = "Dollars",
     pch = 21, col = "black", bg = "lightblue", cex = 2)
abline(fit, lwd = 2)
# Columns 2 and 3 hold the lower and upper bounds of each interval
lines(xVals, p1[, 2]); lines(xVals, p1[, 3])
lines(xVals, p2[, 2]); lines(xVals, p2[, 3])
2.4 Multivariable regression
Multivariable regression analyses
- If I were to present evidence of a relationship between breath mint usage (mints per day, X) and pulmonary function (measured in FEV), you would be skeptical.
- Likely, you would say, 'Smokers tend to use more breath mints than non-smokers, and smoking is related to a loss in pulmonary function. That's probably the culprit.'
- If asked what would convince you, you would likely say, 'If non-smoking breath mint users had lower lung function than non-smoking non-breath mint users and, similarly, if smoking breath mint users had lower lung function than smoking non-breath mint users, I'd be more inclined to believe you'.
- In other words, to even consider my results, I would have to demonstrate that they hold while holding smoking status fixed.
Multivariable regression analyses
- An insurance company is interested in how last year's claims can predict a person's time in the hospital this year.
- They want to use an enormous amount of data contained in claims to predict a single number. Simple linear regression is not equipped to handle more than one predictor.
- How can one generalize SLR to incorporate lots of regressors for the purpose of prediction?
- What are the consequences of adding lots of regressors?
- Surely there must be consequences to throwing variables in that aren't related to Y?
- Surely there must be consequences to omitting variables that are?
Demonstration that it works using an example
Linear model with two variables and an intercept
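The paired outputs below come from a simulation: the coefficient of x in the full fit equals the regression-through-the-origin slope of the residuals of y and of x after each has had the other regressors removed. The data are random and the seed is not shown, so the exact numbers will not reproduce; a sketch consistent with their shape (the [-1] drops the intercept to match the printed coefficients):

```r
# Simulate a response built from three regressors plus a little noise
n <- 100; x <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y <- 1 + x + x2 + x3 + rnorm(n, sd = .1)
# Remove the effect of x2 and x3 from both y and x
ey <- resid(lm(y ~ x2 + x3))
ex <- resid(lm(x ~ x2 + x3))
sum(ey * ex) / sum(ex^2)       # regression through the origin of ey on ex
coef(lm(y ~ x + x2 + x3))[-1]  # its coefficient of x is the same number
```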
[1] 1.004
x x2 x3
1.0040 0.9899 1.0078
Showing that order doesn't matter
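Residualizing against x3 first and then x2 leaves the answer unchanged, since the residuals depend only on the set of regressors removed, not their order; a sketch:

```r
# The same adjustment with the order of x2 and x3 swapped
ey <- resid(lm(y ~ x3 + x2))
ex <- resid(lm(x ~ x3 + x2))
sum(ey * ex) / sum(ex^2)
coef(lm(y ~ x + x2 + x3))[-1]
```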
[1] 1.004
x x2 x3
1.0040 0.9899 1.0078
Residuals again
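The same slope can also be obtained by explicitly regressing the residuals through the origin with lm; a sketch (unname strips the coefficient's label so it prints like the ratio above):

```r
# Regression through the origin of the residualized y on the residualized x
ey <- resid(lm(y ~ x2 + x3))
ex <- resid(lm(x ~ x2 + x3))
unname(coef(lm(ey ~ ex - 1)))
coef(lm(y ~ x + x2 + x3))[-1]
```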
[1] 1.004
x x2 x3
1.0040 0.9899 1.0078
Linear models
- Linear models are the single most important applied statistical and machine learning technique, by far.
- Some amazing things that you can accomplish with linear models:
- Decompose a signal into its harmonics.
- Flexibly fit complicated functions.
- Fit factor variables as predictors.
- Uncover complex multivariate relationships with the response.
- Build accurate prediction models.