**R Methods for Factor Response Variable**
library(pacman)
p_load(dplyr, magrittr, ggplot2, car, bestglm, DiscriMiner, pROC, caret) # packages used in this note
There are multiple methods that we can use to deal with a binary (factor) response variable.
#Logistic Regression
Case Study: Framingham Heart Study fram_data:
- Identify Coronary Heart Disease risk factors. Famous study: http://www.framinghamheartstudy.org/about-fhs/index.php
i) Data: 1,406 health professionals. Conditions gathered at the beginning of the study (early 50s). Both the original subjects and their next generations have been included in the study.
Variables:
- Heart Disease (the response)
- AGE
- SEX
- Blood pressures, expressed as SBP and DBP
- CHOL (cholesterol level)
- FRW (age- and gender-adjusted weight)
- CIG (number of cigarettes smoked each week, self-reported)
ii) Goal:
- Identify risk factors
- Predict the probability of one with
Heart Disease
iii) In particular, predict Prob(HD=1) for the new subject in the last row of the data (saved below as fram_data.new).
iv) HD=1: with Heart Disease; HD=0: no Heart Disease.
1. A quick EDA
Read Framingham.dat
as fram_data
fram_data <- read.csv("Framingham.dat", sep=",", header=T, as.is=T)
str(fram_data)
names(fram_data)
summary(fram_data)
Rename and set variables to their correct type
# names(fram_data)[1] <- "HD"
# fram_data$HD <- as.factor(fram_data$HD)
# fram_data$SEX <- as.factor(fram_data$SEX)
fram_data %<>% # make the following change to the data frame
rename(HD = Heart.Disease.) %>%
mutate(HD = as.factor(HD),
SEX = as.factor(SEX))
Structure of the data
str(fram_data)
Last row, which will be used for prediction. Notice that the response HD
is missing here.
tail(fram_data, 1)
We take out the female whose HD will be predicted, and save it to fram_data.new
fram_data.new <- fram_data[1407,]
fram_data <- fram_data[-1407,]
We have 1,095 observations with HD = 0 and 311 with HD = 1
summary(fram_data)
Notice missing values in FRW and CIG. We won’t take those observations off the data set.
sum(is.na(fram_data))
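A quick extra check (a small sketch using base R only) to see which variables the missing values come from; per the note above, they should be in FRW and CIG.
colSums(is.na(fram_data)) # number of NA's per variable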
For simplicity we start with a simple question:
How does HD relate to SBP?
tapply(fram_data$SBP, fram_data$HD, mean)
On average SBP seems to be higher among HD = 1
We can see the distribution of SBP
through back to back box plots. Again, SBP seems to be higher when HD = 1
plot(fram_data$HD, fram_data$SBP, ylab ="SBP", xlab = "HD")
Another way of producing the boxplots
boxplot(fram_data$SBP~fram_data$HD, ylab ="SBP", xlab = "HD")
Next we explore how proportions of HD = 1 relate to SBP
Standard plots:
plot(fram_data$SBP,
fram_data$HD,
col=fram_data$HD,
xlab = "SBP",
ylab = "HD")
legend("topright",
legend=c("0", "1"),
lty=c(1,1),
lwd=c(2,2),
col=unique(fram_data$HD))
Problems: many observations are over-plotted so they are not visible.
Solutions: use jitter() to spread out the observations with similar SBP values.
plot(jitter(as.numeric(fram_data$HD), factor=.5) ~ fram_data$SBP,
pch=4,
col=fram_data$HD,
ylab="HD",
xlab="SBP")
legend("topright", legend=c("0", "1"),
lty=c(1,1), lwd=c(2,2), col=unique(fram_data$HD))
It’s still hard to tell the proportion of each category given SBP.
Alternatively we could use stripchart() as follows:
stripchart(fram_data$SBP~fram_data$HD, method="jitter",
col=c("black", "red"), pch=4,
ylab="HD", xlab="SBP")
legend("topright", legend=c("0", "1"),
lty=c(1,1), lwd=c(2,2), col=unique(fram_data$HD))
All the plots above do not show the proportion of “1”'s vs. “0”'s as a function of SBP.
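One way to approximate such a display (a sketch not taken from the original notes; the bin width of 20 is an arbitrary illustrative choice) is to bin SBP and plot the observed proportion of HD = 1 within each bin:
sbp.bin <- cut(fram_data$SBP, breaks = seq(80, 300, by = 20), include.lowest = TRUE)
prop.hd1 <- tapply(fram_data$HD == "1", sbp.bin, mean) # proportion of HD = 1 per SBP bin
plot(prop.hd1, type = "b", xaxt = "n", ylim = c(0, 1),
     xlab = "SBP bin", ylab = "Proportion with HD = 1")
axis(1, at = seq_along(prop.hd1), labels = names(prop.hd1), las = 2, cex.axis = .7)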
2. Logistic Regression of HD vs. SBP
Logit model
Analogous to the linear model in regular regression, we model
$$P(Y=1 \vert SBP) = \frac{e^{\beta_0 + \beta_1 SBP}}{1+e^{\beta_0+\beta_1 SBP}}$$
where $\beta_0$ and $\beta_1$ are unknown parameters.
Immediately we have
i) $P(Y=0 \vert SBP) = 1-P(Y=1 \vert SBP) = \frac{1}{1+e^{\beta_0+\beta_1 SBP}}$

ii) $\log\left(\frac{P(Y=1 \vert SBP)}{P(Y=0 \vert SBP)}\right)=\beta_0+\beta_1 \times SBP$
The function in ii) is also termed the logit: logit(P(Y=1|SBP)).
iii) The interpretation of $\beta_1$: it is the change in the log odds (the log odds ratio) for a unit change in SBP.

iv) Prob(Y=1|SBP) is a monotone function of SBP.
Maximum likelihood methods
To estimate the unknown parameters $\beta_0$ and $\beta_1$ we introduce the Likelihood Function of $\beta_0$ and $\beta_1$ given the data, namely the probability of seeing the actual outcome in the data:
Take a look at a piece of the data, randomly chosen from our dataset. We can then see part of the likelihood function.
set.seed(2)
fram_data[sample(1:1406, 10), c("HD", "SBP")]
We use this to illustrate the notion of likelihood function.
$$\begin{aligned}
\mathcal{Lik}(\beta_0, \beta_1 \vert \text{Data}) &= \text{Prob(the outcome of the data)}\\
&= Prob\big((Y=0 \vert SBP=130), (Y=0 \vert SBP=140), \ldots, (Y=1 \vert SBP=130), \ldots\big) \\
&= Prob(Y=0 \vert SBP=130) \times Prob(Y=0 \vert SBP=140) \times \cdots \times Prob(Y=1 \vert SBP=130) \times \cdots \\
&= \frac{1}{1+e^{\beta_0 + 130 \beta_1}}\cdot\frac{1}{1+e^{\beta_0 + 140\beta_1}}\cdots\frac{e^{\beta_0 + 130 \beta_1}}{1 + e^{\beta_0 + 130 \beta_1}} \cdots
\end{aligned}$$
MLE: the maximizers $\hat\beta_0$ and $\hat\beta_1$ will be used to estimate the unknown parameters; they are termed the Maximum Likelihood Estimators:
$$\max_{\beta_0, \beta_1} \mathcal{Lik}(\beta_0, \beta_1 \vert Data)$$
Remark:
- $\hat\beta_0$ and $\hat\beta_1$ are obtained through $\max \log(\mathcal{Lik}(\beta_0, \beta_1 \vert D))$.
- The MLE can only be obtained through numerical calculations.
- glm() will be used to do logistic regression. It is very similar to lm(), but some of the output is different.

The default link in glm() with family = binomial is the logit link.
fit1 <- glm(HD~SBP, fram_data, family=binomial(logit))
summary(fit1, results=TRUE)
To see the prob function estimated by glm():
- logit = -3.66 + 0.0159 SBP
- $P(HD = 1 \vert SBP) = \frac{e^{-3.66+0.0159 \times SBP}}{1+e^{-3.66+0.0159 \times SBP}}$
- $P(HD = 0 \vert SBP) = \frac{1}{1+e^{-3.66+0.0159 \times SBP}}$
- Notice the prob of Y=1 is an increasing function of SBP, since $\hat\beta_1 = 0.0159 > 0$.
- Unfortunately we do not have a nice interpretation of $\beta_1$ in terms of Prob(Y=1) itself.
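As a side note (a minimal sketch, not part of the original notes), the slope can be reported as an odds ratio: $e^{\hat\beta_1}$ is the multiplicative change in the odds of HD = 1 per unit increase in SBP.
exp(coef(fit1)["SBP"]) # odds ratio per 1-unit increase in SBP
exp(10 * coef(fit1)["SBP"]) # odds ratio per 10-unit increase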
par(mfrow=c(1,1))
plot(fram_data$SBP, fit1$fitted.values, pch=16,
xlab="SBP",
ylab="Prob of P(Y=1|SBP)")
Alternatively, we can plot the prob through $\frac{e^{-3.66+0.0159 \times SBP}}{1+e^{-3.66+0.0159 \times SBP}}$:
x <- seq(100, 300, by=1)
y <- exp(-3.66+0.0159*x)/(1+exp(-3.66+0.0159*x))
plot(x, y, pch=16, type = "l",
xlab = "SBP",
ylab = "Prob of P(Y=1|SBP)" )
Inference for the Coefficients
i. Wald intervals/tests (through the MLE’s)
Facts about the MLE:
- The MLE's are approximately normal and approximately unbiased estimators of the $\beta$'s.
- The standard errors are obtained through the information matrix.
- The $z$ intervals and $z$ tests are valid for each $\beta_i$.
summary(fit1)
Usual z-intervals for the coefficients (output from the summary):
confint.default(fit1)
ii. Likelihood Ratio Tests
Similar to the F-tests in OLS (Ordinary Least Squares), we have chi-square tests to test whether a collective set of variables is needed.
Here:
$$H_0: \beta_{SBP}=0 \quad \text{vs.} \quad H_1: \beta_{SBP} \neq 0$$
Facts about the MLE: Likelihood Ratio Tests
- Under $H_0$, the (log) likelihood ratio statistic
$$\begin{aligned}
\text{Testing stat} = \chi^2 &= -2 \log \frac{\max_{H_0} \mathcal{Lik}(\beta_0, \beta_1 \vert D)}{\max_{H_1} \mathcal{Lik}(\beta_0, \beta_1 \vert D)}\\
&= -2\log(\mathcal{Lik}_{H_0}) - \left(-2\log(\mathcal{Lik}_{H_1})\right)\\
&\sim \chi^2_{df=1}
\end{aligned}$$
- Testing stat = Null Deviance - Residual Deviance, where Deviance = $-2\log(\mathcal{L})$.
- The p-value is computed from the $\chi^2$ distribution: $p_{value}=P(\chi^2_{df} > \chi^2)$.
- glm() outputs the two deviances.
summary(fit1)
Notice:
- Null deviance = 1485.9
- Residual deviance = 1432.8
- $\chi^2 = 1485.9 - 1432.8 = 53.1$

Get the $\chi^2$ statistic:
chi.sq <- 1485.9-1432.8
$p_{value}$ from the likelihood ratio test:
pchisq(chi.sq, 1, lower.tail=FALSE)
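Equivalently (a small sketch using the components stored in the glm object), the same test can be computed without typing the deviances by hand:
chi.sq.fit1 <- fit1$null.deviance - fit1$deviance # 53.1
df.fit1 <- fit1$df.null - fit1$df.residual # 1
pchisq(chi.sq.fit1, df.fit1, lower.tail = FALSE)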
Or use anova()/Anova() from the car package to get the $\chi^2$ tests. Here the null hypothesis is that SBP is not needed, i.e., all the coefficients (but the intercept) are 0.
anova(fit1, test="Chisq") #
Similar to the F-test in lm() set up.
Anova(fit1)
Inverting the likelihood ratio tests we can get confidence intervals for the coefficients. These should be similar to the Wald (z) intervals, but they are not identical.
confint(fit1)
iii. Prediction
To do a prediction based on fit1, we plug the SBP value of the new subject into the prob equation.
fit1.predict <- predict(fit1, fram_data.new, type="response")
fit1.predict
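As a brief illustration of what type = "response" does (a sketch, not from the original notes): type = "link" returns the estimated logit, and applying the inverse-logit by hand reproduces the probability.
eta <- predict(fit1, fram_data.new, type = "link") # estimated logit for the new subject
exp(eta) / (1 + exp(eta)) # should match fit1.predict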
3. Multiple Logistic Regression
To avoid problems, we will delete all cases with missing values
fram_data.f <- na.omit(fram_data)
a. With all the predictors
i. Full model
$$logit = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p$$

$$P(HD = 1) = \frac{e^{\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p}}{1 + e^{\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p}}$$
- MLE's are obtained by maximizing the likelihood function.
- Wald tests/intervals hold for each coefficient.
- Likelihood ratio tests hold for a set of predictors.
With all the predictors
fit2 <- glm(HD~., fram_data.f, family=binomial)
summary(fit2)
- The prob of Y=1 given all the factors is estimated as
$$P(HD = 1) = \frac{e^{-9.33480 + 0.06249 \times AGE + 0.90610\, I_{Male} + \dots + 0.01231 \times CIG}}{1 + e^{-9.33480 + 0.06249 \times AGE + 0.90610\, I_{Male} + \dots + 0.01231 \times CIG}}$$
- Are all the risk factors useless, i.e., are all the $\beta$'s (other than the intercept) 0?
chi.sq <- 1469-1343
chi.sq #126
pvalue <- pchisq(chi.sq, 7, lower.tail=FALSE)
pvalue
We reject the null hypothesis at any reasonably small $\alpha$.

Alternatively we may use anova() to get the $\chi^2$ test by fitting the reduced model vs. the full model:
fit0 <- glm(HD~1, fram_data.f, family=binomial) # null
This works only if the two fits use the same observations.
anova(fit0, fit2, test="Chisq") #
ii. Are CIG and FRW not needed?
Use Likelihood Ratio test to see if CIG and FRW are not needed in fit2
fit3 <- update(fit2, .~. -CIG -FRW)
summary(fit3)
We first calculate the Chi-square stat directly:
chi.sq.2 <- fit3$deviance-fit2$deviance
chi.sq.2
pchisq(chi.sq.2, 2, lower.tail=FALSE) # 0.06194729
We fail to reject $H_0$ at $\alpha = 0.05$, but we do reject the null at $\alpha = 0.1$ since the p-value is less than 0.1.
We could use anova…
anova(fit3, fit2, test="Chisq")
Once again we don’t have evidence to reject the null at 0.05.
b. Model selection
i. Backward elimination
Backward selection is the easiest (without any packages) if p is not too large. As in lm, we eliminate the variable with the highest p-value.
summary(fit2)
fit2.1 <- update(fit2, .~. -DBP)
Backward selection by kicking DBP (the one with largest p-value) out
summary(fit2.1)
fit2.2 <- update(fit2.1, .~. -FRW)
summary(fit2.2)
CIG didn’t really add much to decrease the residual deviance
fit2.3 <- update(fit2.2, .~. -CIG)
summary(fit2.3)
Prediction for the subject using fit2.3
fit2.3.predict <- predict(fit2.3, fram_data.new, type="response")
fit2.3.predict
Notice the difference between the predictions from fit2.3 and fit1!
predict(fit1, fram_data.new, type="response") #
ii. AIC through bestglm()
$$AIC = -2\log(\mathcal{L}) + 2d$$
where $d$ is the total number of parameters. We are looking for a model with small AIC.
All predictors
fit2 <- glm(HD~., fram_data.f, family=binomial)
summary(fit2)
In fit2, Residual deviance = $-2\log(\mathcal{L})$, so
$$AIC = -2\log(\mathcal{L}) + 2d = \text{Residual deviance} + 2 \times 8 = 1343.1 + 16 = 1359.1$$
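A quick sketch to double-check this arithmetic against R's own extractor (fit2 has 8 estimated coefficients: the intercept plus 7 predictors):
fit2$deviance + 2 * length(coef(fit2)) # AIC by hand
AIC(fit2) # built-in AIC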
Which model(s) gives us the smallest AIC? bestglm() will be used:
- the syntax is similar to that of leaps;
- we know bestglm less well than regsubsets.
Get the design matrix without 1’s and HD.
Xy <- model.matrix(HD ~.+0, fram_data.f)
Attach y as the last column.
Xy <- data.frame(Xy, fram_data.f$HD)
str(Xy)
fit.all <- bestglm(Xy, family = binomial, method = "exhaustive", IC="AIC", nvmax = 10) # method = "exhaustive", "forward" or "backward"
names(fit.all) # fit.all$Subsets to list best submodels
List the top 5 models. Any one of the following models could be used.
fit.all$BestModels
Fit the best model which is the same as
summary(fit.all$BestModel)
summary(glm(HD~AGE+SEX+SBP+CHOL+FRW+CIG, family=binomial, data=fram_data.f))
iii. More models
- Extend models to include interactions.
Are there interaction effects of SBP by SEX? Or AGE by SEX?
Anova(glm(HD ~ AGE + SBP + SEX + SBP*SEX + AGE * SEX, fram_data, family = binomial))
#Linear Discriminant Analysis
Case Study I: Riding Mowers
A riding-mower manufacturer would like to find a way of classifying families in a city into those likely to purchase a riding mower and those not likely to buy one. A pilot random sample is undertaken of 12 owners and 12 nonowners in the city.
mower <- read.csv('RidingMowers.csv')
str(mower)
names(mower)
summary(mower)
EDA
Make a scatter plot for this dataset.
We can think of a linear classification rule as a line that separates the two-dimensional region into two parts, with most of the owners in one half-plane and most nonowners in the complementary half-plane.
ggplot(mower) +
geom_point(aes(x = Income, y = Lot_Size, col = Ownership), size = 3.5) +
xlab("Income($000s)") +
ylab("Lot Size(000s sqft)")
LDA
Classification Function
We use the DiscriMiner package to do LDA; linDA() gives us the linear discriminant (classification) functions.
It outputs the fitted LDA model along with the classification error rate and confusion table.
da.reg1 <- linDA(mower[,1:2], mower[,3])
#da.reg1 <- linDA(mower[,1:2], mower[,3],prior = c(1/2,1/2))
names(da.reg1)
da.reg1$functions
To classify a family into the class of owners or nonowners, we use the classification functions to compute the family’s classification scores.
A family is classified into the class of owners if the owner function score is higher than the nonowner function score, and into nonowners if the reverse is the case.
$$\begin{aligned} \hat{c}(Nonowner \vert Income, Lot\_Size) & = -51.42+0.3294*Income+4.682*Lot\_Size\\ \hat{c}(Owner \vert Income, Lot\_Size) & = -73.16+0.4296*Income+5.467*Lot\_Size \end{aligned}$$
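As a hand-check (a sketch which assumes, as the printout above suggests, that da.reg1$functions stores one column per class with the constant term in its first row), we can reproduce the two classification scores for the first family:
f <- da.reg1$functions # constant + coefficients, one column per class (assumed layout)
x <- c(1, unlist(mower[1, 1:2])) # 1 for the constant, then Income and Lot_Size
scores.by.hand <- as.numeric(t(f) %*% x)
names(scores.by.hand) <- colnames(f)
scores.by.hand # compare with da.reg1$scores[1, ]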
An alternative way for classifying a record into one of the classes is to compute the probability of belonging to each of the classes and assigning the record to the most likely class.
$$\begin{aligned} P(Nonowner \vert Income, Lot\_Size)&=\frac{e^{\hat{c}(Nonowner \vert Income, Lot\_Size)}} {e^{\hat{c}(Owner \vert Income, Lot\_Size)}+e^{\hat{c}(Nonowner \vert Income, Lot\_Size)}}\\ P(Owner \vert Income, Lot\_Size)&=\frac{e^{\hat{c}(Owner \vert Income, Lot\_Size)}} {e^{\hat{c}(Owner \vert Income, Lot\_Size)}+e^{\hat{c}(Nonowner \vert Income, Lot\_Size)}} \end{aligned}$$
propensity.owner <- exp(da.reg1$scores[,2])/(exp(da.reg1$scores[,1])+exp(da.reg1$scores[,2]))
output1 <- data.frame(Actual=mower$Ownership,
Pred=da.reg1$classification,
da.reg1$scores,
propensity.owner=propensity.owner)
output1
Classification Boundary
We could also calculate the classification boundary by setting the two classification scores equal:
$$\begin{aligned} \hat{c}(Nonowner \vert Income, Lot\_Size)&=\hat{c}(Owner \vert Income, Lot\_Size)\\ -51.42+0.3294*Income+4.682*Lot\_Size&= -73.16+0.4296*Income+5.467*Lot\_Size \\ 0.1002*Income+0.785*Lot\_Size &= 21.74\\ Lot\_Size &= 27.69 - 0.1276*Income \end{aligned}$$
ggplot(mower) +
geom_point(aes(x = Income, y = Lot_Size, col = Ownership), size = 3.5) +
geom_abline(intercept = 27.69, slope = -0.1276, color = 'orange', linetype= 'dashed', size = 1.2) +
xlab("Income($000s)") +
ylab("Lot Size(000s sqft)")
Prediction
We use classify() to do a prediction. For instance, the first household has an income of $60K and a lot size of 18.4K $ft^2$.
newmower <- mower[1,]
newmower[1] <- 60
newmower[2] <- 18.4
newmower[3] <- NA # the response (Ownership) is unknown for the new record
newmower
pred1 <- classify(da.reg1,as.vector(newmower[1:2]))
pred1
$$\begin{aligned} \hat{c}(Nonowner \vert Income, Lot\_Size) & = -51.42+0.3294*60+4.682*18.4=54.48\\ \hat{c}(Owner \vert Income, Lot\_Size) & = -73.16+0.4296*60+5.467*18.4=53.20 \end{aligned}$$
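A small follow-up sketch (not in the original notes): turning these two scores into the ownership propensity, mirroring the exp(score)/sum(exp(scores)) calculation used for propensity.owner above.
score.nonowner <- -51.42 + 0.3294*60 + 4.682*18.4 # about 54.48
score.owner <- -73.16 + 0.4296*60 + 5.467*18.4 # about 53.20
exp(score.owner) / (exp(score.owner) + exp(score.nonowner)) # below 0.5, so classified as Nonowner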
Below we discuss some concepts related to classification.
a. Sensitivity
$$Prob(\hat Y = 1 \vert Y=1)$$
Not an error. This is also called the True Positive Rate: the proportion of correct positive classifications given that the true status is positive.
b. Specificity
$$Prob(\hat Y = 0 \vert Y=0)$$
Specificity: the proportion of correct negative classifications given that the true status is negative.
c. False Positive Rate
$$1 - Specificity = P(\hat Y=1 \vert Y=0)$$
False Positive Rate: the proportion of wrong (positive) classifications given that the true status is negative.
d. Misclassification error
The mean number of misclassifications:
$$MCE= \frac{1}{n} \sum_i \mathbf{1}\{\hat y_i \neq y_i\}$$
We can get all these quantities through confusion matrix or directly find the misclassification errors.
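For instance, a one-line sketch of the MCE computed directly from the LDA output:
mean(da.reg1$classification != mower$Ownership) # proportion of mismatched labels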
f. Confusion Matrix
We could use table() to create a 2-by-2 table which summarizes the number of mis/agreed labels.
table(da.reg1$classification,mower$Ownership)
We could use $confusion
to get the confusion matrix.
da.reg1$confusion
We could also use confusionMatrix
function from caret
packages to summarize the number of mis/agreed labels.
confusionMatrix(da.reg1$classification,mower$Ownership)
g. The ROC Curve and AUC
For each model or process, given a threshold, or a classifier, there will be a pair of sensitivity and specificity.
By changing the threshold, we graph all the pairs of False Positive as x-axis and True Positive as y-axis to have a curve: the ROC curve.
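For instance, here is a sketch of one such pair (assuming the Ownership labels are exactly "Owner" and "Nonowner", as in the classification functions above), using an arbitrary threshold of 0.5 on the ownership propensity:
pred.owner <- factor(ifelse(output1$propensity.owner > 0.5, "Owner", "Nonowner"),
                     levels = c("Nonowner", "Owner"))
sens <- mean(pred.owner[mower$Ownership == "Owner"] == "Owner") # sensitivity at threshold 0.5
spec <- mean(pred.owner[mower$Ownership == "Nonowner"] == "Nonowner") # specificity at threshold 0.5
c(sensitivity = sens, specificity = spec)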
We use the roc
function from the package pROC
.
Notice that the ROC curve produced by roc() plots Sensitivity vs. Specificity; most ROC curves are drawn with the False Positive rate on the x-axis.
fit.roc1 <- roc(mower$Ownership, output1$propensity.owner, plot = T, col = 'blue')
fit.roc1$auc
#auc(fit.roc1)
False Positive vs Sensitivity curve is plotted in most ROC curves:
plot(1-fit.roc1$specificities, fit.roc1$sensitivities, col="red", pch=16,
xlab="False Positive",
ylab="Sensitivity")
We can get more from fit.roc1. For example, a curve shows the probability thresholds used and the corresponding False Positive rate.
plot(fit.roc1$thresholds, 1-fit.roc1$specificities, col="green", pch=16,
xlab="Threshold on prob",
ylab="False Positive",
main = "Thresholds vs. False Postive")
h. Positive Prediction
Positive Prediction is a measure of the accuracy given the predictions.
Positive Prediction = $P(\text{Positive} \mid \text{Classified as Positive})$
For da.reg1, recall the confusion matrix being
cm.1 <- table(da.reg1$classification, mower$Ownership)
cm.1
positive.pred <- cm.1[1, 1] / (cm.1[1, 1] + cm.1[1, 2])
positive.pred
i. Negative Prediction
Negative Prediction = $P(\text{Negative} \mid \text{Classified as Negative})$
negative.pred <- cm.1[2, 2] / (cm.1[2, 1] + cm.1[2, 2])
negative.pred
#Principal Component Analysis
Case study: Measurement of Intelligence
Goal: How do people differ in intelligence?
- Our data set IQ.csv is a subset of individuals from the 1979 National Longitudinal Study of Youth (NLSY79) survey who were re-interviewed in 2006.
- Information about family, personal demographics such as gender, race and education level, plus a set of ASVAB (Armed Services Vocational Aptitude Battery) test scores.
- It is STILL used as a screening test for those who want to join the US army!
The test has the following components:
- Science, Arith (Arithmetic reasoning), Word (Word knowledge), Parag (Paragraph comprehension), Numer (Numerical operation), Coding (Coding speed), Auto (Automotive and Shop information), Math (Math knowledge), Mechanic (Mechanic Comprehension) and Elec (Electronic information).
- Lastly AFQT (Armed Forces Qualifying Test) is a combination of Word, Parag, Math and Arith.
- Note: Service Branch requirement: Army 31, Navy 35, Marines 31, Air Force 36, and Coast Guard 45, (out of 100 which is the max!) My prediction is that all of us pass the requirements, even for Coast Guard:)
Our goal is to see how we can summarize the set of tests and grab main information about each one’s intelligence efficiently.
Note: One of the original study goals is to see how intelligence relates to one’s future successes measured by income in 2005.
0) Get a quick look at the data
data1 <- read.csv("IQ.Full.csv")
dim(data1)
names(data1)
summary(data1)
1) We first concentrate on the AFQT tests: Word, Math, Parag and Arith
For simplicity we take a subset of 50 subjects.
set.seed(1)
data.AFQT <- data1[sample(nrow(data1), 50, replace=FALSE), c("Word","Parag", "Math", "Arith")]
str(data.AFQT)
summary(data.AFQT)
We expect the four test scores to be correlated with each other.
pairs(data.AFQT, pch=16, col="blue") # shows pairwise relationship between two sets of scores
rownames(data.AFQT) # label for each person
rownames(data.AFQT) <- paste("p", seq(1:nrow(data.AFQT)), sep="") # reassign everyone's labels to be shorter.
rownames(data.AFQT)
2) Scaling and centering variables
sapply(data.AFQT, sd) # sd's for each test
sapply(data.AFQT, mean) # means for each test
colMeans((data.AFQT))
# What are the variances and correlations?
var(data.AFQT)
cor(data.AFQT)
Each test score has its own variance, and in general the variables of interest may be measured in different units. We can center each test to mean 0 and scale it to sd 1 by subtracting its mean and dividing by its sd. After that, the covariance matrix and the correlation matrix are the same.
data.AFQT.scale <- scale(data.AFQT, center=TRUE, scale=TRUE) #default
is.matrix(data.AFQT.scale) # it turns a data frame to a matrix
Now the mean of each test will be 0, and the var matrix will be same as the cor matrix.
data.AFQT.scale <- as.data.frame(data.AFQT.scale)
colMeans(data.AFQT.scale) # all zeros
sapply(data.AFQT.scale, sd) # all 1's
# Now the var matrix and cor matrix are the same after scaled.
var(data.AFQT.scale)
cor(data.AFQT.scale)
pairs(data.AFQT.scale, pch=16, col="blue")
3) Principal Components: dimension reduction
i) For simplicity let’s first look at one pair of the scores
# Parag and Word
par(mfrow=c(1,1))
plot(data.AFQT$Parag, data.AFQT$Word, pch=16,
xlab="Parag",
ylab="Word",
xlim=c(0, 40),
ylim=c(0, 40))
Questions:
i) How can we use one score which combines both Parag and Word linearly, such that it will give us the largest variance?
ii) Can we find a line which is closest to all the points?
This is equivalent to finding $\phi_{11}$ and $\phi_{21}$ such that
$$Var(Z_1)= Var(\phi_{11} X_1 + \phi_{21} X_2)$$
is maximized subject to the constraint $\phi_{11}^2 + \phi_{21}^2 = 1$.
Here X1 and X2 are centered and scaled Parag and Word scores.
$$X_1=(Parag - mean(Parag))/sd(Parag)$$
$$X_2=(Word - mean(Word))/sd(Word)$$
attach(data.AFQT)
mean(Parag) #11.16
sd(Parag) # 3.19
mean(Word) #26.64
sd(Word) #7.25
X1=(Parag - mean(Parag))/sd(Parag)
X2=(Word - mean(Word))/sd(Word)
# Same as
scale(data.AFQT[, c("Parag", "Word") ],
center=TRUE, scale=TRUE) # By default center=TRUE and scale=TRUE
plot(X1, X2, pch=16,
xlab="Parag centered and scaled",
ylab="Word centered and scaled" )
abline(h=0, v=0, lwd=4) # better make the x y lim to be the same
plot(X1, X2, pch=16, xlim=c(-3, 3), ylim=c(-3, 3),
xlab="Parag centered and scaled",
ylab="Word centered and scaled" )
abline(h=0, v=0, lwd=4)
par(mfrow=c(1,1))
plot(X1, X2, pch=16, xlim=c(-3, 3), ylim=c(-3, 3),
xlab="Parag centered and scaled",
ylab="Word centered and scaled" )
abline(0, 1, lwd=4, col="red")
abline(0, -1, lwd=4, col="blue")
ii) Terminology
$Z_1$: the first principal component (PC1).
$(\phi_{11}, \phi_{21})$: the loadings.
The values of $Z_1$, one for each person, are called the PC scores.
# Here we go: PCA
pc.parag.word <- prcomp(data.AFQT[, c("Parag", "Word")], scale=TRUE)
names(pc.parag.word)
pc.parag.word$rotation # Loadings = phi's
phi_11 <- 0.7071068
phi_21 <- 0.7071068
pc.parag.word$x # PC scores: the first column = Z1
Z1 <- phi_11 * X1 + phi_21 * X2 # PC scores. The two scores should be the same possibly by a different sign
Z1.1 <- Z1
max(abs(pc.parag.word$x[, 1]-Z1.1)) # Z1 and pc.parag.word$x[, 1] are the same.
pc.parag.word$sdev # sd(Z1)
pc.parag.word$center # means of the original scores
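A brute-force sketch (not from the original notes) of the defining property of PC1: any unit loading vector can be written as (cos(theta), sin(theta)), and the variance of the corresponding projection should peak at the prcomp() loadings (up to sign).
theta <- seq(0, pi, length.out = 1000) # all directions, up to sign
proj.var <- sapply(theta, function(t) var(cos(t)*X1 + sin(t)*X2))
best <- theta[which.max(proj.var)]
c(cos(best), sin(best)) # close to pc.parag.word$rotation[, 1] (up to sign)
max(proj.var) # close to pc.parag.word$sdev[1]^2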
iii) Second Principal Component
Similar to the first principal component, we are now looking for $\phi_{12}, \phi_{22}$ such that
$$Z_2=\phi_{12} X_1 + \phi_{22} X_2$$
where
$$Var(Z_2) = Var(\phi_{12} X_1 + \phi_{22} X_2)$$
is maximized subject to the constraints $\phi_{12}^2+\phi_{22}^2=1$ and $Z_2$ being orthogonal to $Z_1$.
$Z_2$: the second principal component (PC2).
# The loadings
pc.parag.word$rotation # Loadings
Z2 <- pc.parag.word$x[,2]
# Let us verify
plot(Z1, Z2, pch=16,
xlab="First Principal Component Z1",
ylab="Second Principal Component Z2",
xlim=c(-4, 4),
ylim=c(-4, 4))
abline(v=0, h=0, lwd=3, col="red")
# The sd's
pc.parag.word$sdev # sd's of pc scores
- The principal component plot is a rotation of the original (X1, X2) plot.
- $var(Z_1)=1.74986 > var(Z_2)=0.2501404$
- $Z_1$ and $Z_2$ are orthogonal.
cor(Z1, Z2) # is 0
Principal Component of the four tests: Word, Math, Parag and Arith
Question:
- How can we best capture the performance based on the four tests?
- Can we come up with some sensible scores combining the four tests together?
PCs:
Looking for a linear transformation of X1=Word, X2=Parag, X3=Math, and X4=Arith to have the max variance.
The first principal component is
$$Z_1=\phi_{11} X_1 + \phi_{21} X_2 + \phi_{31} X_3 + \phi_{41} X_4$$
such that $Var(Z_1)$ is maximized subject to $\sum_i \phi_{i1}^2 = 1$.
The second principal component is
$$Z_2=\phi_{12} X_1 + \phi_{22} X_2 + \phi_{32} X_3 + \phi_{42} X_4$$
such that $Var(Z_2)$ is maximized subject to $\sum_i \phi_{i2}^2=1$, with $Z_2$ and $Z_1$ uncorrelated (orthogonal). This is the same as requiring the loading vectors $\{\phi_{i1}\}$ and $\{\phi_{i2}\}$ to be orthogonal.
We keep going to obtain Z3, and Z4
Scale and center each score
To find sensible PC's, we recommend to
- center each variable by subtracting its mean;
- scale each variable by dividing by its sd.
Both can be achieved simultaneously by scale(), and prcomp() has an option to scale or not.
Ready to get PC’s for the data set
names(data.AFQT) # four tests
summary(data.AFQT) # unscaled yet
data.AFQT.scale <- scale(data.AFQT, center=TRUE, scale = TRUE)
summary(data.AFQT.scale)
data.AFQT.scale <- as.data.frame(data.AFQT.scale) # set it back as a data frame
names(data.AFQT.scale)
attach(data.AFQT.scale)
- Use prcomp()
We input the original variables. BUT set scale=TRUE
pc.4 <- prcomp(data.AFQT, scale=TRUE) # by default, center=TRUE but scale=FALSE!
names(pc.4)
summary(pc.4)
var(pc.4$x[, 1])
# Loadings (directions)
round(pc.4$rotation, 5)
# PC1 scores:
phi1 <- pc.4$rotation[,1] # PC1 loadings, unique up to the sign
Z1.1 <- phi1[1]*Word+phi1[2]*Parag+phi1[3]*Math+phi1[4]*Arith
# Essentially it says that we should take the sum of the four test scores (after scaled)
# This is same as
Z1 <- pc.4$x[, 1]
max(abs(Z1.1-Z1)) # To convince you that two principal scores are the same.
# PC2 scores
Z2 <- pc.4$x[,2]
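The same hand-check can be done for PC2 (a short sketch):
phi2 <- pc.4$rotation[, 2] # PC2 loadings
Z2.1 <- phi2[1]*Word + phi2[2]*Parag + phi2[3]*Math + phi2[4]*Arith
max(abs(Z2.1 - Z2)) # essentially 0: the two agree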
Interpretations of the pc scores:
plot(pc.4$x[, 1], pc.4$x[,2 ], pch=16,
xlim=c(-4, 4),
ylim=c(-4, 4),
main="The leading two principal components",
xlab="Z1=PC1",
ylab="Z2=PC2"
)
abline(h=0, v=0)
Z1: (weighted) total score of the 4 tests.
Z2: the difference between the verbal scores (Word + Parag) and the math scores (Math + Arith).
Remark:
i) Two scores Z1 and Z2 capture the main features of the four tests
ii) How much information we might have lost by using only the two PC scores?
NOT MUCH…
Properties of the pc scores:
i) var(Z1) > var(Z2) …
sd(Z1)
sd(Z2) # they are reported in
pc.4$sdev # standard dev's of all PC's in a decreasing order
ii) All 4 PC scores are uncorrelated.
cor(pc.4$x) # cor's are 0
pairs(pc.4$x, xlim=c(-4, 4), ylim=c(-4, 4), col=rainbow(6), pch=16)
They are all pairwise uncorrelated!
iii) Proportion of variance explained (PVE)
pve.4 <- 100* (pc.4$sdev)^2/sum ((pc.4$sdev)^2)
plot(pve.4, pch=16,
xlab="Principal Components",
ylab="Prop. of variance explained")
The leading component explains pve.4[1]=75% of the total variance.
Cumulative proportion of variance explained keeps track of the PVE including the first 1 PC, the first 2 PC’s, and so on.
cpve.4 <- 100*cumsum((pc.4$sdev)^2)/4 # cumulative proportions of the variance explained by
# the number of pc's used
plot(seq(1:4), cpve.4, pch=16, ylim=c(0, 100),
main="Cumulative Proportion of Variance Explained",
xlab="Number of PC's Used")
names(summary(pc.4))
summary(pc.4)$importance
# Scree plot of CPVE's
plot(summary(pc.4)$importance[3, ], pch=16,
ylab="Cumulative PVE",
xlab="Number of PC's",
main="Scree Plot of PCA for AFQT")
plot(pc.4) # variances of each pc
screeplot(pc.4) # var's each pc's
All the information is stored in summary(pc.4)
iv) biplot: visualize the PC scores together with the loadings of the original variables
lim <- c(-.4, .4)
biplot(pc.4, xlim=lim,
ylim=lim,
main="Biplot of the PC's")
abline(v=0, h=0)
# x-axis: PC1=Z1 (prop)
# y-axis: PC2=Z2
# top x-axis: prop to loadings for PC1
# right y-axis: prop to loadings for PC2
Summary:
- To capture the main features of the four AFQT scores we could use two summaries:
  - the (weighted) total score;
  - the difference between math + arith and word + parag.

Exercise: Perform PCA on all 10 tests. Does Gender play any role in the test scores?
Part II: Calculation of the PC's
The PCs are nothing but the eigenvectors/eigenvalues of $Cor(X_1, \ldots, X_p)$ (if scaled) or $Cov(X_1, \ldots, X_p)$ (unscaled).
The PC loadings are the eigenvectors of the correlation matrix:
PC.eig <- eigen(cor(data.AFQT))
PC.eig
summary(PC.eig)
PC.eig$vectors # Loadings
PC.eig$values # Variances of each PC's
# We use prcomp() here
PC <- prcomp(data.AFQT, scale=TRUE)
PC # should be exactly same as PC's from eigen decomposition (up to the sign)
phi <- PC$rotation
phi
cbind(PC.eig$vectors, PC$rotation) # Putting the loadings side by side: one set from the
# eigen decomposition, the other from prcomp(). They should be exactly the same (up to sign),
# and they are.
As expected, phi is an orthogonal matrix, i.e., its columns are orthogonal to each other and each has norm 1. Also,
$$inv(\phi)=t(\phi)$$
This can be seen from
phi %*% t(phi) # matrix operator %*%. It should be an identity matrix
round(phi %*% t(phi), 2) # better seen if we round it.
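One more sketch tying the two routes together: the variances of the prcomp() PC scores should equal the eigenvalues of the correlation matrix.
cbind(eigen = PC.eig$values, prcomp = PC$sdev^2) # the two columns should match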