# R Methods for Factor Response Variable

library(pacman)
p_load(dplyr, magrittr, ggplot2, car, bestglm, DiscriMiner, caret, pROC)  # packages used throughout this section

There are multiple methods that we can use to deal with a binary response variable.

# Logistic Regression

Case Study: Framingham Heart Study (fram_data):

  1. Identify Coronary Heart Disease risk factors. Famous study: http://www.framinghamheartstudy.org/about-fhs/index.php

i) Data: 1,406 health professionals. Conditions were gathered at the beginning of the study (early 1950s). Both the original subjects and subsequent generations have been included in the study.

Variables:

Heart Disease, AGE, SEX, blood pressure expressed as SBP and DBP, CHOL (cholesterol level), FRW (age- and gender-adjusted weight), and CIG (number of cigarettes smoked each week, self-reported).

ii) Goal:

  • Identify risk factors
  • Predict the probability of one with Heart Disease

iii) In particular,

Predict Prob(HD=1) for a person who is:

| AGE | SEX    | SBP | DBP | CHOL | FRW | CIG |
|-----|--------|-----|-----|------|-----|-----|
| 45  | FEMALE | 100 | 80  | 180  | 110 | 5   |

iv) HD=1: with Heart Disease and HD=0: No Heart Disease

1. A quick EDA

Read Framingham.dat as fram_data

fram_data <- read.csv("Framingham.dat", sep=",", header=T, as.is=T)
str(fram_data) 
names(fram_data)
summary(fram_data)

Rename and set variables to their correct type

# names(fram_data)[1] <- "HD"
# fram_data$HD <- as.factor(fram_data$HD)
# fram_data$SEX <- as.factor(fram_data$SEX)
fram_data %<>%   # make the following change to the data frame
   rename(HD = Heart.Disease.) %>%
   mutate(HD = as.factor(HD), 
          SEX = as.factor(SEX)) 

Structure of the data

str(fram_data)

Last row, which will be used for prediction. Notice that the response HD is missing here.

tail(fram_data, 1)

We take out the female whose HD will be predicted, and save it to fram_data.new

fram_data.new <- fram_data[1407,]
fram_data <- fram_data[-1407,]

We have 1,095 observations with HD = 0 and 311 with HD = 1

summary(fram_data)

Notice missing values in FRW and CIG. We won’t take those observations off the data set.

sum(is.na(fram_data))
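
To see where the missing values occur, a per-column count is a quick check (a small sketch; it simply tallies the NA's in each column):

colSums(is.na(fram_data))   # NA counts by column; only FRW and CIG should be non-zero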

For simplicity we start with a simple question:
How does HD relate to SBP?

tapply(fram_data$SBP, fram_data$HD, mean) 

On average SBP seems to be higher among HD = 1

We can see the distribution of SBP through side-by-side box plots. Again, SBP seems to be higher when HD = 1

plot(fram_data$HD, fram_data$SBP, ylab ="SBP", xlab = "HD")

Another way of producing the boxplots

boxplot(fram_data$SBP~fram_data$HD, ylab ="SBP", xlab = "HD")

Next we explore how proportions of HD = 1 relate to SBP

Standard plots:

plot(fram_data$SBP,
     fram_data$HD,
     col=fram_data$HD, 
     xlab = "SBP", 
     ylab = "HD")
legend("topright",
       legend=c("0", "1"),
       lty=c(1,1),
       lwd=c(2,2),
       col=unique(fram_data$HD))

Problems: many observations are over-plotted so they are not visible.

Solutions: use jitter() to spread out the observations with similar SBP values.

plot(jitter(as.numeric(fram_data$HD), factor=.5) ~ fram_data$SBP,
     pch=4,
     col=fram_data$HD,
     ylab="HD",
     xlab="SBP")
legend("topright", legend=c("0", "1"),
       lty=c(1,1), lwd=c(2,2), col=unique(fram_data$HD))

It’s still hard to tell the proportion of each category given SBP.

Alternatively we could use stripchart() as follows:

stripchart(fram_data$SBP~fram_data$HD, method="jitter", 
           col=c("black", "red"), pch=4,
           ylab="HD", xlab="SBP") 
legend("topright", legend=c("0", "1"),
       lty=c(1,1), lwd=c(2,2), col=unique(fram_data$HD))

None of the plots above shows the proportion of “1”s vs. “0”s as a function of SBP.
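
One way to see that proportion is to bin SBP and compute the observed fraction of HD = 1 within each bin. This is a minimal sketch, not part of the original analysis; the bin width of 20 and the 80–300 range are assumptions you may need to adjust to the observed SBP range:

sbp.bins <- cut(fram_data$SBP, breaks = seq(80, 300, by = 20))
prop.hd1 <- tapply(fram_data$HD == "1", sbp.bins, mean)   # proportion of HD = 1 within each SBP bin
plot(prop.hd1, type = "b", xaxt = "n",
     xlab = "SBP bin", ylab = "Proportion of HD = 1")
axis(1, at = seq_along(prop.hd1), labels = names(prop.hd1), las = 2, cex.axis = 0.7)

The proportion of HD = 1 tends to increase with SBP, which is exactly what the logistic model below formalizes.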

2. Logistic Regression of HD vs. SBP

Logit model

Analogous to the linear model in a regular regression, we model

$$P(Y=1 \mid SBP) = \frac{e^{\beta_0 + \beta_1 SBP}}{1+e^{\beta_0+\beta_1 SBP}}$$

where $\beta_0$ and $\beta_1$ are unknown parameters.

Immediately we have

i)

$$P(Y=0 \mid SBP) = 1-P(Y=1 \mid SBP) = \frac{1}{1+e^{\beta_0+\beta_1 SBP}}$$

ii)

$$\log\left(\frac{P(Y=1 \mid SBP)}{P(Y=0 \mid SBP)}\right)=\beta_0+\beta_1 \times SBP$$

This function is also called logit(P(Y=1|SBP)).

iii) The interpretation of $\beta_1$: it is the change in the log odds for a unit increase in SBP.

iv) Prob(Y=1|SBP) is a monotone function of SBP.
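
As a quick numerical sanity check of i)–iv), we can evaluate these formulas at made-up parameter values (a small sketch; the values of b0 and b1 below are arbitrary, chosen only for illustration):

b0 <- -3; b1 <- 0.02; sbp <- c(100, 150, 200)     # arbitrary illustrative values
p1 <- exp(b0 + b1*sbp) / (1 + exp(b0 + b1*sbp))   # P(Y = 1 | SBP)
p0 <- 1 / (1 + exp(b0 + b1*sbp))                  # P(Y = 0 | SBP) = 1 - p1
log(p1 / p0)                                      # recovers b0 + b1*sbp, the logit
p1                                                # increasing in SBP since b1 > 0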

Maximum likelihood methods

To estimate the unknown parameters $\beta_0$ and $\beta_1$, we introduce the likelihood function of $\beta_0$ and $\beta_1$ given the data, namely the probability of seeing the actual outcomes in the data:

Take a look at a piece of the data, randomly chosen from our dataset. We can then see part of the likelihood function.

set.seed(2)
fram_data[sample(1:1406, 10), c("HD", "SBP")]

We use this to illustrate the notion of likelihood function.

$$\begin{split}
\mathcal{Lik}(\beta_0, \beta_1 \vert \text{Data}) &= \text{Prob(the outcome of the data)}\\
&= Prob\big((Y=0 \vert SBP=130), (Y=0 \vert SBP=140), \ldots, (Y=1 \vert SBP=130), \ldots\big) \\
&= Prob(Y=0 \vert SBP=130) \times Prob(Y=0 \vert SBP=140) \times \cdots \times Prob(Y=1 \vert SBP=130) \times \cdots \\
&= \frac{1}{1+e^{\beta_0 + 130 \beta_1}}\cdot\frac{1}{1+e^{\beta_0 + 140\beta_1}}\cdots\frac{e^{\beta_0 + 130 \beta_1}}{1 + e^{\beta_0 + 130 \beta_1}} \cdots
\end{split}$$

MLE: The maximizers $\hat\beta_0$ and $\hat\beta_1$ will be used to estimate the unknown parameters; they are termed the Maximum Likelihood Estimators:

$$\max_{\beta_0, \beta_1} \mathcal{Lik}(\beta_0, \beta_1 \vert Data)$$
Remark:

  • $\hat\beta_0$ and $\hat\beta_1$ are obtained through
    $$\max\log(\mathcal{Lik}(\beta_0, \beta_1 \vert D))$$

  • MLE can only be obtained through numerical calculations.

  • glm() will be used to do logistic regression.
    It is very similar to lm() but some output might be different.

The default link in glm() with family = binomial is the logit link.

fit1 <- glm(HD~SBP, fram_data, family=binomial(logit)) 
summary(fit1)

To see the prob function estimated by glm:

  • logit = -3.66 + 0.0159 SBP

  • $P(HD = 1 \mid SBP) = \frac{e^{-3.66+0.0159 \times SBP}}{1+e^{-3.66+0.0159 \times SBP}}$

  • $P(HD = 0 \mid SBP) = \frac{1}{1+e^{-3.66+0.0159 \times SBP}}$

  • Notice that the probability of Y = 1 is an increasing function of SBP since $\hat\beta_1 = 0.0159 > 0$.

  • Unfortunately $\beta_1$ has no simple interpretation directly in terms of Prob(Y = 1); its effect on the probability depends on the value of SBP.

par(mfrow=c(1,1))
plot(fram_data$SBP, fit1$fitted.values, pch=16, 
     xlab="SBP",
     ylab="Prob of P(Y=1|SBP)")

Alternatively, we can plot the probability by evaluating $\frac{e^{-3.66+0.0159 \times SBP}}{1+e^{-3.66+0.0159 \times SBP}}$ over a grid of SBP values:

x <- seq(100, 300, by=1)
y <- exp(-3.66+0.0159*x)/(1+exp(-3.66+0.0159*x))
plot(x, y, pch=16, type = "l",
     xlab = "SBP",
     ylab = "Prob of P(Y=1|SBP)" )

Inference for the Coefficients

i. Wald intervals/tests (through the MLE’s)

Facts about MLE:

  • The MLE’s are approximately normal and approximately unbiased estimators of the $\beta$'s.
  • The standard errors are obtained through the information matrix.
  • The $z$ intervals and $z$ tests are valid for each $\beta_i$.

summary(fit1)

Usual z-intervals for the coefficients (based on the output from the summary):

confint.default(fit1)

ii. Likelihood Ratio Tests

Similar to F-tests in OLS (Ordinary Least Squares), we have chi-square tests to test whether a set of variables is needed.

Here:

$$H_0: \beta_{SBP}=0 \quad \text{vs.} \quad H_1: \beta_{SBP} \neq 0$$

Facts about MLE: Likelihood Ratio Tests

  • Under $H_0$, the likelihood ratio statistic (on the log scale) satisfies
    $$\begin{split}
    \text{Test stat} = \chi^2
    &= -2 \log \frac{\max_{H_0} \mathcal{Lik}(\beta_0, \beta_1 \vert D)}{\max_{H_1} \mathcal{Lik}(\beta_0, \beta_1 \vert D)}\\
    &= -2\log(\mathcal{Lik}_{H_0}) - \big(-2\log(\mathcal{Lik}_{H_1})\big)\\
    &\sim \chi^2_{df=1}
    \end{split}$$

  • Test stat = Null Deviance − Residual Deviance, where Deviance $= -2\log(\mathcal{Lik})$.

  • The p-value is obtained from the $\chi^2$ distribution:
    $$p_{value}=P(\chi^2_{df} > \chi^2_{obs})$$

  • glm() outputs the two terms.

summary(fit1)  

Notice:

  • Null deviance = 1485.9

  • Residual deviance: 1432.8

  • $\chi^2 = 1485.9 - 1432.8 = 53.1$

Get the $\chi^2$ statistic

chi.sq <- 1485.9-1432.8

The $p_{value}$ from the Likelihood Ratio test:

pchisq(chi.sq, 1, lower.tail=FALSE)

Or use anova()/Anova() (the latter from the car package) to get the $\chi^2$ tests. Here the null hypothesis is that SBP is not needed, i.e., all coefficients other than the intercept are 0.

anova(fit1, test="Chisq") # 

This is similar to the F-test in the lm() setup.

Anova(fit1)   

Inverting the likelihood ratio tests gives confidence intervals for the coefficients. These should be similar to the Wald (z) intervals obtained above, but they are not identical.

confint(fit1)

iii. Prediction

To do a prediction based on fit1 we plug the new subject's SBP value into the estimated probability equation.

fit1.predict <- predict(fit1, fram_data.new, type="response")
fit1.predict
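
The same number can be computed by hand by plugging the new subject's SBP into the formula above (a small check; predict() with type = "link" returns the linear predictor, i.e., the logit):

eta <- predict(fit1, fram_data.new, type = "link")   # -3.66 + 0.0159 * SBP
exp(eta) / (1 + exp(eta))                            # matches fit1.predict above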

3. Multiple Logistic Regression

To avoid problems, we will delete all cases with missing values

fram_data.f <- na.omit(fram_data)

a. With all the predictors

i. Full model

$$logit = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p$$
$$P(HD = 1) = \frac{e^{\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p}}{1 + e^{\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p}}$$

  • MLE’s are obtained by maximizing the likelihood function

  • Wald tests/intervals hold for each coefficient

  • Likelihood Ratio tests hold for a set of predictors

With all the predictors

fit2 <- glm(HD~., fram_data.f, family=binomial)
summary(fit2)

  • The probability of Y = 1 given all the factors is estimated as
    $$P(HD = 1) = \frac{e^{-9.33480 + 0.06249 \times AGE + 0.90610\, I_{Male} + \dots + 0.01231 \times CIG}}{1 + e^{-9.33480 + 0.06249 \times AGE + 0.90610\, I_{Male} + \dots + 0.01231 \times CIG}}$$

  • Are all the risk factors useless? In other words, are all the $\beta$'s (other than the intercept) equal to 0?

chi.sq <- 1469-1343
chi.sq  #126
pvalue <- pchisq(chi.sq, 7, lower.tail=FALSE)
pvalue

We reject the null hypothesis at any reasonably small $\alpha$.

Alternatively, we may use anova to get the $\chi^2$ test by fitting the reduced model vs. the full model:

fit0 <- glm(HD~1, fram_data.f, family=binomial) # null 

This works only if the two fits use the same samples.

anova(fit0, fit2, test="Chisq")

ii. Are CIG and FRW not needed?

Use a likelihood ratio test to see whether CIG and FRW can be dropped from fit2.

fit3 <- update(fit2, .~. -CIG -FRW)
summary(fit3)

We first calculate the Chi-square stat directly:

chi.sq.2 <- fit3$deviance-fit2$deviance
chi.sq.2
pchisq(chi.sq.2, 2, lower.tail=FALSE) # 0.06194729

We fail to reject $H_0$ at $\alpha = 0.05$, but we do reject the null at $\alpha = 0.1$ since the p-value is < 0.1.

We could use anova…

anova(fit3, fit2, test="Chisq")    

Once again we don’t have evidence to reject the null at 0.05.

b. Model selection

i. Backward elimination

Backward selection is the easiest (without any packages) if p is not too large. As in lm, we eliminate the variable with the highest p-value.

summary(fit2)
fit2.1 <- update(fit2, .~. -DBP)

Backward selection proceeds by kicking DBP (the one with the largest p-value) out:

summary(fit2.1)
fit2.2 <- update(fit2.1, .~. -FRW)
summary(fit2.2)

CIG didn’t really add much to decrease the residual deviance

fit2.3 <- update(fit2.2, .~. -CIG)
summary(fit2.3)

Prediction for the subject using fit2.3

fit2.3.predict <- predict(fit2.3, fram_data.new, type="response")
fit2.3.predict

Notice the difference between the predictions from fit2.3 and fit1!

predict(fit1, fram_data.new, type="response")

ii. AIC through bestglm()

$$AIC = -2\log(\mathcal{L}) + 2d$$
where $d$ is the total number of parameters. We are looking for a model with small AIC.

All predictors

fit2 <- glm(HD~., fram_data.f, family=binomial)
summary(fit2)

In fit2:
Residual deviance $= -2\log(\mathcal{L})$, so
AIC $= -2\log(\mathcal{L}) + 2d$
$=$ Residual deviance $+ 2 \times 8$
$= 1343.1 + 16 = 1359.1$
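
We can check this arithmetic directly from the fit (a small check; length(coef(fit2)) counts the 8 parameters including the intercept):

fit2$deviance + 2 * length(coef(fit2))   # residual deviance + 2d
AIC(fit2)                                # should agree, about 1359.1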

Which model(s) gives us the smallest AIC?

bestglm() will be used

  • The syntax is similar to that of leaps.
  • We know bestglm less well compared with regsubsets.

Get the design matrix without 1’s and HD.

Xy <- model.matrix(HD ~.+0, fram_data.f) 

Attach y as the last column.

Xy <- data.frame(Xy, fram_data.f$HD)   
str(Xy)
fit.all <- bestglm(Xy, family = binomial, method = "exhaustive", IC="AIC", nvmax = 10) # method = "exhaustive", "forward" or "backward"
names(fit.all) # fit.all$Subsets to list best submodels

List the top 5 models. Any one of the following models could be used.

fit.all$BestModels  

Fit the best model, which is the same as fitting it directly:

summary(fit.all$BestModel)
summary(glm(HD~AGE+SEX+SBP+CHOL+FRW+CIG, family=binomial, data=fram_data.f))

iii. More models

  • Extend models to include interactions

Are there interaction effects of SBP by SEX, or of AGE by SEX?

Anova(glm(HD ~ AGE + SBP + SEX + SBP*SEX + AGE * SEX, fram_data, family = binomial))
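
To test formally whether the two interaction terms add anything beyond the main effects, we can compare nested models with a likelihood ratio test (a sketch along the same lines as before; the model with AGE, SBP and SEX only serves as the reduced fit):

fit.main <- glm(HD ~ AGE + SBP + SEX, fram_data, family = binomial)
fit.int  <- glm(HD ~ AGE + SBP + SEX + SBP:SEX + AGE:SEX, fram_data, family = binomial)
anova(fit.main, fit.int, test = "Chisq")   # chi-square test with 2 df for the interactions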

# Linear Discriminant Analysis

Case Study: Riding Mowers

A riding-mower manufacturer would like to find a way of classifying families in a city into those likely to purchase a riding mower and those not likely to buy one. A pilot random sample is undertaken of 12 owners and 12 nonowners in the city.

mower <- read.csv('RidingMowers.csv')
str(mower)
names(mower)
summary(mower)

EDA

Make a scatter plot for this dataset.

We can think of a linear classification rule as a line that separates the two-dimensional region into two parts, with most of the owners in one half-plane and most nonowners in the complementary half-plane.

ggplot(mower) +
  geom_point(aes(x = Income, y = Lot_Size, col = Ownership), size = 3.5) +
  xlab("Income($000s)") + 
  ylab("Lot Size(000s sqft)")

LDA

Classification Function

We use the DiscriMiner package to do LDA; its linDA() function performs Linear Discriminant Analysis.

It outputs the fitted LDA model together with the classification error rate and the confusion table.

da.reg1 <- linDA(mower[,1:2], mower[,3])
#da.reg1 <- linDA(mower[,1:2], mower[,3],prior = c(1/2,1/2))
names(da.reg1)
da.reg1$functions

To classify a family into the class of owners or nonowners, we use the classification functions to compute the family’s classification scores.

A family is classified into the class of owners if the owner function score is higher than the nonowner function score, and into nonowners if the reverse is the case.

$$\begin{aligned}
\hat{c}(Nonowner \mid Income, Lot\_Size) &= -51.42 + 0.3294 \times Income + 4.682 \times Lot\_Size\\
\hat{c}(Owner \mid Income, Lot\_Size) &= -73.16 + 0.4296 \times Income + 5.467 \times Lot\_Size
\end{aligned}$$

An alternative way of classifying a record is to compute the probability of belonging to each class and assign the record to the most likely class.

$$\begin{aligned}
P(Nonowner \mid Income, Lot\_Size) &= \frac{e^{\hat{c}(Nonowner \mid Income, Lot\_Size)}}{e^{\hat{c}(Owner \mid Income, Lot\_Size)} + e^{\hat{c}(Nonowner \mid Income, Lot\_Size)}}\\
P(Owner \mid Income, Lot\_Size) &= \frac{e^{\hat{c}(Owner \mid Income, Lot\_Size)}}{e^{\hat{c}(Owner \mid Income, Lot\_Size)} + e^{\hat{c}(Nonowner \mid Income, Lot\_Size)}}
\end{aligned}$$

propensity.owner <- exp(da.reg1$scores[,2])/(exp(da.reg1$scores[,1])+exp(da.reg1$scores[,2]))
output1 <- data.frame(Actual=mower$Ownership,
           Pred=da.reg1$classification, 
           da.reg1$scores, 
           propensity.owner=propensity.owner)
output1

Classification Boundary

We could also calculate the classification boundary.

$$\begin{aligned}
\hat{c}(Nonowner \mid Income, Lot\_Size) &= \hat{c}(Owner \mid Income, Lot\_Size)\\
-51.42 + 0.3294 \times Income + 4.682 \times Lot\_Size &= -73.16 + 0.4296 \times Income + 5.467 \times Lot\_Size \\
0.1002 \times Income + 0.785 \times Lot\_Size &= 21.74\\
Lot\_Size &= 27.69 - 0.1276 \times Income
\end{aligned}$$

ggplot(mower) +
  geom_point(aes(x = Income, y = Lot_Size, col = Ownership), size = 3.5) +
  geom_abline(intercept = 27.69, slope = -0.1276, color = 'orange', linetype= 'dashed', size = 1.2) + 
  xlab("Income($000s)") + 
  ylab("Lot Size(000s sqft)")
Prediction

We use classify() to do a prediction. For instance, consider a new household with an income of $60K and a lot size of 18.4K $ft^2$ (we reuse the first row of mower as a template).

newmower <- mower[1,]
newmower[1] <- 60
newmower[2] <- 18.4
newmower[3] <- NA   # Ownership is unknown for this new record
newmower
pred1 <- classify(da.reg1,as.vector(newmower[1:2]))
pred1

$$\begin{aligned}
\hat{c}(Nonowner \mid Income, Lot\_Size) &= -51.42 + 0.3294 \times 60 + 4.682 \times 18.4 = 54.48\\
\hat{c}(Owner \mid Income, Lot\_Size) &= -73.16 + 0.4296 \times 60 + 5.467 \times 18.4 = 53.20
\end{aligned}$$

Since the nonowner score (54.48) exceeds the owner score (53.20), this household is classified as a nonowner.

Below we discuss some concepts related to classification.

a. Sensitivity

$$Prob(\hat Y = 1 \mid Y=1)$$

This is not an error rate. It is also called the True Positive Rate: the proportion of correct positive classifications among cases that are truly positive.

b. Specificity

$$Prob(\hat Y = 0 \mid Y=0)$$

Specificity: the proportion of correct negative classifications among cases that are truly negative.

c. False Positive

$$1 - Specificity = P(\hat Y=1 \mid Y=0)$$

False Positive Rate: the proportion of wrong (positive) classifications among cases that are truly negative.

d. Misclassification error

Mean value of misclassifications:

$$MCE= \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{\hat y_i \neq y_i\}.$$

We can get all these quantities through the confusion matrix, or compute the misclassification error directly, as shown below.
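
For example, the misclassification error for da.reg1 can be computed directly by comparing the predicted labels with the true ownership (a small check):

mce <- mean(da.reg1$classification != mower$Ownership)   # proportion of misclassified households
mce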

f. Confusion Matrix

We can use table() to create a 2-by-2 table which summarizes the number of agreed and mismatched labels.

table(da.reg1$classification,mower$Ownership)

We could use $confusion to get the confusion matrix.

da.reg1$confusion

We could also use the confusionMatrix() function from the caret package to summarize the same information.

confusionMatrix(da.reg1$classification, factor(mower$Ownership))  # wrap Ownership in factor() in case it was read in as character

g. The ROC Curve and AUC

For each model, a given threshold defines a classifier, and each classifier yields a pair of sensitivity and specificity values.

By varying the threshold and plotting the false positive rate on the x-axis against the true positive rate on the y-axis, we obtain a curve: the ROC curve.

We use the roc function from the package pROC.

Notice that the ROC curve drawn by roc() here plots Sensitivity vs. Specificity. Most ROC curves are drawn with the false positive rate on the x-axis.

fit.roc1 <- roc(mower$Ownership, output1$propensity.owner, plot = T, col = 'blue')
fit.roc1$auc
#auc(fit.roc1)

The false positive rate vs. sensitivity version, used in most ROC plots:

plot(1-fit.roc1$specificities, fit.roc1$sensitivities, col="red", pch=16,
     xlab="False Positive", 
     ylab="Sensitivity")

We can get more from fit.roc1. For example, we can plot the probability thresholds used against the corresponding false positive rates.

plot(fit.roc1$thresholds, 1-fit.roc1$specificities,  col="green", pch=16,  
     xlab="Threshold on prob",
     ylab="False Positive",
     main = "Thresholds vs. False Postive")

h. Positive Prediction

Positive Prediction is a measure of accuracy among the cases classified as positive.

Positive Prediction = $P(\text{Positive} \mid \text{Classified as Positive})$

For da.reg1, recall the confusion matrix:

cm.1 <- table(da.reg1$classification, mower$Ownership)   # rows = predicted class, columns = actual class
cm.1
positive.pred <- cm.1[1, 1] / (cm.1[1, 1] + cm.1[1, 2])  # the first class (Nonowner, alphabetically) is treated as "positive" here
positive.pred

i. Negative Prediction

Negative Prediction = $P(\text{Negative} \mid \text{Classified as Negative})$

negative.pred <- cm.1[2, 2] / (cm.1[2, 1] + cm.1[2, 2])
negative.pred

# Principal Component Analysis

Case study: Measurement of Intelligence

Goal: How do people differ in intelligence?

  • Our data set IQ.csv is a subset of individuals from the 1979 National Longitudinal Study of Youth (NLSY79) survey who were re-interviewed in 2006.
  • Information about family, personal demographics such as gender, race and education level, plus a set of ASVAB (Armed Services Vocational Aptitude Battery) test scores.
  • It is STILL used as a screening test for those who want to join the US army!

The test has the following components:

  • Science, Arith (Arithmetic reasoning), Word (Word knowledge), Parag (Paragraph comprehension), Numer (Numerical operation), Coding (Coding speed), Auto (Automotive and Shop information), Math (Math knowledge), Mechanic (Mechanic Comprehension) and Elec (Electronic information).
  • Lastly AFQT (Armed Forces Qualifying Test) is a combination of Word, Parag, Math and Arith.
  • Note: Service branch requirements: Army 31, Navy 35, Marines 31, Air Force 36, and Coast Guard 45 (out of a maximum of 100). My prediction is that all of us pass the requirements, even for the Coast Guard :)

Our goal is to see how we can summarize the set of tests and efficiently capture the main information about each person's intelligence.

Note: One of the original study goals is to see how intelligence relates to one's future success as measured by income in 2005.

0) Get a quick look at the data

data1 <- read.csv("IQ.Full.csv")
dim(data1)
names(data1)
summary(data1)

1) We first concentrate on the AFQT tests: Word, Math, Parag and Arith

For simplicity we take a subset of 50 subjects.

set.seed(1)
data.AFQT <- data1[sample(nrow(data1), 50, replace=FALSE), c("Word","Parag", "Math", "Arith")]
str(data.AFQT)
summary(data.AFQT)

We expect the four test scores to be correlated with each other.

pairs(data.AFQT, pch=16, col="blue")   # shows pairwise relationship between two sets of scores

rownames(data.AFQT)  # label for each person
rownames(data.AFQT) <-  paste("p", seq(1:nrow(data.AFQT)), sep="")  # reassign everyone's labels to be shorter.
rownames(data.AFQT)

2) Scaling and centering variables

sapply(data.AFQT, sd)  # sd's for each test
sapply(data.AFQT, mean) # means for each test
colMeans((data.AFQT))
# What are the variances and correlations?
var(data.AFQT)  
cor(data.AFQT)

Each test score has its own variance, and in general the variables of interest may be measured in different units.
We can center each test to mean 0 and scale it to sd 1 by subtracting the test's mean and dividing by its sd. After this, the covariance matrix and the correlation matrix are the same.

data.AFQT.scale <- scale(data.AFQT, center=TRUE, scale=TRUE)  #default
is.matrix(data.AFQT.scale) # it turns a data frame to a matrix

Now the mean of each test will be 0, and the var matrix will be same as the cor matrix.

data.AFQT.scale <- as.data.frame(data.AFQT.scale)
colMeans(data.AFQT.scale)  # all zeros
sapply(data.AFQT.scale, sd)  # all 1's
# Now the var matrix and cor matrix are the same after scaled.
var(data.AFQT.scale)   
cor(data.AFQT.scale)
pairs(data.AFQT.scale, pch=16, col="blue")

3) Principal Components: dimension reduction

i) For simplicity let’s first look at one pair of the scores

# Parag and Word
par(mfrow=c(1,1))
plot(data.AFQT$Parag, data.AFQT$Word, pch=16,
     xlab="Parag",
     ylab="Word",
     xlim=c(0, 40), 
     ylim=c(0, 40))

Questions:

i) How can we use one score which combines both Parag and Word linearly, such that it will give us the largest variance?

ii) Can we find a line which is closest to all the points?

This is equivalent to finding $\phi_{11}$ and $\phi_{21}$ such that
$$Var(Z_1)= Var(\phi_{11} X_1+ \phi_{21} X_2)$$
is maximized subject to the constraint $\phi_{11}^2 + \phi_{21}^2 = 1$.

Here X1 and X2 are centered and scaled Parag and Word scores.
$$X_1=(Parag - mean(Parag))/sd(Parag)$$
$$X_2=(Word - mean(Word))/sd(Word)$$

attach(data.AFQT)
mean(Parag) #11.16
sd(Parag)   # 3.19
mean(Word)  #26.64
sd(Word)    #7.25

X1=(Parag - mean(Parag))/sd(Parag) 
X2=(Word - mean(Word))/sd(Word)
# Same as
scale(data.AFQT[, c("Parag", "Word") ], 
      center=TRUE, scale=TRUE) # By default center=TRUE and scale=TRUE

plot(X1, X2, pch=16, 
     xlab="Parag centered and scaled",
     ylab="Word centered and scaled" )
abline(h=0, v=0, lwd=4)   # better make the x y lim to be the same


plot(X1, X2, pch=16, xlim=c(-3, 3), ylim=c(-3, 3),
     xlab="Parag centered and scaled",
     ylab="Word centered and scaled" )
abline(h=0, v=0, lwd=4)
par(mfrow=c(1,1))     

plot(X1, X2, pch=16, xlim=c(-3, 3), ylim=c(-3, 3),
     xlab="Parag centered and scaled",
     ylab="Word centered and scaled" )
abline(0, 1, lwd=4, col="red")  
abline(0, -1, lwd=4, col="blue") 

ii) Terminology

$Z_1$: the first principal component.

$(\phi_{11}, \phi_{21})$ are called the loadings.

The entire vector $Z_1$, with one value for each person, is called the PC scores.

# Here we go: PCA
pc.parag.word <- prcomp(data.AFQT[, c("Parag", "Word")], scale=TRUE)
names(pc.parag.word)
pc.parag.word$rotation # Loadings = phi's
phi_11 <- 0.7071068
phi_21 <- 0.7071068
pc.parag.word$x  # PC scores: the first column = Z1

Z1 <- phi_11 * X1 + phi_21 * X2  # PC scores. The two scores should be the same possibly by a different sign
Z1.1 <- Z1
max(abs(pc.parag.word$x[, 1]-Z1.1))   # Z1 and pc.parag.word$x[, 1] are the same.

pc.parag.word$sdev  # sd(Z1)
pc.parag.word$center  # means of the original scores

iii) Second Principal Component

Similar to the first principal component, we are now looking for $\phi_{12}$ and $\phi_{22}$ such that
$$Z_2=\phi_{12} X_1 + \phi_{22} X_2$$
where

$$Var(Z_2) = Var(\phi_{12} X_1 + \phi_{22} X_2)$$

is maximized subject to the constraints $\phi_{12}^2+\phi_{22}^2=1$ and that $Z_2$ and $Z_1$ are orthogonal.

$Z_2$: the second principal component (PC2).

# The loadings
pc.parag.word$rotation # Loadings
Z2 <- pc.parag.word$x[,2]
# Let us verify
plot(Z1, Z2, pch=16,
     xlab="First Principal Component Z1",
     ylab="Second Principal Component Z2", 
     xlim=c(-4, 4),
     ylim=c(-4, 4))
abline(v=0, h=0, lwd=3, col="red")
# The sd's 
pc.parag.word$sdev  # sd's of pc scores

  1. The principal component plot is a rotation of the original X1, X2 plot
  2. $var(Z_1)=1.74986 > var(Z_2)=0.2501404$
  3. $Z_1$ and $Z_2$ are orthogonal

cor(Z1, Z2) # is 0

Principal Components of the four tests: Word, Math, Parag and Arith

Question:

  1. How can we best capture the performance based on the four tests?
  2. Can we come up with some sensible scores combining the four tests together?

PCs:
We look for linear combinations of X1=Word, X2=Parag, X3=Math, and X4=Arith that have maximal variance.

First Principal Component is

$$Z_1=\phi_{11} X_1 + \phi_{21} X_2 + \phi_{31} X_3 + \phi_{41} X_4$$

such that $Var(Z_1)$ is maximized with $\sum_i \phi_{i1}^2 = 1$.

Second Principal Component is

$$Z_2=\phi_{12} X_1 + \phi_{22} X_2 + \phi_{32} X_3 + \phi_{42} X_4$$
such that $Var(Z_2)$ is maximized, $\sum_i \phi_{i2}^2=1$, and $Z_2$ and $Z_1$ are uncorrelated (orthogonal). This is equivalent to requiring the loading vectors $\{\phi_{i1}\}$ and $\{\phi_{i2}\}$ to be orthogonal.

We keep going to obtain Z3 and Z4.

  1. Scale and Center each score

    To find sensible PC’s, we recommend to

    • center each variable by subtracting its mean.
    • scale each variable by dividing by its sd.
    • Both can be achieved simultaneously with scale().
    • prcomp() has an option to scale or not.

    Ready to get PC’s for the data set
    names(data.AFQT) # four tests
    summary(data.AFQT) # unscaled yet

data.AFQT.scale <- scale(data.AFQT, center=TRUE, scale = TRUE)
summary(data.AFQT.scale)
data.AFQT.scale <- as.data.frame(data.AFQT.scale)  # set it back as a data frame
names(data.AFQT.scale)
attach(data.AFQT.scale)
  2. Use prcomp()

We input the original variables. BUT set scale=TRUE

pc.4 <- prcomp(data.AFQT, scale=TRUE)  # by default, center=TRUE but scale=FALSE!
names(pc.4)
summary(pc.4)
var(pc.4$x[, 1])

# Loadings (directions)
round(pc.4$rotation, 5) 
# PC1 scores: 
phi1 <- pc.4$rotation[,1]   # PC1 loadings, unique up to the sign 
Z1.1 <- phi1[1]*Word+phi1[2]*Parag+phi1[3]*Math+phi1[4]*Arith
# Essentially it says that we should take the sum of the four test scores (after scaled)
# This is same as 
Z1 <- pc.4$x[, 1]
max(abs(Z1.1-Z1))  # To convince you that two principal scores are the same.

# PC2 scores
Z2 <- pc.4$x[,2]

Interpretations of the pc scores:

plot(pc.4$x[, 1], pc.4$x[,2 ], pch=16,
     xlim=c(-4, 4),
     ylim=c(-4, 4),
     main="The leading two principal components",
     xlab="Z1=PC1", 
     ylab="Z2=PC2"
     )
abline(h=0, v=0)

Z1: total scores of the 4 tests

Z2: the sum of Word and Parag minus the sum of Math and Arith

Remark:

i) Two scores Z1 and Z2 capture the main features of the four tests

ii) How much information might we have lost by using only the two PC scores?

NOT MUCH…
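
To quantify how little is lost, we can look at the cumulative proportion of variance explained by the first two PCs (a quick check using the importance table returned by summary(pc.4); row 3 is the cumulative proportion):

summary(pc.4)$importance[3, 1:2]   # cumulative PVE of PC1 alone and of PC1 + PC2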

Properties of the pc scores:

i) var(Z1) > var(Z2) …

sd(Z1)
sd(Z2)   # they are reported in 
pc.4$sdev  # standard dev's of all PC's in a decreasing order

ii) All 4 PC scores are uncorrelated.

cor(pc.4$x)   # cor's are 0
pairs(pc.4$x, xlim=c(-4, 4), ylim=c(-4, 4), col=rainbow(6), pch=16)

They are all pairwise uncorrelated!

iii) Proportion of variance explained (PVE)

pve.4 <- 100* (pc.4$sdev)^2/sum ((pc.4$sdev)^2)
plot(pve.4, pch=16, 
     xlab="Principal Components",
     ylab="Prop. of variance explained")

The leading component explains pve.4[1]=75% of the total variance.

The cumulative proportion of variance explained keeps track of the PVE using the first PC, the first two PCs, and so on.

cpve.4 <- 100*cumsum((pc.4$sdev)^2)/4   # cumulative proportions of the variance explained by 
  # the number of pc's used

plot(seq(1:4), cpve.4, pch=16, ylim=c(0, 100),
     main="Cumulative Proportion of Variance Explained",
     xlab="Number of PC's Used")

names(summary(pc.4))
summary(pc.4)$importance

# Scree plot of CPVE's
plot(summary(pc.4)$importance[3, ], pch=16,
     ylab="Cumulative PVE",
     xlab="Number of PC's",
     main="Scree Plot of PCA for AFQT")

plot(pc.4) # variances of each pc
screeplot(pc.4) # var's each pc's

All the information is stored in summary(pc.4)

iv) biplot: visualize the PC scores together with the loadings of the original variables

lim <- c(-.4, .4) 
biplot(pc.4, xlim=lim,
       ylim=lim,
       main="Biplot of the PC's")
abline(v=0, h=0)
# x-axis: PC1=Z1 (prop)
# y-axis: PC2=Z2
# top x-axis: prop to loadings for PC1
# right y-axis: prop to loadings for PC2

Summary:

  1. To capture the main features of the four AFQT scores we can use two summaries:

    • the (weighted) total score
    • the difference between Math + Arith and Word + Parag

Exercise: Perform PCA of all 10 tests

Does Gender play any role in the test scores?

Part II: Calculation of the PC’s

The PCs are nothing but the eigenvectors/eigenvalues of $Cor(X_1, X_2, \ldots, X_p)$ (if scaled) or $Cov(X_1, \ldots, X_p)$ (unscaled); that is, the PCA loadings are the eigenvectors of the correlation matrix.

PC.eig <- eigen(cor(data.AFQT))
PC.eig
summary(PC.eig)
PC.eig$vectors   # Loadings
PC.eig$values    # Variances of each PC's 
# We use prcomp() here
PC <- prcomp(data.AFQT, scale=TRUE)
PC  # should be exactly same as PC's from eigen decomposition (up to the sign)
phi <- PC$rotation
phi
cbind(PC.eig$vectors, PC$rotation) # Putting the loadings from both approaches side by side
# one set from the eigen decomposition, the other from prcomp(). They should be exactly
# the same (up to sign), and they are.

phi, as we expected, is an orthogonal matrix, i.e., its columns are mutually orthogonal and have norm 1. Also,

$$inv(\phi)=t(\phi)$$

This can be seen from

phi %*% t(phi)   # matrix operator %*%. It should be an identity matrix
round(phi %*% t(phi), 2) # better seen if we round it.
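
Equivalently, since phi is orthogonal, its inverse computed with solve() should match its transpose (a quick check):

max(abs(solve(phi) - t(phi)))   # essentially 0, confirming inv(phi) = t(phi)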