The caret R package
http://caret.r-forge.r-project.org/
Caret functionality
- Some preprocessing (cleaning)
  - preProcess
- Data splitting
  - createDataPartition
  - createResample
  - createTimeSlices
- Training/testing functions
  - train
  - predict
- Model comparison
  - confusionMatrix
Machine learning algorithms in R
- Linear discriminant analysis
- Regression
- Naive Bayes
- Support vector machines
- Classification and regression trees
- Random forests
- Boosting
- etc.
Why caret?
http://www.edii.uclm.es/~useR-2013/Tutorials/kuhn/user_caret_2up.pdf
SPAM Example: Data splitting
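The dimension printed below is for the training set; a minimal sketch of the splitting code, assuming the standard kernlab spam data and a 75/25 split:

```r
library(caret)
library(kernlab)
data(spam)

# Hold out 25% of the rows; createDataPartition samples within outcome classes
inTrain <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)
training <- spam[inTrain, ]
testing <- spam[-inTrain, ]
dim(training)
```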
[1] 3451 58
SPAM Example: Fit a model
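A sketch of the fitting call that likely produced the summary below (the seed value is an arbitrary choice for reproducibility):

```r
set.seed(32343)
# Logistic regression on all 57 predictors; train() adds bootstrap resampling
modelFit <- train(type ~ ., data = training, method = "glm")
modelFit
```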
Generalized Linear Model
3451 samples
57 predictors
2 classes: 'nonspam', 'spam'
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 3451, 3451, 3451, 3451, 3451, 3451, ...
Resampling results
Accuracy Kappa Accuracy SD Kappa SD
0.9 0.8 0.02 0.04
SPAM Example: Final model
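The coefficients below come from the underlying glm object, accessible as:

```r
# The fitted glm stored inside the train object
modelFit$finalModel
```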
Call: NULL
Coefficients:
(Intercept) make address all num3d
-1.78e+00 -7.76e-01 -1.39e-01 3.68e-02 1.94e+00
our over remove internet order
7.61e-01 6.66e-01 2.34e+00 5.94e-01 4.10e-01
mail receive will people report
4.08e-02 2.71e-01 -1.08e-01 -2.28e-01 -1.14e-01
addresses free business email you
2.16e+00 8.78e-01 6.49e-01 1.38e-01 6.91e-02
credit your font num000 money
8.00e-01 2.17e-01 2.17e-01 2.04e+00 1.95e+00
hp hpl george num650 lab
-1.82e+00 -9.17e-01 -7.50e+00 3.33e-01 -1.89e+00
labs telnet num857 data num415
-7.15e-02 -2.33e-01 1.15e+00 -7.58e-01 1.33e+00
num85 technology num1999 parts pm
-2.15e+00 1.27e+00 -6.55e-02 9.83e-01 -8.28e-01
direct cs meeting original project
-2.69e-01 -4.55e+01 -2.50e+00 -6.93e-01 -2.01e+00
re edu table conference charSemicolon
-5.54e-01 -1.94e+00 -1.32e+00 -3.83e+00 -1.21e+00
charRoundbracket charSquarebracket charExclamation charDollar charHash
-1.92e-01 -1.92e-01 5.27e-01 6.86e+00 2.11e+00
capitalAve capitalLong capitalTotal
3.60e-02 1.05e-02 8.13e-04
Degrees of Freedom: 3450 Total (i.e. Null); 3393 Residual
Null Deviance: 4630
Residual Deviance: 1260 AIC: 1370
SPAM Example: Prediction
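A sketch of the prediction call:

```r
# Predict the class of each message in the held-out test set
predictions <- predict(modelFit, newdata = testing)
predictions
```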
[1] spam spam spam nonspam nonspam nonspam spam spam spam spam spam
[12] spam spam spam spam spam spam spam nonspam spam spam spam
[23] nonspam spam nonspam nonspam spam spam spam spam spam spam spam
[34] spam spam spam spam spam spam spam spam spam spam spam
[45] spam spam spam spam nonspam spam nonspam spam spam spam spam
[56] spam nonspam nonspam spam spam spam spam spam nonspam spam spam
[67] spam spam spam spam spam spam spam spam spam spam spam
[78] nonspam nonspam nonspam spam spam nonspam spam nonspam nonspam spam spam
[89] spam spam spam spam spam spam nonspam spam spam spam spam
[100] spam spam spam nonspam spam nonspam spam spam spam spam spam
[111] spam spam spam spam nonspam spam spam spam spam spam spam
[122] spam spam spam spam spam spam spam nonspam spam spam nonspam
[133] spam spam spam spam spam spam spam spam spam spam spam
[144] spam spam spam nonspam spam spam spam spam spam spam spam
[155] nonspam spam nonspam spam nonspam spam spam spam spam spam spam
[166] spam spam spam spam spam spam spam spam spam spam spam
[177] spam spam spam spam spam spam spam spam spam spam spam
[188] nonspam spam spam nonspam spam nonspam spam spam spam spam spam
[199] spam spam spam spam spam nonspam spam spam spam spam spam
[210] spam spam spam spam spam spam spam spam spam spam spam
[221] spam spam spam spam spam spam spam spam spam spam spam
[232] spam spam spam spam spam spam nonspam spam spam nonspam spam
[243] spam spam spam spam spam spam spam spam spam spam spam
[254] spam spam spam spam nonspam spam spam spam spam spam spam
[265] nonspam spam spam spam spam spam spam spam spam spam spam
[276] spam spam spam spam spam spam nonspam spam spam spam spam
[287] nonspam nonspam spam spam spam spam spam spam spam spam spam
[298] spam spam spam spam spam spam spam spam spam spam spam
[309] spam spam spam nonspam spam spam spam spam spam spam spam
[320] spam spam spam spam spam spam spam spam spam nonspam spam
[331] spam spam spam spam spam spam spam spam spam spam spam
[342] spam spam spam spam spam spam spam spam spam spam spam
[353] spam spam spam spam spam spam spam spam spam spam spam
[364] spam spam spam spam spam spam spam spam spam spam spam
[375] spam spam spam spam spam spam spam spam spam spam spam
[386] nonspam spam spam spam spam nonspam spam spam spam spam spam
[397] spam spam spam spam spam spam spam spam spam spam spam
[408] spam spam nonspam nonspam spam spam spam spam spam spam nonspam
[419] spam spam spam spam nonspam spam nonspam spam nonspam spam spam
[430] nonspam nonspam spam nonspam nonspam spam nonspam spam spam spam spam
[441] spam spam spam spam spam spam spam spam spam spam spam
[452] spam spam spam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
[463] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
[474] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
[485] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
[496] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam spam nonspam nonspam
[507] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
[518] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
[529] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
[540] nonspam nonspam spam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
[551] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
[562] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam spam nonspam nonspam
[573] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
[584] nonspam nonspam nonspam nonspam spam nonspam nonspam nonspam nonspam nonspam nonspam
[595] nonspam nonspam nonspam nonspam spam nonspam nonspam nonspam nonspam nonspam nonspam
[606] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
[617] spam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
[628] nonspam nonspam nonspam nonspam nonspam nonspam spam nonspam nonspam nonspam nonspam
[639] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
[650] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam spam nonspam nonspam
[661] nonspam nonspam nonspam nonspam spam nonspam nonspam nonspam nonspam nonspam nonspam
[672] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
[683] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
[694] nonspam nonspam nonspam nonspam nonspam spam nonspam nonspam nonspam nonspam nonspam
[705] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
[716] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
[727] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
[738] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
[749] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
[760] nonspam spam spam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
[771] nonspam nonspam spam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
[782] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
[793] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
[804] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
[815] nonspam nonspam nonspam spam spam nonspam nonspam nonspam nonspam nonspam nonspam
[826] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
[837] nonspam nonspam spam nonspam nonspam nonspam spam nonspam nonspam nonspam nonspam
[848] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
[859] nonspam spam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
[870] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
[881] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
[892] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
[903] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
[914] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
[925] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
[936] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
[947] nonspam spam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
[958] nonspam nonspam nonspam nonspam spam nonspam nonspam nonspam nonspam nonspam nonspam
[969] nonspam nonspam nonspam spam nonspam nonspam spam nonspam nonspam nonspam nonspam
[980] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
[991] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
[1002] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
[1013] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
[1024] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
[1035] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
[1046] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam spam
[1057] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam spam
[1068] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
[1079] nonspam nonspam nonspam nonspam nonspam nonspam spam nonspam nonspam nonspam nonspam
[1090] nonspam nonspam nonspam nonspam nonspam spam spam nonspam nonspam nonspam nonspam
[1101] nonspam nonspam nonspam spam spam nonspam nonspam nonspam nonspam spam nonspam
[1112] nonspam nonspam nonspam nonspam nonspam spam nonspam nonspam nonspam nonspam nonspam
[1123] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
[1134] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
[1145] nonspam nonspam nonspam nonspam nonspam nonspam
Levels: nonspam spam
SPAM Example: Confusion Matrix
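A sketch of the evaluation call:

```r
# Cross-tabulate predicted against true classes and compute accuracy statistics
confusionMatrix(predictions, testing$type)
```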
Confusion Matrix and Statistics
Reference
Prediction nonspam spam
nonspam 665 54
spam 32 399
Accuracy : 0.925
95% CI : (0.908, 0.94)
No Information Rate : 0.606
P-Value [Acc > NIR] : <2e-16
Kappa : 0.842
Mcnemar's Test P-Value : 0.0235
Sensitivity : 0.954
Specificity : 0.881
Pos Pred Value : 0.925
Neg Pred Value : 0.926
Prevalence : 0.606
Detection Rate : 0.578
Detection Prevalence : 0.625
Balanced Accuracy : 0.917
'Positive' Class : nonspam
Further information
- Caret tutorials:
- A paper introducing the caret package: Kuhn (2008), "Building Predictive Models in R Using the caret Package", Journal of Statistical Software 28(5)
SPAM Example: Data splitting
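This lecture re-creates the same 75/25 split as before; a sketch:

```r
library(caret)
library(kernlab)
data(spam)
inTrain <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)
training <- spam[inTrain, ]
testing <- spam[-inTrain, ]
dim(training)
```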
[1] 3451 58
SPAM Example: K-fold
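A sketch of the fold-creation code (seed arbitrary); with returnTrain = TRUE each fold holds the training indices, hence roughly 9/10 of the 4601 rows:

```r
set.seed(32323)
folds <- createFolds(y = spam$type, k = 10, list = TRUE, returnTrain = TRUE)
sapply(folds, length)   # size of each training fold
folds[[1]][1:10]        # first ten indices of fold 1
```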
Fold01 Fold02 Fold03 Fold04 Fold05 Fold06 Fold07 Fold08 Fold09 Fold10
4141 4140 4141 4142 4140 4142 4141 4141 4140 4141
[1] 1 2 3 4 5 6 7 8 9 10
SPAM Example: Return test
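The same call with returnTrain = FALSE returns the held-out test indices instead (roughly 1/10 of the rows per fold):

```r
set.seed(32323)
folds <- createFolds(y = spam$type, k = 10, list = TRUE, returnTrain = FALSE)
sapply(folds, length)
folds[[1]][1:10]
```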
Fold01 Fold02 Fold03 Fold04 Fold05 Fold06 Fold07 Fold08 Fold09 Fold10
460 461 460 459 461 459 460 460 461 460
[1] 24 27 32 40 41 43 55 58 63 68
SPAM Example: Resampling
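Bootstrap resamples are the same size as the data and drawn with replacement, which is why indices repeat below; a sketch:

```r
set.seed(32323)
folds <- createResample(y = spam$type, times = 10, list = TRUE)
sapply(folds, length)
folds[[1]][1:10]
```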
Resample01 Resample02 Resample03 Resample04 Resample05 Resample06 Resample07 Resample08 Resample09
4601 4601 4601 4601 4601 4601 4601 4601 4601
Resample10
4601
[1] 1 2 3 3 3 5 5 7 8 12
SPAM Example: Time Slices
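For time series, createTimeSlices builds sliding windows; a sketch using 20 observations to forecast the next 10:

```r
set.seed(32323)
tme <- 1:1000
folds <- createTimeSlices(y = tme, initialWindow = 20, horizon = 10)
names(folds)
folds$train[[1]]
folds$test[[1]]
```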
[1] "train" "test"
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
[1] 21 22 23 24 25 26 27 28 29 30
Further information
- Caret tutorials:
- A paper introducing the caret package: Kuhn (2008), "Building Predictive Models in R Using the caret Package", Journal of Statistical Software 28(5)
SPAM Example
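The running example is the same spam split and glm fit as in the previous lecture; a sketch:

```r
library(caret)
library(kernlab)
data(spam)
inTrain <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)
training <- spam[inTrain, ]
testing <- spam[-inTrain, ]
modelFit <- train(type ~ ., data = training, method = "glm")
```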
Train options
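The signature below matches the caret release used in these slides; one way to display it:

```r
library(caret)
args(train.default)   # if not exported in your caret version, use args(caret:::train.default)
```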
function (x, y, method = "rf", preProcess = NULL, ..., weights = NULL,
metric = ifelse(is.factor(y), "Accuracy", "RMSE"), maximize = ifelse(metric ==
"RMSE", FALSE, TRUE), trControl = trainControl(), tuneGrid = NULL,
tuneLength = 3)
NULL
Metric options
Continuous outcomes:
- RMSE = Root mean squared error
- RSquared = $R^2$ from regression models
Categorical outcomes:
- Accuracy = Fraction correct
- Kappa = A measure of concordance
trainControl
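Similarly, the trainControl defaults below can be displayed with:

```r
args(trainControl)   # args() prints the signature followed by NULL
```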
function (method = "boot", number = ifelse(method %in% c("cv",
"repeatedcv"), 10, 25), repeats = ifelse(method %in% c("cv",
"repeatedcv"), 1, number), p = 0.75, initialWindow = NULL,
horizon = 1, fixedWindow = TRUE, verboseIter = FALSE, returnData = TRUE,
returnResamp = "final", savePredictions = FALSE, classProbs = FALSE,
summaryFunction = defaultSummary, selectionFunction = "best",
custom = NULL, preProcOptions = list(thresh = 0.95, ICAcomp = 3,
k = 5), index = NULL, indexOut = NULL, timingSamps = 0,
predictionBounds = rep(FALSE, 2), seeds = NA, allowParallel = TRUE)
NULL
trainControl resampling
- method
  - boot = bootstrapping
  - boot632 = bootstrapping with adjustment
  - cv = cross validation
  - repeatedcv = repeated cross validation
  - LOOCV = leave one out cross validation
- number
  - For boot/cross validation
  - Number of subsamples to take
- repeats
  - Number of times to repeat subsampling
  - If this is large, it can slow things down
Setting the seed
- It is often useful to set an overall seed
- You can also set a seed for each resample
- Seeding each resample is useful for parallel fits
seed example
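A sketch of the seeded fit (the particular seed value is arbitrary):

```r
set.seed(1235)
modelFit2 <- train(type ~ ., data = training, method = "glm")
modelFit2
```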
Generalized Linear Model
3451 samples
57 predictors
2 classes: 'nonspam', 'spam'
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 3451, 3451, 3451, 3451, 3451, 3451, ...
Resampling results
Accuracy Kappa Accuracy SD Kappa SD
0.9 0.8 0.007 0.01
seed example
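Re-running with the same seed reproduces the resampling results above exactly:

```r
set.seed(1235)
modelFit3 <- train(type ~ ., data = training, method = "glm")
modelFit3
```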
Generalized Linear Model
3451 samples
57 predictors
2 classes: 'nonspam', 'spam'
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 3451, 3451, 3451, 3451, 3451, 3451, ...
Resampling results
Accuracy Kappa Accuracy SD Kappa SD
0.9 0.8 0.007 0.01
Further resources
Example: predicting wages
Image Credit http://www.cahs-media.org/the-high-cost-of-low-wages
Data from the ISLR package, which accompanies the book Introduction to Statistical Learning
Example: Wage data
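A sketch of the data loading and summary:

```r
library(ISLR)
library(ggplot2)
library(caret)
data(Wage)
summary(Wage)
```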
year age sex maritl race
Min. :2003 Min. :18.0 1. Male :3000 1. Never Married: 648 1. White:2480
1st Qu.:2004 1st Qu.:33.8 2. Female: 0 2. Married :2074 2. Black: 293
Median :2006 Median :42.0 3. Widowed : 19 3. Asian: 190
Mean :2006 Mean :42.4 4. Divorced : 204 4. Other: 37
3rd Qu.:2008 3rd Qu.:51.0 5. Separated : 55
Max. :2009 Max. :80.0
education region jobclass health
1. < HS Grad :268 2. Middle Atlantic :3000 1. Industrial :1544 1. <=Good : 858
2. HS Grad :971 1. New England : 0 2. Information:1456 2. >=Very Good:2142
3. Some College :650 3. East North Central: 0
4. College Grad :685 4. West North Central: 0
5. Advanced Degree:426 5. South Atlantic : 0
6. East South Central: 0
(Other) : 0
health_ins logwage wage
1. Yes:2083 Min. :3.00 Min. : 20.1
2. No : 917 1st Qu.:4.45 1st Qu.: 85.4
Median :4.65 Median :104.9
Mean :4.65 Mean :111.7
3rd Qu.:4.86 3rd Qu.:128.7
Max. :5.76 Max. :318.3
Get training/test sets
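The dimension below is for the test set; a sketch assuming a 70/30 split on the wage outcome:

```r
inTrain <- createDataPartition(y = Wage$wage, p = 0.7, list = FALSE)
training <- Wage[inTrain, ]
testing <- Wage[-inTrain, ]
dim(testing)
```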
[1] 898 12
Feature plot (caret package)
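A sketch of the feature plot (a pairs plot of selected predictors against wage):

```r
featurePlot(x = training[, c("age", "education", "jobclass")],
            y = training$wage, plot = "pairs")
```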
Qplot (ggplot2 package)
Qplot with color (ggplot2 package)
Add regression smoothers (ggplot2 package)
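The three qplot slides above presumably used calls along these lines (qplot is superseded in current ggplot2 but still works):

```r
library(ggplot2)

# Qplot: age versus wage
qplot(age, wage, data = training)

# Qplot with color: points colored by job class
qplot(age, wage, colour = jobclass, data = training)

# Add regression smoothers: one linear fit per education level
qq <- qplot(age, wage, colour = education, data = training)
qq + geom_smooth(method = "lm", formula = y ~ x)
```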
cut2, making factors (Hmisc package)
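A sketch of the cut2 call; g = 3 breaks wage into three groups with roughly equal counts:

```r
library(Hmisc)
cutWage <- cut2(training$wage, g = 3)
table(cutWage)
```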
cutWage
[ 20.1, 91.7) [ 91.7,118.9) [118.9,318.3]
704 725 673
Boxplots with cut2
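Boxplots of age within each wage group, sketched as:

```r
p1 <- qplot(cutWage, age, data = training, fill = cutWage, geom = "boxplot")
p1
```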
Boxplots with points overlaid
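Overlaying the (jittered) raw points shows how many observations each box summarizes; grid.arrange from gridExtra puts the two versions side by side:

```r
library(gridExtra)
p2 <- qplot(cutWage, age, data = training, fill = cutWage,
            geom = c("boxplot", "jitter"))
grid.arrange(p1, p2, ncol = 2)
```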
Tables
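A sketch of the tabulations; prop.table(t1, 1) converts the counts to row proportions:

```r
t1 <- table(cutWage, training$jobclass)
t1
prop.table(t1, 1)
```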
cutWage 1. Industrial 2. Information
[ 20.1, 91.7) 437 267
[ 91.7,118.9) 365 360
[118.9,318.3] 263 410
cutWage 1. Industrial 2. Information
[ 20.1, 91.7) 0.6207 0.3793
[ 91.7,118.9) 0.5034 0.4966
[118.9,318.3] 0.3908 0.6092
Density plots
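A sketch of the density plot:

```r
qplot(wage, colour = education, data = training, geom = "density")
```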
Notes and further reading
- Make your plots only in the training set
  - Don't use the test set for exploration!
- Things you should be looking for
  - Imbalance in outcomes/predictors
  - Outliers
  - Groups of points not explained by a predictor
  - Skewed variables
- ggplot2 tutorial
- caret visualizations
Why preprocess?
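This slide presumably showed a histogram of the skewed capitalAve variable after re-splitting the spam data; a sketch:

```r
library(caret)
library(kernlab)
data(spam)
inTrain <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)
training <- spam[inTrain, ]
testing <- spam[-inTrain, ]
hist(training$capitalAve, main = "", xlab = "ave. capital run length")
```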
Why preprocess?
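The two numbers below are the mean and standard deviation of capitalAve in the training set; the variable is highly skewed:

```r
mean(training$capitalAve)
sd(training$capitalAve)
```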
[1] 4.709
[1] 25.48
Standardizing
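Standardizing by hand on the training set:

```r
trainCapAve <- training$capitalAve
trainCapAveS <- (trainCapAve - mean(trainCapAve)) / sd(trainCapAve)
mean(trainCapAveS)   # essentially 0
sd(trainCapAveS)     # exactly 1
```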
[1] 5.862e-18
[1] 1
Standardizing - test set
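The key point: standardize the test set with the training mean and standard deviation, so the test values are not exactly mean 0, sd 1:

```r
testCapAve <- testing$capitalAve
testCapAveS <- (testCapAve - mean(trainCapAve)) / sd(trainCapAve)
mean(testCapAveS)
sd(testCapAveS)
```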
[1] 0.07579
[1] 1.79
Standardizing - preProcess function
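The same standardization via caret's preProcess (column 58 of spam is the outcome, type):

```r
preObj <- preProcess(training[, -58], method = c("center", "scale"))
trainCapAveS <- predict(preObj, training[, -58])$capitalAve
mean(trainCapAveS)
sd(trainCapAveS)
```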
[1] 5.862e-18
[1] 1
Standardizing - preProcess function
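Applying the training-set preProcess object to the test set:

```r
testCapAveS <- predict(preObj, testing[, -58])$capitalAve
mean(testCapAveS)
sd(testCapAveS)
```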
[1] 0.07579
[1] 1.79
Standardizing - preProcess argument
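preProcess can also be passed directly to train():

```r
set.seed(32343)
modelFit <- train(type ~ ., data = training,
                  preProcess = c("center", "scale"), method = "glm")
modelFit
```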
3451 samples
57 predictors
2 classes: 'nonspam', 'spam'
Pre-processing: centered, scaled
Resampling: Bootstrap (25 reps)
Summary of sample sizes: 3451, 3451, 3451, 3451, 3451, 3451, ...
Resampling results
Accuracy Kappa Accuracy SD Kappa SD
0.9 0.8 0.007 0.01
Standardizing - Box-Cox transforms
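A sketch of the Box-Cox transform, which takes positive continuous data toward normality; the slide showed a histogram and Q-Q plot of the (imperfect) result:

```r
preObj <- preProcess(training[, -58], method = c("BoxCox"))
trainCapAveS <- predict(preObj, training[, -58])$capitalAve
par(mfrow = c(1, 2))
hist(trainCapAveS)
qqnorm(trainCapAveS)
```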
Standardizing - Imputing data
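A sketch of the imputation example: knock out about 5% of the values at random, then impute with k-nearest neighbours (seed arbitrary; knnImpute also centers and scales):

```r
set.seed(13343)

# Make some values NA
training$capAve <- training$capitalAve
selectNA <- rbinom(dim(training)[1], size = 1, prob = 0.05) == 1
training$capAve[selectNA] <- NA

# Impute and standardize (column 58 is still the outcome, type)
preObj <- preProcess(training[, -58], method = "knnImpute")
capAve <- predict(preObj, training[, -58])$capAve

# Standardize the true values for comparison
capAveTruth <- training$capitalAve
capAveTruth <- (capAveTruth - mean(capAveTruth)) / sd(capAveTruth)
```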
Standardizing - Imputing data
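The three quantile summaries below compare imputed to true values overall, on the imputed rows only, and on the complete rows only:

```r
quantile(capAve - capAveTruth)
quantile((capAve - capAveTruth)[selectNA])
quantile((capAve - capAveTruth)[!selectNA])
```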
0% 25% 50% 75% 100%
-1.1324388 -0.0030842 -0.0015074 -0.0007467 0.2155789
0% 25% 50% 75% 100%
-0.9243043 -0.0125489 -0.0001968 0.0194524 0.2155789
0% 25% 50% 75% 100%
-1.1324388 -0.0030033 -0.0015115 -0.0007938 -0.0001968
Notes and further reading
- Training and test sets must be processed in the same way
- Test-set transformations will likely be imperfect
  - Especially if the test and training sets were collected at different times
- Be careful when transforming factor variables!
- preprocessing with caret
Two levels of covariate creation
Level 1: From raw data to covariate
Level 2: Transforming tidy covariates
Level 1, Raw data -> covariates
- Depends heavily on application
- The balancing act is summarization vs. information loss
- Examples:
  - Text files: frequency of words, frequency of phrases (Google ngrams), frequency of capital letters
  - Images: edges, corners, blobs, ridges (computer vision feature detection)
  - Webpages: number and type of images, position of elements, colors, videos (A/B testing)
  - People: height, weight, hair color, sex, country of origin
- The more knowledge of the system you have, the better job you will do.
- When in doubt, err on the side of more features
- Can be automated, but use caution!
Level 2, Tidy covariates -> new covariates
- More necessary for some methods (regression, SVMs) than for others (classification trees).
- Should be done only on the training set
- The best approach is through exploratory analysis (plotting/tables)
- New covariates should be added to data frames
Load example data
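A sketch of the setup for this lecture, reusing the Wage split:

```r
library(ISLR)
library(caret)
data(Wage)
inTrain <- createDataPartition(y = Wage$wage, p = 0.7, list = FALSE)
training <- Wage[inTrain, ]
testing <- Wage[-inTrain, ]
```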
Common covariates to add, dummy variables
Basic idea - convert factor variables to indicator variables
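A sketch using caret's dummyVars; the indicator columns below encode jobclass:

```r
table(training$jobclass)
dummies <- dummyVars(wage ~ jobclass, data = training)
head(predict(dummies, newdata = training))
```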
1. Industrial 2. Information
1090 1012
jobclass.1. Industrial jobclass.2. Information
231655 1 0
86582 0 1
11141 0 1
229379 1 0
86064 1 0
377879 1 0
Removing zero covariates
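nearZeroVar flags variables with almost no variability (here sex and region, which take a single value in these data):

```r
nsv <- nearZeroVar(training, saveMetrics = TRUE)
nsv
```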
freqRatio percentUnique zeroVar nzv
year 1.029 0.33302 FALSE FALSE
age 1.122 2.80685 FALSE FALSE
sex 0.000 0.04757 TRUE TRUE
maritl 3.159 0.23787 FALSE FALSE
race 8.529 0.19029 FALSE FALSE
education 1.492 0.23787 FALSE FALSE
region 0.000 0.04757 TRUE TRUE
jobclass 1.077 0.09515 FALSE FALSE
health 2.452 0.09515 FALSE FALSE
health_ins 2.269 0.09515 FALSE FALSE
logwage 1.198 17.26927 FALSE FALSE
wage 1.185 18.07802 FALSE FALSE
Spline basis
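A sketch of the spline basis; df = 3 gives a cubic basis with three columns:

```r
library(splines)
bsBasis <- bs(training$age, df = 3)
head(bsBasis)
```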
1 2 3
[1,] 0.00000 0.0000000 0.000e+00
[2,] 0.23685 0.0253768 9.063e-04
[3,] 0.44309 0.2436978 4.468e-02
[4,] 0.43081 0.2910904 6.556e-02
[5,] 0.42617 0.1482327 1.719e-02
[6,] 0.41709 0.1331149 1.416e-02
Fitting curves with splines
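Fitting wage on the spline terms and overlaying the fitted curve:

```r
lm1 <- lm(wage ~ bsBasis, data = training)
plot(training$age, training$wage, pch = 19, cex = 0.5)
points(training$age, predict(lm1, newdata = training), col = "red", pch = 19)
```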
Splines on the test set
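The key point: test-set spline covariates must be created from the training basis, not re-estimated; a sketch:

```r
# Evaluate the TRAINING basis at the test-set ages
predict(bsBasis, newx = testing$age)
```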
1 2 3
[1,] 0.00000 0.0000000 0.000e+00
[2,] 0.23685 0.0253768 9.063e-04
[3,] 0.44309 0.2436978 4.468e-02
[4,] 0.43081 0.2910904 6.556e-02
[5,] 0.42617 0.1482327 1.719e-02
[6,] 0.41709 0.1331149 1.416e-02
[7,] 0.31823 0.0540390 3.059e-03
[8,] 0.36253 0.3866940 1.375e-01
[9,] 0.44436 0.2275981 3.886e-02
[10,] 0.20449 0.0179375 5.245e-04
[11,] 0.07768 0.3601465 5.566e-01
[12,] 0.13145 0.0066841 1.133e-04
[13,] 0.39290 0.1042387 9.218e-03
[14,] 0.26654 0.0339238 1.439e-03
[15,] 0.20449 0.0179375 5.245e-04
[16,] 0.29109 0.4308138 2.125e-01
[17,] 0.23685 0.0253768 9.063e-04
[18,] 0.43624 0.2755195 5.800e-02
Notes and further reading
- Level 1 feature creation (raw data to covariates)
  - Science is key. Google "feature extraction for [data type]"
  - Err on the side of creating too many features
  - In some applications (images, voices) automated feature creation is possible/necessary
- Level 2 feature creation (covariates to new covariates)
  - The preProcess function in caret will handle some preprocessing
  - Create new covariates if you think they will improve fit
  - Use exploratory analysis on the training set for creating them
  - Be careful about overfitting!
- preprocessing with caret
- If you want to fit spline models, use the gam method in the caret package, which allows smoothing of multiple variables
- More on feature creation/data tidying in the Obtaining Data course from the Data Science course track
Correlated predictors
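A sketch of the correlation screen that produced the table below (column 58 is the outcome):

```r
M <- abs(cor(training[, -58]))
diag(M) <- 0                      # ignore self-correlations
which(M > 0.8, arr.ind = TRUE)    # highly correlated pairs
```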
row col
num415 34 32
num857 32 34
Correlated predictors
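Looking up and plotting the correlated pair:

```r
names(spam)[c(34, 32)]
plot(spam[, 34], spam[, 32])
```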
[1] "num415" "num857"
Basic PCA idea
- We might not need every predictor
- A weighted combination of predictors might be better
- We should pick this combination to capture the "most information" possible
- Benefits
  - Reduced number of predictors
  - Reduced noise (due to averaging)
We could rotate the plot
$$ X = 0.71 \times {\rm num415} + 0.71 \times {\rm num857}$$
$$ Y = 0.71 \times {\rm num415} - 0.71 \times {\rm num857}$$
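A sketch of the rotation; X (the sum) carries almost all the variability, Y (the difference) almost none:

```r
X <- 0.71 * training$num415 + 0.71 * training$num857
Y <- 0.71 * training$num415 - 0.71 * training$num857
plot(X, Y)
```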
Related problems
You have multivariate variables $X_1,\ldots,X_n$ so $X_1 = (X_{11},\ldots,X_{1m})$
- Find a new set of multivariate variables that are uncorrelated and explain as much variance as possible.
- If you put all the variables together in one matrix, find the best matrix created with fewer variables (lower rank) that explains the original data.
The first goal is statistical and the second goal is data compression.
Related solutions - PCA/SVD
SVD
If $X$ is a matrix with each variable in a column and each observation in a row then the SVD is a "matrix decomposition"
$$ X = UDV^T$$
where the columns of $U$ are orthogonal (left singular vectors), the columns of $V$ are orthogonal (right singular vectors) and $D$ is a diagonal matrix (singular values).
PCA
The principal components are equal to the right singular vectors if you first scale (subtract the mean, divide by the standard deviation) the variables.
Principal components in R - prcomp
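A sketch of prcomp on just the two correlated predictors:

```r
smallSpam <- spam[, c(34, 32)]
prComp <- prcomp(smallSpam)
plot(prComp$x[, 1], prComp$x[, 2])
```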
Principal components in R - prcomp
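The rotation matrix shows how each principal component weights the original variables:

```r
prComp$rotation
```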
PC1 PC2
num415 0.7081 0.7061
num857 0.7061 -0.7081
PCA on SPAM data
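PCA on all 57 predictors; log10(x + 1) tames the skewed count variables before the decomposition:

```r
typeColor <- ((spam$type == "spam") * 1 + 1)   # 2 = spam, 1 = nonspam
prComp <- prcomp(log10(spam[, -58] + 1))
plot(prComp$x[, 1], prComp$x[, 2], col = typeColor, xlab = "PC1", ylab = "PC2")
```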
PCA with caret
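The same decomposition through caret's preProcess:

```r
preProc <- preProcess(log10(spam[, -58] + 1), method = "pca", pcaComp = 2)
spamPC <- predict(preProc, log10(spam[, -58] + 1))
plot(spamPC[, 1], spamPC[, 2], col = typeColor)
```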
Preprocessing with PCA
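A sketch of fitting on the PCA scores of the training set (using train's x/y interface, since the scores live in their own data frame):

```r
preProc <- preProcess(log10(training[, -58] + 1), method = "pca", pcaComp = 2)
trainPC <- predict(preProc, log10(training[, -58] + 1))
modelFit <- train(x = trainPC, y = training$type, method = "glm")
```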
Preprocessing with PCA
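Apply the same PCA rotation to the test set, then evaluate:

```r
testPC <- predict(preProc, log10(testing[, -58] + 1))
confusionMatrix(testing$type, predict(modelFit, testPC))
```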
Confusion Matrix and Statistics
Reference
Prediction nonspam spam
nonspam 646 51
spam 64 389
Accuracy : 0.9
95% CI : (0.881, 0.917)
No Information Rate : 0.617
P-Value [Acc > NIR] : <2e-16
Kappa : 0.79
Mcnemar's Test P-Value : 0.263
Sensitivity : 0.910
Specificity : 0.884
Pos Pred Value : 0.927
Neg Pred Value : 0.859
Prevalence : 0.617
Detection Rate : 0.562
Detection Prevalence : 0.606
'Positive' Class : nonspam
Alternative (sets # of PCs)
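Alternatively, let train() do the PCA internally via the preProcess argument:

```r
modelFit <- train(type ~ ., data = training, method = "glm", preProcess = "pca")
confusionMatrix(testing$type, predict(modelFit, testing))
```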
Confusion Matrix and Statistics
Reference
Prediction nonspam spam
nonspam 660 37
spam 54 399
Accuracy : 0.921
95% CI : (0.904, 0.936)
No Information Rate : 0.621
P-Value [Acc > NIR] : <2e-16
Kappa : 0.833
Mcnemar's Test P-Value : 0.0935
Sensitivity : 0.924
Specificity : 0.915
Pos Pred Value : 0.947
Neg Pred Value : 0.881
Prevalence : 0.621
Detection Rate : 0.574
Detection Prevalence : 0.606
'Positive' Class : nonspam
Final thoughts on PCs
- Most useful for linear-type models
- Can make it harder to interpret predictors
- Watch out for outliers!
  - Transform first (with logs/Box-Cox)
  - Plot predictors to identify problems
- For more info see
  - Exploratory Data Analysis
  - Elements of Statistical Learning
Key ideas
- Fit a simple regression model
- Plug in new covariates and multiply by the coefficients
- Useful when the linear model is (nearly) correct
Pros:
- Easy to implement
- Easy to interpret
Cons:
- Often poor performance in nonlinear settings
Example: Old faithful eruptions
Image Credit/Copyright Wally Pacholka http://www.astropics.com/
Example: Old faithful eruptions
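A sketch of the setup (seed arbitrary; faithful ships with base R):

```r
library(caret)
data(faithful)
set.seed(333)
inTrain <- createDataPartition(y = faithful$waiting, p = 0.5, list = FALSE)
trainFaith <- faithful[inTrain, ]
testFaith <- faithful[-inTrain, ]
head(trainFaith)
```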
eruptions waiting
6 2.883 55
11 1.833 54
16 2.167 52
19 1.600 52
22 1.750 47
27 1.967 55
Eruption duration versus waiting time
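This slide presumably showed the raw scatterplot; a sketch:

```r
plot(trainFaith$waiting, trainFaith$eruptions, pch = 19, col = "blue",
     xlab = "Waiting", ylab = "Duration")
```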
Fit a linear model
$$ ED_i = b_0 + b_1 WT_i + e_i $$
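Fitting the model with lm:

```r
lm1 <- lm(eruptions ~ waiting, data = trainFaith)
summary(lm1)
```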
Call:
lm(formula = eruptions ~ waiting, data = trainFaith)
Residuals:
Min 1Q Median 3Q Max
-1.2699 -0.3479 0.0398 0.3659 1.0502
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.79274 0.22787 -7.87 1e-12 ***
waiting 0.07390 0.00315 23.47 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.495 on 135 degrees of freedom
Multiple R-squared: 0.803, Adjusted R-squared: 0.802
F-statistic: 551 on 1 and 135 DF, p-value: <2e-16
Model fit
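Overlaying the fitted line on the training data:

```r
plot(trainFaith$waiting, trainFaith$eruptions, pch = 19, col = "blue",
     xlab = "Waiting", ylab = "Duration")
lines(trainFaith$waiting, lm1$fitted, lwd = 3)
```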
Predict a new value
$$\hat{ED} = \hat{b}_0 + \hat{b}_1 WT$$
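The two outputs below are the same prediction at waiting = 80, computed by hand from the coefficients and via predict() with a new data frame:

```r
coef(lm1)[1] + coef(lm1)[2] * 80
newdata <- data.frame(waiting = 80)
predict(lm1, newdata)
```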
(Intercept)
4.119
1
4.119
Plot predictions - training and test
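A sketch of the side-by-side plots:

```r
par(mfrow = c(1, 2))
plot(trainFaith$waiting, trainFaith$eruptions, pch = 19, col = "blue",
     xlab = "Waiting", ylab = "Duration", main = "Train")
lines(trainFaith$waiting, predict(lm1), lwd = 3)
plot(testFaith$waiting, testFaith$eruptions, pch = 19, col = "blue",
     xlab = "Waiting", ylab = "Duration", main = "Test")
lines(testFaith$waiting, predict(lm1, newdata = testFaith), lwd = 3)
```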
Get training set/test set errors
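The two numbers below are the root of the summed squared errors on the training and test sets; a sketch:

```r
sqrt(sum((lm1$fitted - trainFaith$eruptions)^2))                        # train
sqrt(sum((predict(lm1, newdata = testFaith) - testFaith$eruptions)^2))  # test
```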
[1] 5.752
[1] 5.839
Prediction intervals
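Prediction intervals from predict(..., interval = "prediction"), plotted with matlines:

```r
pred1 <- predict(lm1, newdata = testFaith, interval = "prediction")
ord <- order(testFaith$waiting)
plot(testFaith$waiting, testFaith$eruptions, pch = 19, col = "blue")
matlines(testFaith$waiting[ord], pred1[ord, ], type = "l",
         col = c(1, 2, 2), lty = c(1, 1, 1), lwd = 3)
```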
Same process with caret
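The same fit through caret:

```r
modFit <- train(eruptions ~ waiting, data = trainFaith, method = "lm")
summary(modFit$finalModel)
```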
Call:
lm(formula = modFormula, data = data)
Residuals:
Min 1Q Median 3Q Max
-1.2699 -0.3479 0.0398 0.3659 1.0502
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.79274 0.22787 -7.87 1e-12 ***
waiting 0.07390 0.00315 23.47 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.495 on 135 degrees of freedom
Multiple R-squared: 0.803, Adjusted R-squared: 0.802
F-statistic: 551 on 1 and 135 DF, p-value: <2e-16
Notes and further reading
- Regression models with multiple covariates can be included
- Often useful in combination with other models
- Elements of statistical learning
- Modern applied statistics with S
- Introduction to statistical learning
Example: predicting wages
Image Credit http://www.cahs-media.org/the-high-cost-of-low-wages
Data from the ISLR package, which accompanies the book Introduction to Statistical Learning
Example: Wage data
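The summary below lacks logwage, so this lecture presumably dropped it (it is a transform of the outcome); a sketch:

```r
library(ISLR)
library(ggplot2)
library(caret)
data(Wage)
Wage <- subset(Wage, select = -c(logwage))
summary(Wage)
```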
year age sex maritl race
Min. :2003 Min. :18.0 1. Male :3000 1. Never Married: 648 1. White:2480
1st Qu.:2004 1st Qu.:33.8 2. Female: 0 2. Married :2074 2. Black: 293
Median :2006 Median :42.0 3. Widowed : 19 3. Asian: 190
Mean :2006 Mean :42.4 4. Divorced : 204 4. Other: 37
3rd Qu.:2008 3rd Qu.:51.0 5. Separated : 55
Max. :2009 Max. :80.0
education region jobclass health
1. < HS Grad :268 2. Middle Atlantic :3000 1. Industrial :1544 1. <=Good : 858
2. HS Grad :971 1. New England : 0 2. Information:1456 2. >=Very Good:2142
3. Some College :650 3. East North Central: 0
4. College Grad :685 4. West North Central: 0
5. Advanced Degree:426 5. South Atlantic : 0
6. East South Central: 0
(Other) : 0
health_ins wage
1. Yes:2083 Min. : 20.1
2. No : 917 1st Qu.: 85.4
Median :104.9
Mean :111.7
3rd Qu.:128.7
Max. :318.3
Get training/test sets
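The same 70/30 split as before:

```r
inTrain <- createDataPartition(y = Wage$wage, p = 0.7, list = FALSE)
training <- Wage[inTrain, ]
testing <- Wage[-inTrain, ]
dim(testing)
```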
[1] 898 12
Feature plot
Plot age versus wage
Plot age versus wage colour by jobclass
Plot age versus wage colour by education
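The four plotting slides above presumably used code along these lines:

```r
featurePlot(x = training[, c("age", "education", "jobclass")],
            y = training$wage, plot = "pairs")
qplot(age, wage, data = training)
qplot(age, wage, colour = jobclass, data = training)
qplot(age, wage, colour = education, data = training)
```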
Fit a linear model
$$ Wage_i = b_0 + b_1 age_i + b_2 I(jobclass_i = \text{"Information"}) + \sum_{k=1}^4 \gamma_k I(education_i = \text{level } k) + e_i $$
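A sketch of the fit; R builds the indicator terms automatically from the factor variables:

```r
modFit <- train(wage ~ age + jobclass + education, method = "lm", data = training)
finMod <- modFit$finalModel
print(modFit)
```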
Linear Regression
2102 samples
11 predictors
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 2102, 2102, 2102, 2102, 2102, 2102, ...
Resampling results
RMSE Rsquared RMSE SD Rsquared SD
40 0.2 1 0.02
Education levels: 1 = HS Grad, 2 = Some College, 3 = College Grad, 4 = Advanced Degree
Diagnostics
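Residuals versus fitted values; look for structure the model missed:

```r
plot(finMod, 1, pch = 19, cex = 0.5, col = "#00000010")
```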
Color by variables not used in the model
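Coloring the residuals by a variable left out of the model (race is one candidate) can reveal a missing predictor:

```r
qplot(finMod$fitted, finMod$residuals, colour = race, data = training)
```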
Plot by index
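Residuals against row index; a trend suggests the rows are ordered by something (e.g. time) that the model ignores:

```r
plot(finMod$residuals, pch = 19)
```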
Predicted versus truth in test set
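A sketch of the check; this is post-hoc only, the test set must not drive model changes:

```r
pred <- predict(modFit, testing)
qplot(wage, pred, colour = year, data = testing)
```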
If you want to use all covariates
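A sketch of the all-covariates fit:

```r
modFitAll <- train(wage ~ ., data = training, method = "lm")
pred <- predict(modFitAll, testing)
qplot(wage, pred, data = testing)
```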
Notes and further reading
- Often useful in combination with other models
- Elements of statistical learning
- Modern applied statistics with S
- Introduction to statistical learning