Predictive Modeling With The R CARET Package

Transcription

Predictive Modeling with the R caret Package
Matthew A. Lanham, CAP (lanham@vt.edu)
Doctoral Candidate, Department of Business Information Technology
MatthewALanham.com

R

R is a software environment for data analysis, computing, and graphics.

Brief History
- 1991: R was developed by Robert Gentleman and Ross Ihaka, professors at the University of Auckland, New Zealand, who wanted a better statistical software platform for their students to use in their Macintosh teaching laboratory (Ihaka & Gentleman, 1996). They decided to model the R language after John Chambers's statistical S language.
- 1993: R was first announced to the public.
- 1995: R was made "free" under the GNU General Public License.
- 2000: R version 1.0.0 was released, and the rest is history.

About
- R is an open-source and freely accessible software language under the GNU General Public License, version 2.
- R works with Windows, Macintosh, Unix, and Linux operating systems.
- R is highly extensible, allowing researchers and practitioners to create their own custom packages and functions.
- It has a nice balance of object-oriented and functional programming constructs (Hornick & Plunkett, 2013).
- As of 2014 there were 5,800 available user-developed packages.
- R is consistently ranked as one of the top tools used by analytics professionals today.

RStudio

RStudio is an Integrated Development Environment (IDE) for R.

Revolution Analytics

Revolution Analytics offers its own free distribution of R, Revolution R Open (RRO), which runs faster.

Revolution Analytics

If you choose to use RRO, the only thing you'll really notice, aside from the increased speed, is a slightly different startup message at the console.

Predictive Modeling

Predictive analytics is the process of building a model that predicts some output or estimates some unknown parameter(s). Predictive analytics is synonymous with predictive modeling, which has associations with machine learning, pattern recognition, as well as data mining (M. Kuhn & Johnson, 2013).

Predictive analytics answers the question: what will happen?

While predictive analytics may have the terminal goal of estimating parameters as accurately as possible, it is more a model development process than a set of specific techniques (M. Kuhn & Johnson, 2013).

The CARET Package

http://topepo.github.io/caret/

"The caret package (short for Classification And REgression Training) is a set of functions that attempt to streamline the process for creating predictive models [in R]."

The package contains tools for:
- data splitting/partitioning
- pre-processing
- feature selection
- model tuning using resampling (what I'm talking about today)
- variable importance estimation

The CARET Package

There are many packages for predictive modeling. New algorithms are researched, peer-reviewed, and published every year. Often those researchers will develop a working package implementing their model for others to use on CRAN.
- Check out JSS (the Journal of Statistical Software) for the state of the art.
- Packages are updated often; you can check CRAN to see the latest contributions.
- R is already viewed as having a steep learning curve (i.e. challenging), so those attempting to teach R using several different modeling methodologies would probably have a bad experience. Using the caret package would be ideal.

Many Models with Different Syntax

In addition to the functionality provided by caret, the package is essentially a wrapper for most of the individual predictive modeling packages. Currently there are 192 different modeling packages that it includes! This means you don't have to learn the different syntax structures for each model you want to try.

This wrapper functionality is exactly what R users and teachers need, because many of the modeling packages are written by different people. There is no pre-specified design pattern, so there are inconsistencies in how one sets the functional inputs and in what is returned in the output.

Syntax examples (see the sketch below):
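The slide's syntax examples were not captured in the transcription; the following is a hedged sketch of the kind of inconsistency meant here, using a small simulated two-class data set (all object names are illustrative, not from the original slides).

    # Three packages, three different interfaces for fitting and for getting probabilities.
    set.seed(1)
    dat <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
    dat$y <- factor(ifelse(dat$x1 + rnorm(200) > 0, "yes", "no"), levels = c("yes", "no"))

    glmFit  <- glm(y ~ ., data = dat, family = binomial)        # glm: factor response OK...
    glmProb <- predict(glmFit, dat, type = "response")          # ...but returns P(second level)

    library(gbm)                                                # gbm: wants a 0/1 numeric response
    dat$y01 <- ifelse(dat$y == "yes", 1, 0)
    gbmFit  <- gbm(y01 ~ x1 + x2, data = dat,
                   distribution = "bernoulli", n.trees = 100)
    gbmProb <- predict(gbmFit, dat, type = "response", n.trees = 100)

    library(rpart)                                              # rpart: factor response,
    rpFit   <- rpart(y ~ ., data = dat, method = "class")       # probabilities via type = "prob"
    rpProb  <- predict(rpFit, dat, type = "prob")[, "yes"]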

Streamlined Modeling

With this wrapper functionality, caret allows a more streamlined predictive modeling process for evaluating many modeling methodologies. It is also a much more efficient way to tune each model's parameters.

Example

Predicting customer churn for a telecom service based on account information.
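The slides do not show the data-loading step; one common source for this churn data is the C50 package, which provides churnTrain and churnTest (an assumption on my part).

    library(C50)
    data(churn)                 # loads churnTrain and churnTest
    table(churnTrain$churn)     # "yes"/"no" churn outcome; the classes are imbalanced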

The Predictors and The Response

We keep a list of predictor names that we might need later.

caret treats the first factor level as the class being modeled. Some packages treat the second factor level as the class of interest, and this is accounted for within caret. In our case, the class probabilities predict the occurrence of the "yes" events, i.e. churns.
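A minimal sketch of this step, assuming the C50 column names from the loading step above:

    predictors <- names(churnTrain)[names(churnTrain) != "churn"]
    levels(churnTrain$churn)    # check that "yes" is the first level (the modeled event)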

Data Partitioning

Here, we are going to perform a stratified random split, so that the training and test sets both have approximately the same class imbalance as the full data.

Population (allData) → Training dataset (churnTrain) + Testing dataset (churnTest)
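A sketch of a stratified split with caret's createDataPartition(); pooling the two C50 sets into allData and the 75% split proportion are assumptions, not values from the slides.

    library(caret)
    allData <- rbind(churnTrain, churnTest)
    set.seed(1)
    inTrainingSet <- createDataPartition(allData$churn, p = 0.75, list = FALSE)
    churnTrain <- allData[ inTrainingSet, ]
    churnTest  <- allData[-inTrainingSet, ]
    prop.table(table(churnTrain$churn))   # class proportions are preserved in both splits
    prop.table(table(churnTest$churn))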

Data Pre-Processing

preProcess() calculates values that can then be applied to any data set (e.g. training set, testing set, etc.).

Typical methods (the list continues to grow):
- Centering
- Scaling
- Spatial sign transformation
- PCA
- Imputation
- Box-Cox transformation

Data Pre-Processing

procValues is a preProcess object. The preProcess function can also be called within other functions for each resampling iteration (upcoming).
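A sketch of how preProcess() is typically used: estimate the transformations on the training set, then apply them to any data set. Restricting to the numeric columns is my own illustrative choice.

    numCols     <- sapply(churnTrain, is.numeric)
    procValues  <- preProcess(churnTrain[, numCols], method = c("center", "scale"))
    trainScaled <- predict(procValues, churnTrain[, numCols])
    testScaled  <- predict(procValues, churnTest[, numCols])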

Boosted Trees – Classic adaBoost Algorithm

Boosting
Boosting is a method to "boost" weak learning algorithms (e.g. a lone classification tree) into strong learning algorithms.

Idea
Boosted trees try to improve the model fit over different trees by considering past fits (similar to iteratively reweighted least squares).

Tree Boosting Algorithm

Boosted Trees – Classic adaBoost Algorithm

Formulation
Here, the categorical response $y_i$ is coded as $\{-1, +1\}$ and the model $f_j(x)$ produces values of $\{-1, +1\}$.

Final Prediction
The final prediction is obtained by first predicting using all M trees, then weighting each prediction:

$f(x) = \frac{1}{M} \sum_{j=1}^{M} \beta_j f_j(x)$

where $f_j$ is the jth tree fit and $\beta_j$ is the stage weight for that tree. Thus, the final class is determined by the sign of the model prediction.

Lay terms
The final prediction is a weighted average of each tree's respective prediction, and the weights are set based on the quality (i.e. accuracy) of each tree.

Boosted Tree Parameters

Tuning parameters
Most implementations of boosting have three tuning parameters (there could be more):
1. Number of boosting iterations (i.e. number of trees)
2. Tree complexity (i.e. number of tree splits)
3. Shrinkage (i.e. the learning rate, or how quickly the algorithm adapts)

Boosting functions in R, for example:
- gbm() in the gbm package
- ada() in the ada package
- blackboost() in the mboost package

How do you tune them? See http://topepo.github.io/caret/Boosting.html

Using the gbm Package

The gbm() function in the gbm package can be used to fit the model directly if you like; you can then use predict.gbm() or other functions to predict and evaluate the model.

Example (a sketch follows):
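A sketch along these lines; the specific parameter values below are assumptions for illustration, not the slide's exact code.

    library(gbm)
    forGBM <- churnTrain
    forGBM$churn <- ifelse(forGBM$churn == "yes", 1, 0)   # gbm wants a 0/1 response
    gbmFit <- gbm(churn ~ ., data = forGBM,
                  distribution = "bernoulli",
                  interaction.depth = 7,      # tree depth
                  n.trees = 2000,             # boosting iterations
                  shrinkage = 0.01,           # learning rate
                  verbose = FALSE)
    gbmProb <- predict(gbmFit, newdata = churnTest, type = "response", n.trees = 2000)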

Tuning a gbm Model

The previous code assumed that we know appropriate values for the tuning parameters. Do we?

One nice approach for model tuning is resampling, which we'll do soon. We can fit models with different values of the tuning parameters to many resampled versions of the training set and estimate the performance based on the held-out samples. From this, a profile of performance can be obtained across different gbm models, and then an "optimal" set of tuning parameters can be chosen.

Idea
- The goal is to get the resampled estimate of performance.
- For 3 tuning parameters, we can set up a 3-D grid and look at the performance profile across the grid to derive what should be used as the final tuning parameters.

Tuning a gbm Model

This is basically a modified cross-validation approach.

The CARET Way - Using train()

- Using train(), we can specify the response, the predictor set, and the modeling technique we want (e.g. boosting via the gbm package).
- Since we have binary outcomes, train() models the probability of the first factor level. This is "yes" in our case for the churn data set.
- gbm knows we are doing classification because we have a binary factor. If we had a numeric response, it would assume we were doing regression-type modeling, which is also possible using gbm.
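A minimal sketch of the corresponding call (the slide's own code was not captured):

    gbmTune <- train(churn ~ ., data = churnTrain, method = "gbm")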

Passing Parameters

We can pass options through train() to gbm() using the three dots (...).
- It is a good idea to use verbose = FALSE within train() for the gbm method, because otherwise a lot of logging is printed.

To note:
- gbm() is the underlying function; train() is the top-level function.
- verbose is not an argument of train() itself; it is passed to the underlying gbm() so that its output isn't printed at every iteration.
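Continuing the sketch, the same call with verbose passed through the three dots:

    gbmTune <- train(churn ~ ., data = churnTrain,
                     method = "gbm",
                     verbose = FALSE)    # consumed by gbm(), not by train() itself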

Changing the Resampling Method

By default, train() uses the bootstrap for resampling. Repeated cross-validation ("repeatedcv") has nice bias-variance properties, so we'll use 5 repeats of 10-fold cross-validation, meaning 50 resampled data sets are being evaluated.

trainControl() allows us to easily set these parameters; we then add trControl as an argument to train().
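A sketch of switching the resampling method (10 folds is trainControl's default for cross-validation):

    ctrl <- trainControl(method = "repeatedcv", repeats = 5)   # 5 x 10-fold CV = 50 resamples
    gbmTune <- train(churn ~ ., data = churnTrain,
                     method = "gbm", verbose = FALSE,
                     trControl = ctrl)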

Different Performance Metrics

What is going to happen next is:
1. A sequence of different gbm models will be fit.
2. Performance is estimated using resampling.
3. The best gbm model is chosen based on best performance.

What do you define as best performance? The default metrics for classification problems are accuracy and Cohen's Kappa. Since the data is unbalanced, accuracy might not be the best statistic to use for evaluation.

However, suppose we wanted to estimate sensitivity, specificity, and the area under the ROC curve (and thus select the best model using the greatest AUC). We'll need to specify within train() that class probabilities are produced, then estimate these statistics (i.e. AUC) to get a ranking of the models.

Different Performance Metrics

The twoClassSummary() function defined in caret calculates the sensitivity, specificity, and AUC. Custom functions can also be used (check out ?train to learn more).

In trainControl() you can specify whether you want class probabilities; not all models will provide these, so this is really useful.

train() will want to know which metric you want to optimize, so you'll need to specify it. In this churn case, metric = "ROC". If you were building regression models, you could use metric = "RMSE".
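A sketch of those settings combined: request class probabilities, summarize with twoClassSummary(), and optimize the ROC AUC.

    ctrl <- trainControl(method = "repeatedcv", repeats = 5,
                         classProbs = TRUE,
                         summaryFunction = twoClassSummary)
    gbmTune <- train(churn ~ ., data = churnTrain,
                     method = "gbm", verbose = FALSE,
                     metric = "ROC",
                     trControl = ctrl)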

Specifying the Search Grid

By default, train() uses a minimal search grid: 3 values per tuning parameter. We can easily expand the scope of this, so our models can test more combinations of tuning parameters.

Here we'll look at (see the sketch after this list):
1. Tree depth from 1 to 7
2. Boosting iterations from 100 to 1,000
3. Learning rates: 0.01 (slow), 0.10 (fast)
4. Minimum number of observations per node: 10

When we run our experiment, we'll save these candidate settings in a data.frame.
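A sketch of the expanded grid described above; the exact sequences are assumptions based on the listed ranges.

    grid <- expand.grid(interaction.depth = 1:7,                 # tree depth
                        n.trees = seq(100, 1000, by = 100),      # boosting iterations
                        shrinkage = c(0.01, 0.1),                # learning rate
                        n.minobsinnode = 10)                     # minimum node size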

Specifying the Search Grid

Adding tuneGrid to your train() call allows you to specify all the combinations of tuning parameters you want to profile. Note that if you don't specify your own tuneGrid, it will only look at 3 different levels for each of the tuning parameters by default.
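Putting it together, the same train() call with the custom grid attached:

    set.seed(1)
    gbmTune <- train(churn ~ ., data = churnTrain,
                     method = "gbm", verbose = FALSE,
                     metric = "ROC",
                     tuneGrid = grid,
                     trControl = ctrl)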

Running our Model

Once the model completes, we can print the object's output. For each combination of tuning parameters it gives the averaged AUC. If you don't tell it otherwise, it will give you the "winner" among the set (i.e. whatever has the maximum AUC) and that model's corresponding tuning parameters.
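For example:

    gbmTune              # resampled ROC/Sens/Spec for every tuning combination
    gbmTune$bestTune     # the "winning" tuning parameters (largest resampled AUC)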

Evaluating our Results
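The slide's plot of the tuning profile was not captured; a sketch of how such a plot is produced from the train object:

    ggplot(gbmTune)      # performance profile across the tuning grid
    # plot(gbmTune)      # lattice alternative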

Prediction and Performance Assessment

predict() can be used to get results for other data sets:
- Predicted classes
- Predicted probabilities
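A sketch of both kinds of prediction on the test set:

    gbmPred  <- predict(gbmTune, newdata = churnTest)                 # predicted classes
    gbmProbs <- predict(gbmTune, newdata = churnTest, type = "prob")  # class probabilities
    head(gbmProbs)                                                    # columns "yes" / "no"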

Prediction and Performance Assessment

confusionMatrix()
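For example, comparing the predicted classes against the test-set labels:

    confusionMatrix(data = gbmPred, reference = churnTest$churn)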

Test Set ROC (using the pROC package)
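A sketch with pROC; the levels argument (control level first, then the event) is an assumption worth checking against your factor ordering.

    library(pROC)
    rocCurve <- roc(response = churnTest$churn,
                    predictor = gbmProbs[, "yes"],
                    levels = rev(levels(churnTest$churn)))   # c("no", "yes")
    auc(rocCurve)
    plot(rocCurve)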


That's it!

If you're still interested, check out http://topepo.github.io/caret/ for much more information.
