Predictive Modeling with R and the caret Package, useR! 2013

Transcription

Predictive Modeling with R and the caret Package
useR! 2013
Max Kuhn, Ph.D
Pfizer Global R&D
Groton, CT
max.kuhn@pfizer.com

Outline
- Conventions in R
- Data Splitting and Estimating Performance
- Data Pre-Processing
- Over-Fitting and Resampling
- Training and Tuning Tree Models
- Training and Tuning a Support Vector Machine
- Comparing Models
- Parallel Processing

Predictive Modeling

Predictive modeling (aka machine learning) (aka pattern recognition) (...) aims to generate the most accurate estimates of some quantity or event. These models are not generally meant to be descriptive and are usually not well-suited for inference.

Good discussions of the contrast between predictive and descriptive/inferential models can be found in Shmueli (2010) and Breiman (2001).

Frank Harrell's Design package is very good for modern approaches to interpretable models, such as Cox's proportional hazards model or ordinal logistic regression.

Hastie et al (2009) is a good reference for theoretical descriptions of these models, while Kuhn and Johnson (2013) focus on the practice of predictive modeling (and uses R).

Modeling Conventions in R

The Formula Interface

There are two main conventions for specifying models in R: the formula interface and the non-formula (or "matrix") interface.

For the former, the predictors are explicitly listed in an R formula that looks like: outcome ~ var1 + var2 + ...

For example, the formula

    modelFunction(price ~ numBedrooms + numBaths + acres,
                  data = housingData)

would predict the closing price of a house using three quantitative characteristics.

The Formula Interface

The shortcut y ~ . can be used to indicate that all of the columns in the data set (except y) should be used as a predictor.

The formula interface has many conveniences. For example, transformations, such as log(acres), can be specified in-line.

It also automatically converts factor predictors into dummy variables (using a less than full rank encoding). For some R functions (e.g. klaR:::NaiveBayes, rpart:::rpart, C50:::C5.0, ...), predictors are kept as factors.

Unfortunately, R does not efficiently store the information about the formula. Using this interface with data sets that contain a large number of predictors may unnecessarily slow the computations.

The Matrix or Non-Formula Interface

The non-formula interface specifies the predictors for the model using a matrix or data frame (all the predictors in the object are used in the model). The outcome data are usually passed into the model as a vector object. For example:

    modelFunction(x = housePredictors, y = price)

In this case, transformations of data or dummy variables must be created prior to being passed to the function.

Note that not all R functions have both interfaces.

Building and Predicting Models

Modeling in R generally follows the same workflow:

1. Create the model using the basic function:
       fit <- knn(trainingData, outcome, k = 5)
2. Assess the properties of the model using print, plot, summary or other methods.
3. Predict outcomes for samples using the predict method:
       predict(fit, newSamples)

The model can be used for prediction without changing the original model object.
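As a runnable sketch of these three steps (the knn() call above is schematic; here caret's knn3() and the built-in iris data stand in for it):

    library(caret)   # provides knn3() and createDataPartition()
    set.seed(1)
    inTrain <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
    # 1. create the model
    fit <- knn3(Species ~ ., data = iris[inTrain, ], k = 5)
    # 2. assess its properties (print method)
    fit
    # 3. predict new samples; the model object is unchanged
    predict(fit, iris[-inTrain, ], type = "class")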

Model Function Consistency

Since there are many modeling packages written by different people, there are some inconsistencies in how models are specified and predictions are made. For example, many models have only one method of specifying the model (e.g. formula method only).

Generating Class Probabilities Using Different Packages

    Package    predict Function Syntax
    MASS       predict(obj) (no options needed)
    stats      predict(obj, type = "response")
    gbm        predict(obj, type = "response", n.trees)
    mda        predict(obj, type = "posterior")
    rpart      predict(obj, type = "prob")
    RWeka      predict(obj, type = "probability")
    caTools    predict(obj, type = "raw", nIter)

type "what?" (Per Package)Max Kuhn (Pfizer)Predictive Modeling11 / 126The caret PackageThe caret package was developed to:create a unified interface for modeling and prediction (interfaces to147 models)streamline model tuning using resamplingprovide a variety of “helper” functions and classes for day–to–daymodel building tasksincrease computational efficiency using parallel processingFirst commits within Pfizer: 6/2005First version on CRAN: 10/2007Website/detailed help pages: http://caret.r-forge.r-project.orgJSS Paper: http://www.jstatsoft.org/v28/i05/paperApplied Predictive Modeling Blog: http://appliedpredictivemodeling.com/Max Kuhn (Pfizer)Predictive Modeling12 / 126

Illustrative Data: Image Segmentation

We'll use data from Hill et al (2007) to model how well cells in an image are segmented (i.e. identified) in "high content screening" (Abraham et al, 2004).

Cells can be stained to bind to certain components of the cell (e.g. the nucleus) and fixed in a substance that preserves the natural state of the cell. The sample is then interrogated by an instrument (such as a confocal microscope) where the dye deflects light and the detectors quantify the degree of scattering for that specific wavelength. If multiple characteristics of the cells are desired, then multiple dyes and multiple light frequencies can be used simultaneously. The light scattering measurements are then processed through imaging software to quantify the desired cell characteristics.

Illustrative Data: Image Segmentation

In these images, the bright green boundaries identify the cell nucleus, while the blue boundaries define the cell perimeter. Clearly some cells are well-segmented, meaning that they have an accurate assessment of the location and size of the cell. Others are poorly segmented.

If cell size, shape, and/or quantity are the endpoints of interest in a study, then it is important that the instrument and imaging software can correctly segment cells.

Given a set of image measurements, how well can we predict which cells are well-segmented (WS) or poorly-segmented (PS)?

Illustrative Data: Image Segmentation

[Figure: example images of well-segmented and poorly-segmented cells]

Illustrative Data: Image Segmentation

The authors scored 2019 cells into these two bins. They used four stains to highlight the cell body, the cell nucleus, actin and tubulin (parts of the cytoskeleton). These correspond to different optical channels (e.g. channel 3 measures actin filaments).

The data are in the caret package. The authors designated a training set (n = 1009) and a test set (n = 1010).

Illustrative Data: Image Segmentation

    data(segmentationData)
    # get rid of the cell identifier
    segmentationData$Cell <- NULL

    training <- subset(segmentationData, Case == "Train")
    testing  <- subset(segmentationData, Case == "Test")
    training$Case <- NULL
    testing$Case  <- NULL

    str(training[, 1:6])

    'data.frame': 1009 obs. of 6 variables:
     $ Class      : Factor w/ 2 levels "PS","WS": 1 2 1 2 1 1 1 2 2 2 ...
     $ AngleCh1   : num 133.8 106.6 69.2 109.4 104.3 ...
     $ AreaCh1    : int 819 431 298 256 258 358 158 315 246 223 ...
     $ AvgIntenCh1: num 31.9 28 19.5 18.8 17.6 ...
     $ AvgIntenCh2: num 207 116 102 127 125 ...
     $ AvgIntenCh3: num 69.9 63.9 28.2 13.6 22.5 ...

Since channel 1 is the cell body, AreaCh1 measures the size of the cell.

Data Splitting and Estimating Performance

Model Building Steps

Common steps during model building are:
- estimating model parameters (i.e. training models)
- determining the values of tuning parameters that cannot be directly calculated from the data
- calculating the performance of the final model that will generalize to new data

How do we "spend" the data to find an optimal model? We typically split data into training and test data sets:
- Training Set: these data are used to estimate model parameters and to pick the values of the complexity parameter(s) for the model.
- Test Set (aka validation set): these data can be used to get an independent assessment of model efficacy. They should not be used during model training.

Spending Our Data

The more data we spend, the better estimates we'll get (provided the data is accurate). Given a fixed amount of data,
- too much spent in training won't allow us to get a good assessment of predictive performance. We may find a model that fits the training data very well, but is not generalizable (over-fitting).
- too much spent in testing won't allow us to get a good assessment of model parameters.

Statistically, the best course of action would be to use all the data for model building and use statistical methods to get good estimates of error. From a non-statistical perspective, many consumers of these models emphasize the need for an untouched set of samples to evaluate performance.

Spending Our Data

There are a few different ways to do the split: simple random sampling, stratified sampling based on the outcome, by date, and methods that focus on the distribution of the predictors.

The base R function sample can be used to create a completely random sample of the data. The caret package has a function createDataPartition that conducts data splits within groups of the data. For classification, this means sampling within the classes so as to preserve the distribution of the outcome in the training and test sets. For regression, the function determines the quartiles of the data set and samples within those groups.

Estimating Performance

Later, once you have a set of predictions, various metrics can be used to evaluate performance. For regression models:
- R^2 is very popular. In many complex models, the notion of the model degrees of freedom is difficult. Unadjusted R^2 can be used, but does not penalize complexity. (caret:::Rsquared, pls:::R2)
- the root mean square error is a common metric for understanding the performance (caret:::RMSE, pls:::RMSEP)
- Spearman's correlation may be applicable for models that are used to rank samples (cor(, method = "spearman"))

Of course, honest estimates of these statistics cannot be obtained by predicting the same samples that were used to train the model. A test set and/or resampling can provide good estimates.
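A minimal sketch of a stratified split with createDataPartition (the segmentation data already ship with a designated split, so this is purely illustrative):

    library(caret)
    data(segmentationData)
    set.seed(1)
    # sampling is done within each class, preserving the PS/WS proportions
    inTrain <- createDataPartition(segmentationData$Class, p = 0.5, list = FALSE)
    head(inTrain)   # row indices selected for the training set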

Estimating Performance For Classification

For classification models:
- overall accuracy can be used, but this may be problematic when the classes are not balanced.
- the Kappa statistic takes into account the expected error rate:

      Kappa = (O - E) / (1 - E)

  where O is the observed accuracy and E is the expected accuracy under chance agreement (psych:::cohen.kappa, vcd:::Kappa, ...)
- For 2-class models, Receiver Operating Characteristic (ROC) curves can be used to characterize model performance (more later).

Estimating Performance For Classification

A "confusion matrix" is a cross-tabulation of the observed and predicted classes. R functions for confusion matrices are in the e1071 package (the classAgreement function), the caret package (confusionMatrix), the mda package (confusion) and others.

ROC curve functions are found in the ROCR package (performance), the verification package (roc.area), the pROC package (roc) and others. We'll use the confusionMatrix function and the pROC package later in this class.
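To see what confusionMatrix reports (accuracy, Kappa and more), a sketch on simulated class vectors (the data here are made up purely to show the output):

    library(caret)
    set.seed(1)
    # simulated observed and predicted classes
    obs  <- factor(sample(c("PS", "WS"), 100, replace = TRUE, prob = c(0.6, 0.4)))
    pred <- factor(sample(c("PS", "WS"), 100, replace = TRUE, prob = c(0.6, 0.4)))
    confusionMatrix(data = pred, reference = obs)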

Estimating Performance For Classification

For 2-class classification models we might also be interested in:
- Sensitivity: given that a result is truly an event, what is the probability that the model will predict an event?
- Specificity: given that a result is truly not an event, what is the probability that the model will predict a negative result?

(an "event" is really the event of interest)

These conditional probabilities are directly related to the false positive and false negative rates of a method. Unconditional probabilities (the positive-predictive values and negative-predictive values) can be computed, but require an estimate of the overall event rate in the population of interest (aka the prevalence).

Estimating Performance For Classification

For our example, let's choose the event to be the poor segmentation (PS):

    Sensitivity = (# PS predicted to be PS) / (# true PS)
    Specificity = (# true WS predicted to be WS) / (# true WS)

The caret package has functions called sensitivity and specificity.
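Continuing the simulated pred/obs vectors from the sketch above, with "PS" as the event of interest:

    library(caret)
    sensitivity(data = pred, reference = obs, positive = "PS")
    specificity(data = pred, reference = obs, negative = "WS")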

ROC Curve

With two classes, the Receiver Operating Characteristic (ROC) curve can be used to estimate performance using a combination of sensitivity and specificity.

Given the probability of an event, many alternative cutoffs can be evaluated (instead of just a 50% cutoff). For each cutoff, we can calculate the sensitivity and specificity.

The ROC curve plots the sensitivity (i.e. the true positive rate) against one minus the specificity (i.e. the false positive rate). The area under the ROC curve is a common metric of performance.

[Figure: "ROC Curve From Class Probabilities" — sensitivity vs. specificity, with the cutoffs 0.250 (0.577, 0.919), 0.500 (0.670, 0.847) and 0.750 (0.897, 0.595) marked on the curve]
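A sketch of computing such a curve with the pROC package; segProbs is a hypothetical vector of predicted PS probabilities for the test set (it is not computed in the slides above):

    library(pROC)
    rocCurve <- roc(response = testing$Class, predictor = segProbs,
                    levels = rev(levels(testing$Class)))
    auc(rocCurve)    # area under the ROC curve
    plot(rocCurve)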

Data Pre-Processing

Pre-Processing the Data

There is a wide variety of models in R. Some models make different assumptions about the predictor data and may need the data to be pre-processed. For example, methods that use the inverse of the predictor cross-product matrix (i.e. (X'X)^-1) may require the elimination of collinear predictors. Others may need the predictors to be centered and/or scaled, etc.

If any data processing is required, it is a good idea to base these calculations on the training set, then apply them to any data set used for model building or prediction.

Pre-Processing the Data

Examples of pre-processing operations:
- centering and scaling
- imputation of missing data
- transformations of individual predictors (e.g. Box-Cox transformations of the predictors)
- transformations of groups of predictors, such as the "spatial-sign" transformation (i.e. x* = x / ||x||)
- feature extraction via PCA or ICA

Centering, Scaling and Transforming

There are a few different functions for data processing in R:
- scale in base R
- ScaleAdv in pcaPP
- stdize in pls
- preProcess in caret
- normalize in sparseLDA

The first three functions do simple centering and scaling. preProcess can do a variety of techniques, so we'll look at this in more detail.

Centering and Scaling

The input is a matrix or data frame of predictor data. Once the values are calculated, the predict method can be used to do the actual data transformations.

First, estimate the standardization parameters:

    trainX <- training[, names(training) != "Class"]
    ## Methods are "BoxCox", "YeoJohnson", "center", "scale",
    ## "range", "knnImpute", "bagImpute", "pca", "ica" and
    ## "spatialSign"
    preProcValues <- preProcess(trainX, method = c("center", "scale"))
    preProcValues

    Call:
    preProcess.default(x = trainX, method = c("center", "scale"))

    Created from 1009 samples and 58 variables
    Pre-processing: centered, scaled

Apply them to the data sets:

    scaledTrain <- predict(preProcValues, trainX)

Pre-Processing and Resampling

To get honest estimates of performance, all data transformations should be included within the cross-validation loop. This would be especially true for feature selection as well as pre-processing techniques (e.g. imputation, PCA, etc).

One function considered later, train, can apply preProcess within resampling loops.
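A sketch of what that looks like (the knn choice and control settings are illustrative, not from the slides): train re-estimates the centering and scaling parameters inside every resample, then applies them to the corresponding hold-out samples.

    set.seed(1)
    knnFit <- train(Class ~ ., data = training, method = "knn",
                    preProcess = c("center", "scale"),
                    trControl = trainControl(method = "repeatedcv", repeats = 5))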

Over-Fitting and Resampling

Over-Fitting

Over-fitting occurs when a model inappropriately picks up on trends in the training set that do not generalize to new samples. When this occurs, assessments of the model based on the training set can show good performance that does not reproduce in future samples.

Some models have specific "knobs" to control over-fitting:
- neighborhood size in nearest neighbor models is an example
- the number of splits in a tree model

Often, poor choices for these parameters can result in over-fitting.

For example, the next slide shows a data set with two predictors. We want to be able to produce a line (i.e. decision boundary) that differentiates two classes of data. Two model fits are shown; one over-fits the training data.

The Data

[Figure: training data for two classes plotted against Predictor A and Predictor B]

Two Model Fits

[Figure: the same data with the decision boundaries of Model #1 and Model #2; one boundary over-fits the training data]

Characterizing Over-Fitting Using the Training Set

One obvious way to detect over-fitting is to use a test set. However, repeated "looks" at the test set can also lead to over-fitting. Resampling the training samples allows us to know when we are making poor choices for the values of these parameters (the test set is not used).

Resampling methods try to "inject variation" in the system to approximate the model's performance on future samples. We'll walk through several types of resampling methods for training set samples.

K-Fold Cross-Validation

Here, we randomly split the data into K distinct blocks of roughly equal size.
1. We leave out the first block of data and fit a model.
2. This model is used to predict the held-out block.
3. We continue this process until we've predicted all K held-out blocks.

The final performance is based on the hold-out predictions.

K is usually taken to be 5 or 10, and leave-one-out cross-validation has each sample as a block. Repeated K-fold CV creates multiple versions of the folds and aggregates the results (I prefer this method).

caret:::createFolds, caret:::createMultiFolds
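A quick sketch of generating the fold indices (the k and times values here are arbitrary):

    library(caret)
    set.seed(1)
    # ten folds, repeated five times; each element holds the rows used for
    # model fitting (the held-out fold is the complement)
    cvSplits <- createMultiFolds(training$Class, k = 10, times = 5)
    length(cvSplits)   # 50 resamples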

K-Fold Cross-Validation

[Figure: schematic of K-fold cross-validation]

Repeated Training/Test Splits
(aka leave-group-out cross-validation)

A random proportion of data (say 80%) is used to train a model while the remainder is used for prediction. This process is repeated many times and the average performance is used. These splits can also be generated using stratified sampling.

With many iterations (20 to 100), this procedure has smaller variance than K-fold CV, but is likely to be biased.

caret:::createDataPartition

Repeated Training/Test Splits

[Figure: schematic of repeated training/test splits]

Bootstrapping

Bootstrapping takes a random sample with replacement. The random sample is the same size as the original data set. Samples may be selected more than once, and each sample has a 63.2% chance of showing up at least once. Some samples won't be selected, and these samples will be used to predict performance.

The process is repeated multiple times (say 30-100). This procedure also has low variance but non-zero bias when compared to K-fold CV.

sample, caret:::createResample
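A sketch of generating bootstrap resamples with caret (30 resamples, matching the range above):

    library(caret)
    set.seed(1)
    bootSplits <- createResample(training$Class, times = 30)
    # rows never drawn into a given resample form its hold-out ("out-of-bag") set
    holdout1 <- setdiff(seq_along(training$Class), bootSplits[[1]])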

Bootstrapping

[Figure: schematic of bootstrap resampling]

The Big Picture

We think that resampling will give us honest estimates of future performance, but there is still the issue of which model to select.

One algorithm to select models:

    Define sets of model parameter values to evaluate
    for each parameter set do
        for each resampling iteration do
            Hold out specific samples
            Fit the model on the remainder
            Predict the hold-out samples
        end
        Calculate the average performance across hold-out predictions
    end
    Determine the optimal parameter set

K-Nearest Neighbors Classification

[Figure: two-class data plotted against Predictor A and Predictor B]

The Big Picture - KNN Example

Using k-nearest neighbors as an example:

    Randomly put samples into 10 distinct groups
    for i = 1 ... 30 do
        Create a bootstrap sample
        Hold out data not in the sample
        for k = 1, 3, ..., 29 do
            Fit the model on the bootstrapped sample
            Predict the i-th holdout and save results
        end
        Calculate the average accuracy across the 30 hold-out sets of predictions
    end
    Determine k based on the highest cross-validated accuracy
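The same search expressed with caret's train function (a sketch with illustrative settings; train automates the loop above):

    set.seed(1)
    knnTune <- train(Class ~ ., data = training, method = "knn",
                     tuneGrid = data.frame(k = seq(1, 29, by = 2)),
                     trControl = trainControl(method = "boot", number = 30))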

The Big Picture - KNN Example

[Figure: holdout accuracy profile across the candidate values of k (#Neighbors)]

A General Strategy

There is usually an inverse relationship between model flexibility/power and interpretability. In the best case, we would like a parsimonious and interpretable model that has excellent performance. Unfortunately, that is not usually realistic.

One strategy:
1. start with the most powerful black-box type models
2. get a sense of the best possible performance
3. then fit more simplistic/understandable models
4. evaluate the performance cost of using a simpler model

Training and Tuning Tree Models

Classification Trees

A classification tree searches through each predictor to find a value of a single variable that best splits the data into two groups. Typically, the best split minimizes the impurity of the outcome in the resulting data subsets.

For the two resulting groups, the process is repeated until a hierarchical structure (a tree) is created. In effect, trees partition the X space into rectangular sections that assign a single value to samples within each rectangle.

An Example: The First Split

[Figure: tree with a single split on TotalIntenCh2 < 45324.5; Node 2 (n = 454) is almost entirely PS, Node 3 (n = 555) is a mix of PS and WS]

The Next Round of Splitting

[Figure: the tree after the second round of splitting; the left branch splits on IntenCoocASMCh3 < 0.60218 and the right branch on FiberWidthCh1 < 9.67325, giving terminal nodes of n = 447, 7, 154 and 401]

An Example

There are many tree-based packages in R. The main packages for fitting single trees are rpart, RWeka, evtree, C50 and party. rpart fits the classical "CART" models of Breiman et al (1984).

To obtain a shallow tree with rpart:

    library(rpart)
    rpart1 <- rpart(Class ~ ., data = training,
                    control = rpart.control(maxdepth = 2))
    rpart1

    n= 1009

    node), split, n, loss, yval, (yprob)
          * denotes terminal node

    1) root 1009 373 PS (0.63032706 0.36967294)
      2) TotalIntenCh2< 45324.5 454  34 PS (0.92511013 0.07488987)
        4) IntenCoocASMCh3< 0.6021832 447  27 PS (0.93959732 0.06040268) *
        5) IntenCoocASMCh3>=0.6021832   7   0 WS (0.00000000 1.00000000) *
      3) TotalIntenCh2>=45324.5 555 216 WS (0.38918919 0.61081081)
        6) FiberWidthCh1< 9.673245 154  47 PS (0.69480519 0.30519481) *
        7) FiberWidthCh1>=9.673245 401 109 WS (0.27182045 0.72817955) *

Visualizing the Tree

The rpart package has functions plot.rpart and text.rpart to visualize the final tree. The partykit package (at r-forge.r-project.org) also has enhanced plotting functions for recursive partitioning. We can convert the rpart object to a new class called party and plot it to see more in the terminal nodes:

    library(partykit)
    rpart1a <- as.party(rpart1)
    plot(rpart1a)
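Tying back to the earlier table of predict syntaxes, class probabilities for this tree come from type = "prob" (a sketch using the testing set defined earlier):

    rpartProbs <- predict(rpart1, newdata = testing, type = "prob")
    head(rpartProbs)   # one column of probabilities per class (PS, WS)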

A Shallow rpart Tree Using the party Package

[Figure: the shallow tree plotted via partykit; terminal nodes show the class distributions]

Tree Fitting Process

Splitting would continue until some criterion for stopping is met, such as the minimum number of observations in a node. The largest possible tree may over-fit, and "pruning" is the process of iteratively removing terminal nodes and watching the changes in resampling performance (usually 10-fold CV).

There are many possible pruning paths: how many possible trees are there with 6 terminal nodes?

Trees can be indexed by their maximum depth, and the classical CART methodology uses a cost-complexity parameter (Cp) to determine the best tree depth.

The Final Tree

Previously, we told rpart to use a maximum of two splits. By default, rpart will conduct as many splits as possible, then use 10-fold cross-validation to prune the tree. Specifically, the "one SE" rule is used: estimate the standard error of performance for each tree size, then choose the simplest tree within one standard error of the absolute best tree size.

    rpartFull <- rpart(Class ~ ., data = training)

Tree Growing and Pruning

[Figure: three panels showing the full tree, the tree at the start of pruning, and a tree pruned too much, each paired with its cross-validated accuracy profile across values of cp]
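rpart stores the cross-validation results used for pruning; a quick way to inspect them (a sketch, assuming rpartFull from above):

    printcp(rpartFull)   # one row per candidate cp; xerror/xstd drive the one-SE rule
    plotcp(rpartFull)    # cross-validated error plotted against cp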

The Final Tree

    rpartFull

    n= 1009

    node), split, n, loss, yval, (yprob)
          * denotes terminal node

     1) root 1009 373 PS (0.63032706 0.36967294)
       2) TotalIntenCh2< 45324.5 454  34 PS (0.92511013 0.07488987)
         4) IntenCoocASMCh3< 0.6021832 447  27 PS (0.93959732 0.06040268) *
         5) IntenCoocASMCh3>=0.6021832   7   0 WS (0.00000000 1.00000000) *
       3) TotalIntenCh2>=45324.5 555 216 WS (0.38918919 0.61081081)
         6) FiberWidthCh1< 9.673245 154  47 PS (0.69480519 0.30519481)
          12) AvgIntenCh1< 323.9243 139  33 PS (0.76258993 0.23741007) *
          13) AvgIntenCh1>=323.9243  15   1 WS (0.06666667 0.93333333) *
         7) FiberWidthCh1>=9.673245 401 109 WS (0.27182045 0.72817955)
          14) ConvexHullAreaRatioCh1>=1.173618  63  26 PS (0.58730159 0.41269841)
            28) VarIntenCh4< 172.0165  19   2 PS (0.89473684 0.10526316) *
            29) VarIntenCh4>=172.0165  44  20 WS (0.45454545 0.54545455)
              58) KurtIntenCh3< 4.05017  24   8 PS (0.66666667 0.33333333) *
              59) KurtIntenCh3>=4.05017  20   4 WS (0.20000000 0.80000000) *
          15) ConvexHullAreaRatioCh1< 1.173618 338  72 WS (0.21301775 0.78698225)
            30) ShapeP2ACh1>=1.304052 179  53 WS (0.29608939 0.70391061)
              60) AvgIntenCh4< 375.205  17   2 PS (0.88235294 0.11764706) *
              61) AvgIntenCh4>=375.205 162  38 WS (0.23456790 0.76543210)
               122) LengthCh1< 20.92921  10   3 PS (0.70000000 0.30000000) *
               123) LengthCh1>=20.92921 152  31 WS (0.20394737 0.79605263)
                 246) NeighborMinDistCh1< 22.02943  32  14 WS (0.43750000 0.56250000)
                   492) AvgIntenCh1< 110.6452  16   3 PS (0.81250000 0.18750000) *
                   493) AvgIntenCh1>=110.6452  16   1 WS (0.06250000 0.93750000) *
                 247) NeighborMinDistCh1>=22.02943 120  17 WS (0.14166667 0.85833333) *
            31) ShapeP2ACh1< 1.304052 159  19 WS (0.11949686 0.88050314) *

The Final rpart Tree

[Figure: the final pruned tree plotted with partykit; splits on TotalIntenCh2, IntenCoocASMCh3, FiberWidthCh1, AvgIntenCh1, ConvexHullAreaRatioCh1, VarIntenCh4, ShapeP2ACh1, KurtIntenCh3 and AvgIntenCh4]