Data-driven Methods For Predictive Modelling - Cornell University

Transcription

Data-driven methods for predictive modelling

D G Rossiter
Cornell University, Soil & Crop Sciences Section
Nanjing Normal University, Geographic Sciences Department

April 9, 2020

Outline

1 Modelling cultures
    Explanation vs. prediction
    Data-driven (algorithmic) methods
2 Classification & Regression Trees (CART)
    Regression trees
    Sensitivity of Regression Trees
    Classification trees
3 Random forests
    Bagging and bootstrapping
    Building a random forest
    Variable importance
    Random forests for categorical variables
    Predictor selection
4 Cubist
5 Model tuning
6 Spatial random forests
7 Data-driven vs. model-driven methods


Statistical modelling

 - Statistics starts with data: something we have measured
 - Data is generated by some (unknown) mechanism: input (stimulus) x, output (response) y
 - Before analysis this is a black box to us; we only have the data itself
 - Two goals of analysis:
   1 Prediction of future responses, given known inputs
   2 Explanation: understanding of what is in the "black box" (i.e., make it "white" or at least "some shade of grey")

Modelling cultures

Data modelling (also called "model-based"):
 - assume an empirical-statistical (stochastic) data model for the inside of the black box, e.g., a functional form such as multiple linear, exponential, hierarchical ...
 - parameterize the model from the data
 - evaluate the model using model diagnostics

Algorithmic modelling (also called "data-driven"):
 - find an algorithm that produces y given x
 - evaluate by predictive accuracy (note: not internal accuracy)

Reference: Breiman, L. (2001). Statistical Modeling: The Two Cultures (with comments and a rejoinder by the author). Statistical Science, 16(3), 199–231. https://doi.org/10.1214/ss/1009213726

Explanation vs. prediction

Explanation:
 - Testing a causal theory – why are things the way they are?
 - Emphasis is on correct model specification and coefficient estimation
 - Uses conceptual variables based on theory, which are represented by measurable variables

Prediction:
 - Predicting new (space, members of a population) or future (time) observations
 - Uses measurable variables only, no need for concepts

Reference: Shmueli, G. (2010). To Explain or to Predict? Statistical Science, 25(3), 289–310. https://doi.org/10.1214/10-STS330

Bias/variance tradeoff

The expected prediction error (EPE) for a new observation with value x is:

$$\mathrm{EPE} = E\{Y - \hat{f}(x)\}^2 = E\{Y - f(x)\}^2 + \{E[\hat{f}(x)] - f(x)\}^2 + E\{\hat{f}(x) - E[\hat{f}(x)]\}^2 = \mathrm{Var}(Y) + \mathrm{Bias}^2 + \mathrm{Var}(\hat{f}(x))$$

 - Model variance: residual error with perfect model specification (i.e., noise in the relation)
 - Bias: mis-specification of the statistical model: $\hat{f}(x) \neq f(x)$
 - Estimation variance: the result of using a sample to estimate f as $\hat{f}(x)$

Bias/variance tradeoff: explanation vs. prediction

Explanation:
 - Bias should be minimized
 - correct model specification and correct coefficients lead to correct conclusions about the theory (e.g., a causal relation)

Prediction:
 - Total EPE should be minimized
 - accept some bias if that reduces the estimation variance
 - a simpler model (omitting less important predictors) often predicts better, even though it fits the calibration data less closely

When does an underspecified model predict better than a full model?

 - the data are very noisy (large σ);
 - the true absolute values of the left-out parameters are small;
 - the predictors are highly correlated; and
 - the sample size is small or the range of the left-out variables is narrow.
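
A small, purely hypothetical R simulation (not part of the lecture) can illustrate these conditions: the response is very noisy, x2 is highly correlated with x1, the true coefficients of x3 and x4 are small, and the sample is small. Under these assumptions the underspecified model usually has a lower test-set RMSE than the full model:

```r
## Hypothetical simulation (illustration only): noisy response, correlated
## predictors, weak extra predictors, small sample.
set.seed(42)
n  <- 30
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.2)          # highly correlated with x1
x3 <- rnorm(n); x4 <- rnorm(n)               # their true coefficients are small
y  <- 2 * x1 + 0.1 * x3 + 0.1 * x4 + rnorm(n, sd = 3)   # very noisy (large sigma)
d  <- data.frame(y, x1, x2, x3, x4)

## a large independent test set generated by the same mechanism
m  <- 5000
t1 <- rnorm(m); t2 <- 0.9 * t1 + rnorm(m, sd = 0.2); t3 <- rnorm(m); t4 <- rnorm(m)
test <- data.frame(y = 2 * t1 + 0.1 * t3 + 0.1 * t4 + rnorm(m, sd = 3),
                   x1 = t1, x2 = t2, x3 = t3, x4 = t4)

rmse <- function(fit) sqrt(mean((test$y - predict(fit, test))^2))
rmse(lm(y ~ x1 + x2 + x3 + x4, data = d))    # full (correctly specified) model
rmse(lm(y ~ x1, data = d))                   # underspecified model: usually lower RMSE here
```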

Problems with data modelling

 - Mosteller and Tukey (1977): "The whole area of guided regression [an example of model-based inference] is fraught with intellectual, statistical, computational, and subject matter difficulties."
 - It seems we understand nature if we fit a model form, but in fact our conclusions are about the model's mechanism, not necessarily about nature's mechanism.
 - So, if the model is a poor emulation of nature, the conclusions about nature may be wrong ...
 - ... and of course the predictions may be wrong – we are incorrectly extrapolating.

The philosophy of data-driven methods

 - Also called "statistical learning" or "machine learning"
 - Build structures to represent the "black box" without using a statistical model
 - Model quality is evaluated by predictive accuracy on test sets covering the target population
 - cross-validation methods can use (part of) the original data set if an independent test set is not available
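
As a minimal sketch of this idea (an illustration only, in base R, using the built-in mtcars data and a linear model as a stand-in for any "black box"):

```r
## 10-fold cross-validation using only the original data set: each observation
## is predicted by a model fit with its fold held out.
set.seed(1)
folds   <- sample(rep(1:10, length.out = nrow(mtcars)))    # assign each row to a fold
cv_pred <- numeric(nrow(mtcars))
for (k in 1:10) {
  fit <- lm(mpg ~ wt + hp, data = mtcars[folds != k, ])    # fit on the other nine folds
  cv_pred[folds == k] <- predict(fit, mtcars[folds == k, ])
}
sqrt(mean((mtcars$mpg - cv_pred)^2))   # cross-validated RMSE: an estimate of predictive accuracy
```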

Some data-driven methods

1 Covered in this lecture:
   - Classification & Regression Trees (CART)
   - Random Forests (RF)
   - Cubist
2 Others:
   - Artificial Neural Networks (ANN)
   - Support Vector Machines
   - Gradient Boosting

Key references – texts

 - Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The elements of statistical learning: data mining, inference, and prediction (2nd ed.). New York: Springer. https://doi.org/10.1007/978-0-387-84858-7
 - James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning: with applications in R. New York: Springer. https://doi.org/10.1007/978-1-4614-7138-7
 - Statistical Learning on-line course (based on James et al.): .../HumanitiesSciences/StatLearning/Winter2016/about
 - Kuhn, M., & Johnson, K. (2013). Applied Predictive Modeling. New York: Springer. https://doi.org/10.1007/978-1-4614-6849-3

Key references – papers

 - Shmueli, G. (2010). To Explain or to Predict? Statistical Science, 25(3), 289–310. https://doi.org/10.1214/10-STS330
 - Breiman, L. (2001). Statistical Modeling: The Two Cultures (with comments and a rejoinder by the author). Statistical Science, 16(3), 199–231. https://doi.org/10.1214/ss/1009213726
 - Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32. https://doi.org/10.1023/A:1010933404324
 - Kuhn, M. (2008). Building Predictive Models in R Using the caret Package. Journal of Statistical Software, 28(5), 1–26.


Decision trees

 - Typical uses in diagnostics (medical, automotive ...)
 - Begin with the full set of possible decisions
 - Split into two (binary) subsets based on the values of some decision criterion
 - Each branch has a more limited set of decisions, or at least has more information to help make a decision
 - Continue recursively on both branches until there is enough information to make a decision

Classification & Regression Trees

 - A type of decision tree; the decision is "what is the predicted response, given values of the predictors?"
 - Aim is to predict the response (target) variable from one or more predictor variables
 - If the response is categorical (class, factor) we build a classification tree
 - If the response is continuous we build a regression tree
 - Predictors can be any combination of categorical or continuous

Advantages of CART

 - A simple model, no statistical assumptions other than between/within class variance to decide on splits
 - For example, no assumptions about the distribution of residuals
 - So it can deal with non-linear and threshold relations
 - No need to transform predictors or the response variable
 - Predictive power is quantified by cross-validation; this also controls complexity to avoid over-fitting

Disadvantages of CART

 - No model to interpret (although we can see variable importance)
 - Predictive power over a population depends on a sample that is representative of that population
 - Quite sensitive to the sample, even when pruned
 - Pruning to a complexity parameter depends on 10-fold cross-validation, which is sensitive to the choice of observations in each fold
 - Typically makes only a small number of different predictions ("boxes"), so maps made with it show discontinuities ("jumps")

Tree terminology

 - splitting variable: variable to examine, to decide which branch of the tree to follow
 - root node: variable used for the first split; overall mean and total number of observations
 - interior node: splitting variable, value on which to split, mean and number to be split
 - leaf: predicted value, number of observations contributing to it
 - cutpoint of the splitting variable: value used to decide which branch to follow
 - growing the tree
 - pruning the tree

Example regression tree

 - Meuse River soil heavy metals dataset
 - Response variable: log(Zn) concentration in topsoil
 - Predictor variables:
   1 distance to Meuse river (continuous)
   2 elevation above sea level (continuous)
   3 flood frequency class (categorical, 3 classes)
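
A minimal sketch of how such a tree could be fit in R, assuming the rpart package and the meuse data frame from the sp package (the lecture does not state which implementation produced the trees on the following slides):

```r
## Fit a regression tree for topsoil log10(Zn) from distance to river, elevation,
## and flood-frequency class (meuse data set from the sp package).
library(rpart)
data(meuse, package = "sp")
meuse$logZn <- log10(meuse$zinc)   # zinc is in mg kg-1

rt <- rpart(logZn ~ dist.m + elev + ffreq, data = meuse)
print(rt)    # text view: splitting variable, cutpoint, n, and mean at each node
# rpart.plot::rpart.plot(rt)    # graphical view, if the rpart.plot package is installed
```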

Example regression tree – first split

[Tree diagram: root node (mean 2.56, n = 155) split on dist.m at 145 m; one branch has mean 2.39 (n = 101), the other mean 2.87 (n = 54).]

Splitting variable: distance to river. Is the point closer or further than 145 m from the river? 101 points are 145 m or further, 54 points are closer.

Explanation of first split

 - root: average log(Zn) of the whole dataset is 2.56 log(mg kg-1) fine soil, based on all 155 observations
 - splitting variable at root: distance to river
 - cutpoint at root: 145 m
 - leaves:
   - distance < 145 m: 54 observations, their mean is 2.87 log(mg kg-1)
   - distance ≥ 145 m: 101 observations, their mean is 2.39 log(mg kg-1)
 - the full dataset has been split into two more homogeneous subsets

Example regression tree – second split

[Tree diagram: root (2.56, n = 155) split on dist.m at 145 m. The branch with mean 2.39 (n = 101) is split on elev at 8.15 m into leaves with means 2.35 (n = 93) and 2.84 (n = 8). The branch with mean 2.87 (n = 54) is split on elev at 6.94 m into leaves with means 2.65 (n = 15) and 2.96 (n = 39).]

For both branches, what is the elevation of the point?

Note: it is a coincidence that both branches split on the same variable here; different splitting variables can be used on different branches.

Explanation of second split

 - the interior nodes were leaves after the first split; they are now 'roots' of subtrees
 - left: distance ≥ 145 m: 101 observations, their mean is 2.39 log(mg kg-1) – note the smaller mean on the left
 - right: distance < 145 m: 54 observations, their mean is 2.87 log(mg kg-1)
 - splitting variable at the interior node for ≥ 145 m: elevation
 - cutpoint at the interior node for ≥ 145 m: 8.15 m a.s.l.
 - splitting variable at the interior node for < 145 m: elevation
 - cutpoint at the interior node for < 145 m: 6.94 m a.s.l.
 - leaves: 93, 8, 15, 39 observations; means 2.35, 2.84, 2.65, 2.96 log(mg kg-1)
 - These leaves are now more homogeneous than the interior nodes.

Example regression tree – third split

[Tree diagram: the third split divides two of the leaves again on distance: the leaf with mean 2.35 (n = 93) on dist.m at 230 m into means 2.31 (n = 78) and 2.55 (n = 15), and the leaf with mean 2.96 (n = 39) on dist.m at 75 m into means 2.85 (n = 11) and 3.00 (n = 28).]

Example regression tree – fourth split

[Tree diagram: the fourth split divides the leaf with mean 2.31 (n = 78) on elev at 9.03 m into leaves with means 2.22 (n = 29) and 2.37 (n = 49).]

Example regression tree – fifth split

[Tree diagram: the fifth split adds further divisions on elevation and distance; the tree now has eleven leaves with means between 2.11 and 3.08 log(mg kg-1), with from 7 to 31 observations each.]

Example regression tree – maximum possible splits

[Tree diagram: the tree grown to the maximum possible number of splits; nearly every leaf contains a single observation (n = 1), i.e., the tree reproduces the sample exactly and clearly over-fits it.]
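
A sketch of how such an over-grown tree can be produced and then pruned back, again assuming rpart and the meuse data; the control settings below are illustrative, not the lecture's:

```r
## Grow the tree to (nearly) its maximum possible size, then prune it back.
library(rpart)
data(meuse, package = "sp")
meuse$logZn <- log10(meuse$zinc)

## cp = 0 and minsplit = 2 allow splitting all the way down to single observations
rt_full <- rpart(logZn ~ dist.m + elev + ffreq, data = meuse,
                 control = rpart.control(cp = 0, minsplit = 2))
printcp(rt_full)    # cross-validated relative error (xerror) for each complexity parameter

## prune back to the complexity parameter with the lowest cross-validated error
best_cp   <- rt_full$cptable[which.min(rt_full$cptable[, "xerror"]), "CP"]
rt_pruned <- prune(rt_full, cp = best_cp)
```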

How are splits decided?

 1 Take all possible predictors and all possible cutpoints
 2 Split the data (sub)set at all combinations
 3 Compute some measure of discrimination for all of these – i.e., a measure which determines which split is "best"
 4 Select the predictor/split that most discriminates

Criteria for continuous and categorical response variables: see the next slides

How are splits decided? – Continuous response

Select the predictor/split that most increases the between-class variance; equivalently, the one that most decreases the pooled within-class variance,

$$\sum_{l} \sum_{i} (y_{l,i} - \bar{y}_l)^2$$

where $y_{l,i}$ is value i of the target in leaf l, and $\bar{y}_l$ is the mean of the target in leaf l.
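
A minimal sketch of this criterion in R for one continuous predictor (dist.m of the Meuse example): try each observed value as a cutpoint and keep the one that minimizes the pooled within-group sum of squares. (rpart itself places cutpoints between observed values and searches all predictors at once, so this is an illustration only.)

```r
## Exhaustive search for the best cutpoint of one continuous predictor (dist.m),
## minimizing the pooled within-group sum of squares of the response.
data(meuse, package = "sp")
y <- log10(meuse$zinc)
x <- meuse$dist.m

within_ss <- function(cut) {
  left <- y[x < cut]; right <- y[x >= cut]
  sum((left - mean(left))^2) + sum((right - mean(right))^2)
}
cuts <- sort(unique(x))[-1]                      # candidate cutpoints: every observed value but the smallest
best <- cuts[which.min(sapply(cuts, within_ss))]
best                                             # should lie near the 145 m split shown earlier
```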
