Ensemble Methods - The University Of Edinburgh

Transcription

Ensemble Methods
Charles Sutton
Data Mining and Exploration
Spring 2012

Bias and Variance

Consider a regression problem: $Y = f(X) + \epsilon$, $\epsilon \sim N(0, \sigma^2)$, with an estimated regression function $\hat{f}$, e.g., $\hat{f}(x) = w^\top x$.

Suppose we care about the error at a particular $x_0$: $L(y, \hat{f}(x_0)) = (y - \hat{f}(x_0))^2$.

Let's think about the expected error:
$E[L(y, \hat{f}(x_0))] = \int_{-\infty}^{\infty} p(y \mid x_0)\,(y - \hat{f}(x_0))^2 \, dy$

Important: both $y$ and $\hat{f}$ are random!

Bias and Variance

Consider a regression problem: $Y = f(X) + \epsilon$, $\epsilon \sim N(0, \sigma^2)$.

Let's think about the expected error. After some algebra,
$E[L(y, \hat{f}(x_0))] = E[(y - \hat{f}(x_0))^2] = \sigma^2 + \mathrm{Bias}^2(\hat{f}(x_0)) + \mathrm{Var}(\hat{f}(x_0))$
where $\mathrm{Bias}(\hat{f}(x_0)) = E[f(x_0) - \hat{f}(x_0)]$, and the expectation is taken over both $y$ and $\hat{f}$.
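For completeness, a sketch of the "some algebra" step, using only the definitions above plus the usual assumption that the noise $\epsilon$ is independent of the fitted $\hat{f}$:

```latex
\begin{align*}
E\big[(y - \hat{f}(x_0))^2\big]
  &= E\big[(\epsilon + f(x_0) - \hat{f}(x_0))^2\big]
     && \text{since } y = f(x_0) + \epsilon \\
  &= E[\epsilon^2] + 2\,E[\epsilon]\,E\big[f(x_0) - \hat{f}(x_0)\big]
     + E\big[(f(x_0) - \hat{f}(x_0))^2\big]
     && \epsilon \perp \hat{f},\; E[\epsilon] = 0 \\
  &= \sigma^2 + \big(E[f(x_0) - \hat{f}(x_0)]\big)^2 + \mathrm{Var}\big(\hat{f}(x_0)\big)
   = \sigma^2 + \mathrm{Bias}^2\big(\hat{f}(x_0)\big) + \mathrm{Var}\big(\hat{f}(x_0)\big).
\end{align*}
```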

Bias and Variance

Stable classification methods: lower variance, higher bias.
Flexible methods: higher variance, lower bias.
We would like to minimize both, but often must trade off.
(Data drawn from $p(x, y)$.)

[Figure 2.11 from Hastie, Tibshirani, and Friedman, 2009: test and training (prediction) error as a function of model complexity; low complexity gives high bias / low variance, high complexity gives low bias / high variance.]

Notation

Feature vector: $x \in \mathbb{R}^n$
Classifier output: $h(x) \in \{-1, +1\}$
Example: $h([0, 1, 1]) = 1$
[Figure: a small example classifier on features $x_1$, $x_2$.]

What is an ensemble?

A group of classifiers that vote (perhaps weighted) on the answer:
$h(x) = \sum_{m=1}^{M} \alpha_m h_m(x)$
where the $h_m$ are the individual classifiers and the $\alpha_m$ are their weights.
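A minimal sketch of this weighted vote in Python (the names `ensemble_predict`, `classifiers`, and `alphas` are illustrative, not from the slides):

```python
import numpy as np

def ensemble_predict(classifiers, alphas, x):
    """Weighted vote of base classifiers h_m(x) in {-1, +1}.

    classifiers: list of callables h_m; alphas: their weights alpha_m.
    Returns the sign of the weighted sum, i.e. the ensemble's decision.
    """
    score = sum(a * h(x) for a, h in zip(alphas, classifiers))
    return 1 if score >= 0 else -1
```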

Why an ensemble?

- Smooth the variance of unstable classifiers
- Combine classifiers with different biases
- Different classifiers can "specialise" in different parts of the input space

Bagging

- We said decision trees are unstable.
- Let's generate a bunch of data sets, and average the results!

$h(x) = \sum_{m=1}^{M} \alpha_m h_m(x)$
where each $h_m$ is trained on its own data set and $h_m(x)$ is the probability from that decision tree.

The bootstrap

- But where do we get all of those data sets?
- Crazy idea: let's get them from the training data.
- Resample with replacement.
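A minimal sketch of one bootstrap resample (function and argument names are illustrative):

```python
import numpy as np

def bootstrap_sample(X, y, rng=None):
    """Draw one bootstrap resample: N examples sampled with replacement.

    X is an (N, d) feature array, y the corresponding N labels.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(y)
    idx = rng.integers(0, n, size=n)  # indices drawn with replacement
    return X[idx], y[idx]
```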

Example

[Table: the weather / "play tennis" training set, with columns OUTLOOK, TEMP, HUMIDITY, WIND, PLAY (14 examples), shown next to a second version of the data obtained by resampling with replacement.]

Example

[Slide: the original weather data set and the bootstrap resample, labelled "Resampled". Run decision tree learning: the slide shows the resulting decision tree, with splits on OUTLOOK, TEMP, and HUMIDITY.]

Example

[Slide: the same two data sets. Run decision tree learning on each: the two resulting decision trees differ, splitting on different combinations of HUMIDITY, OUTLOOK, TEMP, and WIND.]

Example

[Slide: one decision tree for each resampled data set; the trees split on different features (HUMIDITY, OUTLOOK, WIND).]

Back to bagging

INPUT: D denotes training data, of size N
for j from 1 ... M do
    Sample data set $D_j$ of size N from D with replacement
    Train classifier $h_j$ on $D_j$
end for
Return a new classifier h that classifies new examples x as $h(x) = \sum_{j=1}^{M} h_j(x)$
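A sketch of this pseudocode in Python, assuming scikit-learn's DecisionTreeClassifier as the base learner and labels coded as -1/+1 (the choice of base learner and of M are illustrative, not prescribed by the slides):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bag_trees(X, y, M=50, rng=None):
    """Train M decision trees, each on a bootstrap resample of (X, y)."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(y)
    trees = []
    for _ in range(M):
        idx = rng.integers(0, n, size=n)                 # sample with replacement
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def bag_predict(trees, X):
    """Classify new examples by the sign of the summed (unweighted) votes."""
    votes = np.sum([t.predict(X) for t in trees], axis=0)
    return np.sign(votes)   # an exact tie yields 0
```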

How bagging can help

[Table 2 from Breiman, 1996: misclassification rates (%) of a single decision tree (S) and of bagging (B) on several data sets (waveform, heart, breast cancer, ...). The random division of the data into a training set and a test set T is repeated 100 times and the reported rates are averages over the 100 iterations; for the waveform data, 1800 new cases are generated at each iteration. Bagging lowers the error rate on every data set; Table 3 of the paper gives the estimated standard errors.]

S: decision tree, B: bagging  [Breiman, 1996]

Example: Glass data set

- Standard data set from UCI ML repository
- 7 classes, such as: building windows, vehicle windows, headlamps
- 10 features, such as: % Na by weight, % Al by weight, refractive index

RI       Na     Mg    Al    ...  CLASS
1.51793  12.79  3.50  1.12  ...  building (float)
1.51643  12.16  3.52  1.35  ...  vehicle (float)
1.51793  13.21  3.48  1.41  ...  building (float)

When to bag

- Bagging decision trees usually helps (but random forests and boosting are better)
- The classifier needs to be unstable
- Bagging 1-nearest neighbour is not so helpful

Boosting

- The idea was to transform a "weak learner" into a strong one
- The only requirement for a weak learner is that its accuracy is slightly above 50% (in the two-class case)
- Examples of weak learners: decision "stumps", naive Bayes (a weighted stump learner is sketched below)
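As a concrete example of a weak learner, here is a minimal sketch of a decision stump trained on weighted data (assumptions not stated on the slides: binary 0/1 features, labels in {-1, +1}, and the name `fit_stump`). It also serves as the `fit_weak` helper assumed in the AdaBoost sketch further below.

```python
import numpy as np

def fit_stump(X, y, D):
    """Train a decision stump on weighted data.

    X: (N, d) array of binary features, y: labels in {-1, +1},
    D: weight distribution over the N examples.
    Picks the single feature and pair of branch labels with the
    lowest weighted training error; returns a predictor function.
    """
    n, d = X.shape
    best = None
    for j in range(d):
        for v0 in (-1, 1):          # label predicted when X[:, j] == 0
            for v1 in (-1, 1):      # label predicted when X[:, j] == 1
                pred = np.where(X[:, j] == 1, v1, v0)
                err = np.sum(D * (pred != y))
                if best is None or err < best[0]:
                    best = (err, j, v0, v1)
    _, j, v0, v1 = best
    return lambda Xnew: np.where(Xnew[:, j] == 1, v1, v0)
```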

Ideas behind boosting

- Boosting is a general term for methods that try to "amplify" a weak learner into a better one.
- Rather than picking different training sets, reweight the training set.
- Pick the weights based on which examples were misclassified previously.

Weighted Examples

Up until now, our data sets have been $D = \{(x_i, y_i) : i \in [1, N]\}$.
Now we need to handle weighted data sets $D = \{(D(i), x_i, y_i) : i \in [1, N]\}$,
where $D(i)$ is a distribution over instances: $D(i) \ge 0$ and $\sum_{i=1}^{N} D(i) = 1$.
Most classifiers can handle this, no problem. (How would decision trees?)
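One possible answer to that last question, as a sketch rather than the slides' answer: decision trees can replace raw counts with summed weights when computing split impurities and leaf majorities. For example, a weighted entropy:

```python
import numpy as np

def weighted_entropy(weights, labels):
    """Entropy of a node in which example i carries weight D(i).

    Class proportions use summed weights instead of raw counts, so
    heavily weighted examples influence the choice of split more.
    """
    total = np.sum(weights)
    ent = 0.0
    for c in np.unique(labels):
        p = np.sum(weights[labels == c]) / total   # weighted class proportion
        if p > 0:
            ent -= p * np.log2(p)
    return ent
```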

AdaBoost.M1

Given: $(x_1, y_1), \ldots, (x_N, y_N)$ where $x_i \in \mathcal{X}$, $y_i \in \{-1, +1\}$.
Initialize $D_1(i) = 1/N$.
For $t = 1, \ldots, T$:
  - Train base learner using distribution $D_t$.
  - Get base classifier $h_t$.
  - Choose $\alpha_t$.
  - Update: $D_{t+1}(i) = \dfrac{D_t(i)\exp\{-\alpha_t y_i h_t(x_i)\}}{Z_t}$,
    where $Z_t$ is a normalization factor (chosen so that $D_{t+1}$ will be a distribution).
Output the final classifier: $H(x) = \mathrm{sign}\!\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$.

[Figure 1 from Schapire's survey: the boosting algorithm AdaBoost.]

Updating weights

[The same AdaBoost figure as on the previous slide.]

The weighted training error of $h_t$ is
$\mathrm{err}_t = \sum_{i=1}^{N} D_t(i)\, I(y_i \neq h_t(x_i))$.
Use the following "magic" choice:
$\alpha_t = \frac{1}{2} \log \frac{1 - \mathrm{err}_t}{\mathrm{err}_t}$
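Putting the pieces together, a minimal sketch of the whole loop in Python (the helper `fit_weak(X, y, D)`, e.g. the stump learner sketched earlier, is an assumed interface: it trains on the weighted data and returns a predictor with values in {-1, +1}):

```python
import numpy as np

def adaboost(X, y, fit_weak, T=50):
    """AdaBoost.M1 for labels y in {-1, +1}, following the slide's update."""
    n = len(y)
    D = np.full(n, 1.0 / n)                       # D_1(i) = 1/N
    hs, alphas = [], []
    for _ in range(T):
        h = fit_weak(X, y, D)                     # train base learner on distribution D_t
        pred = h(X)
        err = np.sum(D * (pred != y))             # weighted error err_t
        if err == 0 or err >= 0.5:                # degenerate weak learner: stop early
            break
        alpha = 0.5 * np.log((1 - err) / err)     # the "magic" choice of alpha_t
        D = D * np.exp(-alpha * y * pred)         # reweight the examples
        D /= D.sum()                              # normalize by Z_t
        hs.append(h)
        alphas.append(alpha)
    return hs, alphas

def adaboost_predict(hs, alphas, X):
    """Final classifier: sign of the weighted vote."""
    return np.sign(sum(a * h(X) for a, h in zip(alphas, hs)))
```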

Ex: Boosted decision stumps

i   y    x1  x2  x3
1   +1   1   1   0
2   -1   1   0   1
3   -1   0   0   1
4   +1   0   0   1
5   +1   0   1   1
6   -1   1   0   0

$D_1(i) = 1/6$ for $i \in \{1, 2, \ldots, 6\}$
Let's do the first iteration of boosting on this data set.

Ex: Boosted decision stumps

[Same data table as on the previous slide.]

Stump $h_1$ splits on $x_1$:
  $x_1 = 1$: $p(y = 1) = 1/3$, class $-1$
  $x_1 = 0$: $p(y = 1) = 2/3$, class $+1$

$D_1(i) = 1/6$ for $i \in \{1, 2, \ldots, 6\}$

Ex: Boosted decision stumps

[Same data table; examples 1 and 3 are misclassified by $h_1$ (marked with an X on the slide).]

Stump $h_1$: $x_1 = 1 \Rightarrow p(y=1) = 1/3$, class $-1$; $x_1 = 0 \Rightarrow p(y=1) = 2/3$, class $+1$
$D_1(i) = 1/6$ for $i \in \{1, 2, \ldots, 6\}$

$\mathrm{err}_1 = 0.3333$
$\alpha_1 = \frac{1}{2} \log \frac{1 - \mathrm{err}_1}{\mathrm{err}_1} = \frac{1}{2} \log 2 = 0.3465$
$\exp\{\alpha_1\} = 1.414$, $\exp\{-\alpha_1\} = 0.707$

Ex: Boosted decision stumps

[Same data table; examples 1 and 3 misclassified by $h_1$.]

i   exp{-α_1 y_i h_1(x_i)}   updated weight D_2(i)
1   1.414                    0.25
2   0.707                    0.125
3   1.414                    0.25
4   0.707                    0.125
5   0.707                    0.125
6   0.707                    0.125

SUM of the exp column = 5.65, so each new weight is its exp value divided by 5.65.
$\exp\{\alpha_1\} = 1.414$, $\exp\{-\alpha_1\} = 0.707$
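A quick numeric check of this first iteration, assuming the data table and stump as reconstructed above:

```python
import numpy as np

# Labels and feature x1 from the worked example
y  = np.array([ 1, -1, -1,  1,  1, -1])
x1 = np.array([ 1,  1,  0,  0,  0,  1])
h1 = np.where(x1 == 1, -1, 1)            # the stump: predict -1 if x1 == 1, else +1

D1 = np.full(6, 1 / 6)
err = np.sum(D1 * (h1 != y))             # 0.3333
alpha = 0.5 * np.log((1 - err) / err)    # ~0.3466
w = np.exp(-alpha * y * h1)              # 1.414 for the two mistakes, 0.707 otherwise
D2 = D1 * w / np.sum(D1 * w)             # [0.25, 0.125, 0.25, 0.125, 0.125, 0.125]
print(err, alpha, D2)
```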

Ex: Iteration 2

[Same data table, now with the updated weights $D_2(i) = (0.25, 0.125, 0.25, 0.125, 0.125, 0.125)$.]

Now induce another decision stump, with examples weighted by $D_2$.
E.g., if we split on $x_1$ again:
  $x_1 = 1$: weighted $p(y = 1) = 0.5$
  $x_1 = 0$: weighted $p(y = 1) = 0.5$
(so a split on $x_1$ is no longer informative under the new weights).

[Plot: $\alpha_t = \frac{1}{2}\log\frac{1-\mathrm{err}}{\mathrm{err}}$ as a function of the error rate; vertical axis alpha from -4 to 4, horizontal axis err from 0.0 to 1.0.]

[Plot: loss as a function of the margin $y \cdot f$ (from -2 to 2), comparing exponential loss, binomial deviance, squared error, and support vector loss; vertical axis is the loss, from 0 to 3.0.]

Loss functions

Can view AdaBoost as approximately minimizing prediction error.
I.e., we want to learn an ensemble
$H(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$
that minimizes training error
$\mathrm{Err} = \frac{1}{N} \sum_{i=1}^{N} I(y_i \neq H(x_i))$
with respect to $\{\alpha_t\}$ and $\{h_t\}$.
This is difficult to optimize directly (e.g., why not gradient descent?), so we approximate.

Loss functions

AdaBoost trick: minimize an upper bound on the error
$\frac{1}{N} \sum_{i=1}^{N} I(y_i \neq H(x_i)) \le \frac{1}{N} \sum_{i=1}^{N} \exp\{-y_i H(x_i)\}$
$= \frac{1}{N} \sum_{i=1}^{N} \exp\Big\{-y_i \sum_t \alpha_t h_t(x_i)\Big\}$
$=: L(\alpha_1, \alpha_2, \ldots, \alpha_t, h_1, h_2, \ldots, h_t) = \prod_t Z_t$   (!!!)

We still can't minimise this exactly, so be greedy: alternately minimise with respect to $\alpha_t$ and $h_t$.
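A sketch of the greedy step for $\alpha_t$ (holding $h_t$ and the current weights $D_t$ fixed), which recovers the "magic" choice from the earlier slide:

```latex
Z_t(\alpha_t) = \sum_{i=1}^{N} D_t(i)\, e^{-\alpha_t y_i h_t(x_i)}
             = (1 - \mathrm{err}_t)\, e^{-\alpha_t} + \mathrm{err}_t\, e^{\alpha_t}

\frac{\partial Z_t}{\partial \alpha_t}
  = -(1 - \mathrm{err}_t)\, e^{-\alpha_t} + \mathrm{err}_t\, e^{\alpha_t} = 0
\;\Longrightarrow\;
e^{2\alpha_t} = \frac{1 - \mathrm{err}_t}{\mathrm{err}_t}
\;\Longrightarrow\;
\alpha_t = \tfrac{1}{2} \log \frac{1 - \mathrm{err}_t}{\mathrm{err}_t}.
```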

AdaBoost.M1

[The AdaBoost algorithm from Figure 1 again, now annotated with how each step relates to $L$:]

- Training the base learner on distribution $D_t$ (to get $h_t$) minimizes $L$ with respect to $h_t$.
- Choosing $\alpha_t = \frac{1}{2} \log \frac{1 - \mathrm{err}_t}{\mathrm{err}_t}$ minimizes $L$ with respect to $\alpha_t$.

Boosting can help

[Figure 10.2 from Hastie, Tibshirani, and Friedman, 2009: test error on simulated data as a function of boosting iterations (0 to 400), for boosted stumps compared with a single stump and a 244-node tree; test error axis from 0.0 to 0.5.]

Boosting can help

[Figure 3 from Schapire's survey: scatter plots comparing the test error of C4.5 (vertical axis, 0 to 30) against boosting stumps and against boosting C4.5 (horizontal axes, 0 to 30) on a set of 27 benchmark problems, as reported by Freund and Schapire. Each point is one benchmark problem.]

On glass data

Standard DT:  61%
Bagged DT:    68%
AdaBoost:     70%

Differences between bagging and boosting

- Bagging is for unstable (i.e., high variance) classifiers
- Boosting is often useful for biased classifiers as well
- Both improve performance
- Both need to choose a base classifier
- Boosting typically performs better
- Both lose interpretability

Heterogeneous ensembles

- Data mining competitions are usually won by ensemble methods (e.g., Netflix)
- Often the classifiers are completely heterogeneous

Examinable reading (on Web site):

- Rob Schapire, The Boosting Approach to Machine Learning (Sections 4-8 not examinable)
- Leo Breiman, Bagging Predictors, Machine Learning, 1996
