Ensemble Methods: Bagging And Boosting - IIT Kanpur

Transcription

Ensemble Methods: Bagging and Boosting
Piyush Rai
Machine Learning (CS771A)
Oct 26, 2016

Some Simple Ensembles

- Voting or averaging of the predictions of multiple pre-trained models
- "Stacking": use the predictions of multiple models as "features" to train a new model, and use the new model to make predictions on test data
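
To make the two ideas concrete, here is a minimal sketch, assuming scikit-learn-style models and a synthetic dataset; the particular base models, the hypothetical (X, y) data, and the use of out-of-fold probabilities as stacking features are illustrative choices, not part of the lecture.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
base_models = [LogisticRegression(max_iter=1000),
               DecisionTreeClassifier(max_depth=3),
               KNeighborsClassifier()]

# Voting/averaging: average the class-1 probabilities of the pre-trained models
for m in base_models:
    m.fit(X, y)
avg_proba = np.mean([m.predict_proba(X)[:, 1] for m in base_models], axis=0)
vote_pred = (avg_proba >= 0.5).astype(int)

# Stacking: out-of-fold predictions of the base models become features for a new model
meta_features = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models])
meta_model = LogisticRegression().fit(meta_features, y)
```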

Ensembles: Another Approach

- Instead of training different models on the same data, train the same model multiple times on different data sets, and "combine" these "different" models
- We can use some simple/weak model as the base model
- How do we get multiple training data sets? (In practice, we only have one data set at training time)

Bagging

- Bagging stands for Bootstrap Aggregation
- Takes the original data set D with N training examples
- Creates M copies {D̃_m}_{m=1}^M
- Each D̃_m is generated from D by sampling with replacement
- Each data set D̃_m has the same number of examples as data set D
- These data sets are reasonably different from each other (since only about 63% of the original examples appear in any one of these data sets)
- Train models h_1, ..., h_M using D̃_1, ..., D̃_M, respectively
- Use the averaged model h = (1/M) \sum_{m=1}^{M} h_m as the final model
- Useful for models with high variance and noisy data
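
A from-scratch sketch of this procedure, assuming a scikit-learn regression tree as the base model; the copies D̃_m are formed by drawing N indices with replacement, and the final prediction is the average of the M trained models.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagging_fit(X, y, M=25, seed=0):
    rng = np.random.default_rng(seed)
    N = len(X)
    models = []
    for _ in range(M):
        idx = rng.integers(0, N, size=N)            # sample N examples with replacement
        models.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    # Averaged model: h(x) = (1/M) * sum_m h_m(x)
    return np.mean([m.predict(X) for m in models], axis=0)
```

The 63% figure comes from the fact that a given example appears in a bootstrap copy with probability 1 - (1 - 1/N)^N ≈ 1 - 1/e ≈ 0.632.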

Bagging: Illustration

[Figure] Top: original data. Middle: three models (from some model class) learned using three data sets chosen via bootstrapping. Bottom: the averaged model.

Random Forests

- An ensemble of decision tree (DT) classifiers
- Uses bagging on features (each DT will use a random set of features)
- Given a total of D features, each DT uses √D randomly chosen features
- Randomly chosen features make the different trees uncorrelated
- All DTs usually have the same depth
- Each DT will split the training data differently at the leaves
- Prediction for a test example: vote on / average the predictions from all the DTs
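
A sketch of the recipe on this slide, with each tree trained on a bootstrap sample and on roughly √D randomly chosen features; note that library implementations such as scikit-learn's RandomForestClassifier draw a fresh random feature subset at every split rather than once per tree, so this is an illustrative simplification.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest_fit(X, y, M=50, depth=5, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    k = max(1, int(np.sqrt(D)))                      # ~sqrt(D) features per tree
    forest = []
    for _ in range(M):
        rows = rng.integers(0, N, size=N)            # bootstrap sample of the examples
        cols = rng.choice(D, size=k, replace=False)  # random feature subset for this tree
        tree = DecisionTreeClassifier(max_depth=depth).fit(X[rows][:, cols], y[rows])
        forest.append((tree, cols))
    return forest

def random_forest_predict(forest, X):
    # Majority vote over the trees (assumes binary labels in {0, 1})
    votes = np.mean([tree.predict(X[:, cols]) for tree, cols in forest], axis=0)
    return (votes >= 0.5).astype(int)
```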

Boosting

The basic idea:

- Take a weak learning algorithm
- Only requirement: it should be slightly better than random
- Turn it into an awesome one by making it focus on difficult cases

Most boosting algorithms follow these steps:

1. Train a weak model on some training data
2. Compute the error of the model on each training example
3. Give higher importance to examples on which the model made mistakes
4. Re-train the model using "importance weighted" training examples
5. Go back to step 2
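
A schematic version of this loop, assuming a weak learner that accepts per-example weights (here a depth-1 scikit-learn tree via sample_weight); the rule for increasing the weights of misclassified examples is a deliberately simple placeholder, since the exact re-weighting rule is what defines a concrete algorithm such as AdaBoost on the next slide.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def generic_boosting(X, y, rounds=10):
    N = len(X)
    w = np.full(N, 1.0 / N)                  # importance weight of each training example
    models = []
    for _ in range(rounds):
        h = DecisionTreeClassifier(max_depth=1)
        h.fit(X, y, sample_weight=w)         # 1. train a weak model on weighted data
        mistakes = h.predict(X) != y         # 2. compute where it errs
        w[mistakes] *= 2.0                   # 3. give misclassified examples higher importance
        w /= w.sum()                         #    keep the weights a distribution
        models.append(h)                     # 4./5. re-train with the new weights next round
    return models
```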

The AdaBoost Algorithm

Given: training data (x_1, y_1), ..., (x_N, y_N) with y_n ∈ {-1, +1} for all n

Initialize the weight of each example (x_n, y_n): D_1(n) = 1/N for all n

For round t = 1 : T

- Learn a weak classifier h_t(x) ∈ {-1, +1} using the training data weighted as per D_t
- Compute the weighted fraction of errors of h_t on this training data:

    \epsilon_t = \sum_{n=1}^{N} D_t(n)\, \mathbb{1}[h_t(x_n) \neq y_n]

- Set the "importance" of h_t (gets larger as ε_t gets smaller):

    \alpha_t = \frac{1}{2} \log\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)

- Update the weight of each example:

    D_{t+1}(n) = \begin{cases} D_t(n)\exp(-\alpha_t) & \text{if } h_t(x_n) = y_n \text{ (correct prediction: decrease weight)} \\ D_t(n)\exp(\alpha_t) & \text{if } h_t(x_n) \neq y_n \text{ (incorrect prediction: increase weight)} \end{cases}

  which can be written compactly as D_{t+1}(n) = D_t(n)\exp(-\alpha_t y_n h_t(x_n))

- Normalize D_{t+1} so that it sums to 1:

    D_{t+1}(n) = \frac{D_{t+1}(n)}{\sum_{m=1}^{N} D_{t+1}(m)}

Output the "boosted" final hypothesis H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)
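
A compact NumPy/scikit-learn implementation of exactly these steps, using depth-1 decision trees (stumps) as the weak learners h_t; labels are assumed to be in {-1, +1} as on the slide, and the small clipping of ε_t is only a guard against a perfect stump.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=20):
    """AdaBoost as on the slide; y must take values in {-1, +1}."""
    N = len(X)
    D = np.full(N, 1.0 / N)                          # D_1(n) = 1/N
    stumps, alphas = [], []
    for t in range(T):
        h = DecisionTreeClassifier(max_depth=1)
        h.fit(X, y, sample_weight=D)                 # weak learner on data weighted by D_t
        pred = h.predict(X)
        eps = np.clip(np.sum(D * (pred != y)), 1e-10, 1 - 1e-10)  # weighted error eps_t
        alpha = 0.5 * np.log((1 - eps) / eps)        # importance alpha_t
        D = D * np.exp(-alpha * y * pred)            # D_{t+1}(n) ∝ D_t(n) exp(-alpha_t y_n h_t(x_n))
        D = D / D.sum()                              # normalize to sum to 1
        stumps.append(h)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    # H(x) = sign(sum_t alpha_t h_t(x))
    F = sum(a * h.predict(X) for a, h in zip(alphas, stumps))
    return np.sign(F)
```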

AdaBoost: Example

- Consider binary classification with 10 training examples
- The initial weight distribution D_1 is uniform (each point has equal weight 1/10)
- Each of our weak classifiers will be an axis-parallel linear classifier

After Round 1

- Error rate of h_1: ε_1 = 0.3; weight of h_1: α_1 = ½ ln((1 - ε_1)/ε_1) = 0.42
- Each misclassified point is upweighted (weight multiplied by exp(α_1))
- Each correctly classified point is downweighted (weight multiplied by exp(-α_1))

After Round 2

- Error rate of h_2: ε_2 = 0.21; weight of h_2: α_2 = ½ ln((1 - ε_2)/ε_2) = 0.65
- Each misclassified point is upweighted (weight multiplied by exp(α_2))
- Each correctly classified point is downweighted (weight multiplied by exp(-α_2))

After Round 3

- Error rate of h_3: ε_3 = 0.14; weight of h_3: α_3 = ½ ln((1 - ε_3)/ε_3) = 0.92
- Suppose we decide to stop after round 3
- Our ensemble now consists of 3 classifiers: h_1, h_2, h_3
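
The three importance weights above follow directly from α_t = ½ ln((1 - ε_t)/ε_t); a quick check (the small differences from the slide's 0.65 and 0.92 come from the error rates being rounded to two decimals on the slides):

```python
import numpy as np
for eps in (0.30, 0.21, 0.14):
    print(round(0.5 * np.log((1 - eps) / eps), 2))   # prints 0.42, 0.66, 0.91
```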

Final Classifier

- The final classifier is a weighted linear combination of all the classifiers
- Classifier h_i gets a weight α_i
- Multiple weak, linear classifiers are combined to give a strong, nonlinear classifier

Another Example

- Given: a nonlinearly separable dataset
- We want to use the Perceptron (a linear classifier) on this data

AdaBoost: Rounds 1-7 and Round 40

[Figures] After round t, our ensemble has t linear classifiers (Perceptrons); this is shown for rounds 1 through 7 and for round 40. In each bottom figure, the X axis is the number of rounds and the Y axis is the training error.

Boosted Decision Stumps = Linear Classifier

- A decision stump (DS) is a tree with a single node (testing the value of a single feature, say the d-th feature)
- Suppose each example x has D binary features {x_d}_{d=1}^D, with x_d ∈ {0, 1}, and the label y is also binary, i.e., y ∈ {-1, +1}
- The DS (assuming it tests the d-th feature) will predict the label as

    h(x) = s(2x_d - 1),  where s ∈ {-1, +1}

- Suppose we have T such decision stumps h_1, ..., h_T, testing feature numbers i_1, ..., i_T, respectively, i.e., h_t(x) = s_t(2x_{i_t} - 1)
- The boosted hypothesis H(x) = \mathrm{sign}(\sum_{t=1}^{T} \alpha_t h_t(x)) can then be written as

    H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t s_t (2x_{i_t} - 1)\right) = \mathrm{sign}\left(\sum_{t=1}^{T} 2\alpha_t s_t x_{i_t} - \sum_{t=1}^{T} \alpha_t s_t\right) = \mathrm{sign}(w^\top x + b)

  where w_d = \sum_{t: i_t = d} 2\alpha_t s_t and b = -\sum_{t=1}^{T} \alpha_t s_t
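
A small numerical check of this identity, with hypothetical stump parameters (i_t, s_t, α_t) drawn at random: collapsing the T stumps into (w, b) as above produces exactly the same predictions as the boosted ensemble.

```python
import numpy as np

rng = np.random.default_rng(0)
D, T, N = 6, 4, 10
X = rng.integers(0, 2, size=(N, D))          # binary features x_d in {0, 1}
i = rng.integers(0, D, size=T)               # feature i_t tested by each stump
s = rng.choice([-1, 1], size=T)              # sign s_t of each stump
alpha = rng.random(T)                        # importance alpha_t of each stump

# Boosted ensemble: H(x) = sign(sum_t alpha_t * s_t * (2 x_{i_t} - 1))
H_boost = np.sign(sum(alpha[t] * s[t] * (2 * X[:, i[t]] - 1) for t in range(T)))

# Equivalent linear classifier: w_d = sum_{t: i_t = d} 2 alpha_t s_t,  b = -sum_t alpha_t s_t
w = np.zeros(D)
for t in range(T):
    w[i[t]] += 2 * alpha[t] * s[t]
b = -np.sum(alpha * s)
H_linear = np.sign(X @ w + b)

assert np.array_equal(H_boost, H_linear)     # identical predictions
```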

Boosting: Some Comments

- For AdaBoost, given each weak model's error ε_t = 1/2 - γ_t, the training error consistently gets better with rounds:

    \text{train-error}(H_{\text{final}}) \leq \exp\left(-2\sum_{t=1}^{T} \gamma_t^2\right)

- Boosting algorithms can be shown to be minimizing a loss function
- E.g., AdaBoost has been shown to be minimizing an exponential loss

    L = \sum_{n=1}^{N} \exp\{-y_n H(x_n)\}

  where H(x) = \mathrm{sign}(\sum_{t=1}^{T} \alpha_t h_t(x)), given weak base classifiers h_1, ..., h_T
- Boosting in general can perform badly if some examples are outliers
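
One way to read the exponential-loss claim: with the unthresholded score F(x) = Σ_t α_t h_t(x) (the quantity inside the sign above), exp(-y_n F(x_n)) ≥ 1 whenever the final classifier errs on x_n, so the exponential loss upper-bounds the number of training mistakes; driving it down drives the training error down. A toy check with hypothetical scores:

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.choice([-1, 1], size=100)       # labels in {-1, +1}
F = rng.normal(size=100)                # hypothetical unthresholded scores F(x_n)

mistakes = np.sum(np.sign(F) != y)      # 0/1 training errors of sign(F)
exp_loss = np.sum(np.exp(-y * F))       # L = sum_n exp(-y_n F(x_n))
assert exp_loss >= mistakes             # exponential loss upper-bounds the mistakes
```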

Bagging vs Boosting

- No clear winner; usually depends on the data
- Bagging is computationally more efficient than boosting (note that bagging can train the M models in parallel; boosting can't)
- Both reduce variance (and overfitting) by combining different models
- The resulting model has higher stability than the individual models
- Bagging usually can't reduce the bias; boosting can (note that in boosting, the training error steadily decreases)
- Bagging usually performs better than boosting if we don't have high bias and only want to reduce variance (i.e., if we are overfitting)
