CSC411 Fall 2014, Machine Learning & Data Mining: Ensemble Methods

Transcription

CSC411, Fall 2014, Machine Learning & Data Mining. Ensemble Methods. Slides by Rich Zemel.

Simplest approach:
1. Generate multiple classifiers
2. Each votes on the test instance
3. Take the majority vote as the classification
This can improve over a single classifier without requiring any fancy new algorithm.
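To make the voting step concrete, here is a minimal Python sketch (my illustration, not from the slides); it assumes a list of already-fitted classifiers that each expose a predict() method and that class labels are small non-negative integers.

```python
import numpy as np

def majority_vote(classifiers, X_test):
    """Combine fitted classifiers by majority vote (assumes integer class labels >= 0)."""
    votes = np.stack([clf.predict(X_test) for clf in classifiers])  # shape (n_classifiers, n_test)
    # For each test instance (one column of votes), return the label with the most votes.
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])
```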

Ensemble methods differ in training strategy and combination method:
1. Parallel training with different training sets: bagging
2. Sequential training, iteratively re-weighting the training examples so the current classifier focuses on hard examples: boosting
3. Parallel training with an objective that encourages division of labor: mixture of experts
Notes: Also known as meta-learning. Typically applied to weak models, such as decision stumps (single-node decision trees) or linear classifiers.

Variance-bias tradeoff? Minimize two sets of errors:
1. Variance: error from sensitivity to small fluctuations in the training set
2. Bias: erroneous assumptions in the model
The variance-bias decomposition analyzes expected generalization error as a sum of bias, variance, and irreducible error (resulting from the problem itself).

Ensemble methods rest on two basic observations:
1. Variance reduction: averaging models trained on different samples of the data (e.g., bagging) reduces sensitivity to individual data points
2. Bias reduction: combining simple models can yield a more expressive model than any single component (e.g., boosting)

Ensemble methods: Justification
Ensemble methods are more accurate than any of their individual members provided the members are:
- Accurate (better than guessing)
- Diverse (make different errors on new examples)
Independent errors: the probability that k of N classifiers (each with independent error rate ε) are wrong follows a binomial distribution, and the ensemble errs only when more than N/2 members are wrong.
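A small Python sketch of this independence argument (my own illustration, not from the slides): with N classifiers that each err independently with rate ε, the majority vote is wrong only when more than N/2 of them are wrong, which is a binomial tail probability.

```python
from math import comb

def majority_error(N: int, eps: float) -> float:
    """P(more than N/2 of N independent classifiers with error rate eps are wrong)."""
    return sum(comb(N, k) * eps**k * (1 - eps)**(N - k) for k in range(N // 2 + 1, N + 1))

# With 21 classifiers at 30% individual error, the majority vote errs far less often:
print(majority_error(21, 0.3))  # roughly 0.026
```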

"... substantially different approaches, rather than refining a single predictor."

Bootstrap estimation:
- Repeatedly draw n samples from D
- For each set of samples, estimate a statistic
- The bootstrap estimate is the mean of the individual estimates
- Used to estimate a statistic (parameter) and its variance
Bagging: bootstrap aggregation (Breiman, 1994)

Bagging: train a model on each bootstrap sample and average (or vote over) their predictions. Because each bootstrap sample is drawn with replacement, it duplicates some training points and leaves out other training points completely.
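A minimal bagging sketch under these assumptions (my example, not the slides' code): draw bootstrap samples with replacement, fit one base model per sample, and combine by majority vote. scikit-learn's DecisionTreeClassifier is used purely as an example base learner, and X, y are assumed to be numpy arrays with integer class labels.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_models=25, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)                 # bootstrap sample: n draws with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    votes = np.stack([m.predict(X) for m in models])     # (n_models, n_test)
    return np.array([np.bincount(v.astype(int)).argmax() for v in votes.T])
```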

Cross-validated committees:
- Here k disjoint subsets of data are left out of the training sets
- Again uses a majority vote for combination

Boosting: train the component classifiers sequentially, each one focusing on the mistakes of those before it; the final classifier is a weighted combination of the component classifiers.

Making weak learners stronger
- Suppose you have a weak learning module (a "base classifier") that can always get 0.5 + epsilon correct when given a two-way classification task.
  - That seems like a weak assumption, but beware!
- Can you apply this learning module many times to get a strong learner with low error on the training data?
  - Theorists showed how to do this, and it led to an effective new learning procedure (Freund & Schapire, 1996).

Boosting (AdaBoost)
- First train the base classifier on all the training data with equal importance weights on each case.
- Then re-weight the training data to emphasize the hard cases and train a second model.
  - How do we re-weight the data?
- Keep training new models on re-weighted data.
- Finally, use a weighted committee of all the models for the test data.
  - How do we weight the models in the committee?

Boosting
- Probably one of the most influential ideas in machine learning in the last decade.
- In the PAC framework, boosting is a way of converting a "weak" learning model (behaves slightly better than chance) into a "strong" learning model (behaves arbitrarily close to perfect).
- Strong theoretical result, but it also led to a very powerful and practical algorithm which is used all the time in real-world machine learning.
- Basic idea, for binary classification with $t_n \in \{-1, +1\}$:
  $Y_M(x) = \mathrm{sign}\left(\sum_{m=1}^{M} \alpha_m y_m(x)\right)$,
  where the $y_m(x)$ are models trained with re-weighted datasets $D_m$, and the weights $\alpha_m$ are non-negative.

How to train each classifier
Train classifier m to minimize the weighted number of errors,
$J_m = \sum_n w_n^{(m)} \, I(y_m(x_n) \neq t_n)$,
where the indicator $I(\cdot)$ is 1 if error, 0 if correct.

How to weight each training case for classifier m
Let $\epsilon_m = J_m / \sum_n w_n^{(m)}$ be the weighted error rate of the classifier.
Let $\alpha_m = \ln\left(\frac{1-\epsilon_m}{\epsilon_m}\right)$. This is the quality of the classifier: it is zero if the classifier has weighted error rate 0.5, and infinity if the classifier is perfect.
Update the weights: $w_n^{(m+1)} = w_n^{(m)} \exp\{\alpha_m \, I(y_m(x_n) \neq t_n)\}$.

How to make predictions using a committee of classifiers
Weight the binary prediction of each classifier by the quality of that classifier:
$Y_M(x) = \mathrm{sign}\left(\sum_{m=1}^{M} \alpha_m y_m(x)\right)$.

AdaBoost Algorithm
- Initialize the data weights $w_n^{(1)} = 1/N$.
- For $m = 1, \ldots, M$:
  - Fit a classifier $y_m(x)$ to the training data by minimizing the weighted error function
    $J_m = \sum_{n=1}^{N} w_n^{(m)} I(y_m(x_n) \neq t_n)$,
    where $I(y_m(x_n) \neq t_n)$ is the indicator function: it equals one when $y_m(x_n) \neq t_n$ and zero otherwise.
  - Evaluate the weighting coefficients
    $\epsilon_m = \frac{\sum_n w_n^{(m)} I(y_m(x_n) \neq t_n)}{\sum_n w_n^{(m)}}$ (a weighted measure of the error rate) and $\alpha_m = \ln\left(\frac{1-\epsilon_m}{\epsilon_m}\right)$.

AdaBoost Algorithm
- Initialize the data weights $w_n^{(1)} = 1/N$.
- For $m = 1, \ldots, M$:
  - Fit a classifier $y_m(x)$ to the training data by minimizing $J_m = \sum_n w_n^{(m)} I(y_m(x_n) \neq t_n)$.
  - Evaluate $\epsilon_m = \frac{\sum_n w_n^{(m)} I(y_m(x_n) \neq t_n)}{\sum_n w_n^{(m)}}$ and $\alpha_m = \ln\left(\frac{1-\epsilon_m}{\epsilon_m}\right)$.
  - Update the data weights: $w_n^{(m+1)} = w_n^{(m)} \exp\{\alpha_m I(y_m(x_n) \neq t_n)\}$.
- Make predictions using the final model: $Y_M(x) = \mathrm{sign}\left(\sum_{m=1}^{M} \alpha_m y_m(x)\right)$.
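The updates above translate almost line for line into code. Below is a hedged Python sketch (my own, not the course's reference implementation) that assumes labels $t \in \{-1, +1\}$, numpy arrays for X and t, depth-1 decision trees (stumps) from scikit-learn as the weak learners, and that each stump's weighted error stays strictly between 0 and 1 so the log is finite.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, t, M=50):
    N = len(X)
    w = np.full(N, 1.0 / N)                      # initialize data weights w_n = 1/N
    models, alphas = [], []
    for _ in range(M):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, t, sample_weight=w)         # fit y_m by minimizing the weighted error J_m
        miss = (stump.predict(X) != t)           # indicator I(y_m(x_n) != t_n)
        eps = np.dot(w, miss) / w.sum()          # weighted error rate epsilon_m
        alpha = np.log((1.0 - eps) / eps)        # classifier quality alpha_m
        w = w * np.exp(alpha * miss)             # up-weight the misclassified cases
        models.append(stump)
        alphas.append(alpha)
    return models, alphas

def adaboost_predict(models, alphas, X):
    # Y_M(x) = sign(sum_m alpha_m y_m(x))
    return np.sign(sum(a * m.predict(X) for a, m in zip(alphas, models)))
```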

Some Intuitions
- The first classifier corresponds to the usual procedure for training a single classifier.
- At each round, boosting:
  - increases the weight on those examples the last classifier got wrong,
  - decreases the weight on those it got right.
- Over time, AdaBoost focuses on the examples that are consistently difficult and forgets about the ones that are consistently easy.
- The weight each intermediate classifier gets in the final ensemble depends on the error rate it achieved on its weighted training set at the time it was created.
- Hence the weighting coefficients $\alpha_m$ give greater weight to more accurate classifiers.

Some Intuitions: schematic illustration of AdaBoost (figure).

Exponential Loss
- One explanation, which helps a lot to understand how boosting really works, is that classification boosting is equivalent to sequential minimization of the following loss (error) function:
  $E = \sum_n \exp\{-t_n f_m(x_n)\}$.
- This is called the exponential loss, and it is very similar to other kinds of classification loss.
[Figure: losses as a function of the margin. Green: exponential; red: cross-entropy; blue: hinge loss; black: misclassification error (0-1 loss).]
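For reference, the curves in that comparison can be written as functions of the margin z = t·y(x); this is my own sketch (the cross-entropy is rescaled by 1/ln 2, as in Bishop, so that it passes through the point (0, 1)).

```python
import numpy as np

def exponential_loss(z):   return np.exp(-z)                        # boosting's loss
def cross_entropy_loss(z): return np.log1p(np.exp(-z)) / np.log(2)  # logistic loss, rescaled
def hinge_loss(z):         return np.maximum(0.0, 1.0 - z)          # SVM hinge
def zero_one_loss(z):      return (z <= 0).astype(float)            # misclassification error
```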

Problem Setup
Consider the exponential error
$E = \sum_{n=1}^{N} \exp\{-t_n f_m(x_n)\}$,
where $f_m(x)$ is a classifier defined in terms of a linear combination of base classifiers,
$f_m(x) = \tfrac{1}{2} \sum_{l=1}^{m} \alpha_l y_l(x)$,
and $t_n \in \{-1, +1\}$. Our goal is to minimize this objective with respect to the parameters of the base classifiers and the coefficients $\alpha_l$.

Boosting as Forward Additive Modeling
Suppose that the base classifiers $y_1(x), \ldots, y_{m-1}(x)$ and their coefficients $\alpha_1, \ldots, \alpha_{m-1}$ are fixed. We minimize only with respect to $\alpha_m$ and $y_m(x)$:
$E = \sum_n \exp\{-t_n f_{m-1}(x_n) - \tfrac{1}{2} t_n \alpha_m y_m(x_n)\} = \sum_n w_n^{(m)} \exp\{-\tfrac{1}{2} t_n \alpha_m y_m(x_n)\}$,
where the fixed part defines the weights $w_n^{(m)} = \exp\{-t_n f_{m-1}(x_n)\}$, and we optimize over $\alpha_m$ and $y_m(x)$.
Remember: $f_m(x) = \tfrac{1}{2} \sum_{l=1}^{m} \alpha_l y_l(x)$.

Boosting as Forward Additive Modeling
Let A be the set of points that are correctly classified by $y_m(x)$, and B be the set of points that are misclassified by $y_m(x)$. Then
$E = e^{-\alpha_m/2} \sum_{n \in A} w_n^{(m)} + e^{\alpha_m/2} \sum_{n \in B} w_n^{(m)} = (e^{\alpha_m/2} - e^{-\alpha_m/2}) \sum_n w_n^{(m)} I(y_m(x_n) \neq t_n) + e^{-\alpha_m/2} \sum_n w_n^{(m)}$.
So minimizing with respect to $y_m(x)$ is equivalent to minimizing $\sum_n w_n^{(m)} I(y_m(x_n) \neq t_n)$, and minimizing with respect to $\alpha_m$ leads to $\alpha_m = \ln\left(\frac{1-\epsilon_m}{\epsilon_m}\right)$.
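Filling in the differentiation step that leads to that last result (a sketch in the notation above):

```latex
\begin{align*}
\frac{\partial E}{\partial \alpha_m}
  &= \tfrac{1}{2}\bigl(e^{\alpha_m/2} + e^{-\alpha_m/2}\bigr)\sum_n w_n^{(m)} I(y_m(x_n)\neq t_n)
     - \tfrac{1}{2} e^{-\alpha_m/2}\sum_n w_n^{(m)} = 0 \\
&\Longrightarrow\;
  e^{\alpha_m} + 1 = \frac{\sum_n w_n^{(m)}}{\sum_n w_n^{(m)} I(y_m(x_n)\neq t_n)} = \frac{1}{\epsilon_m}
  \;\Longrightarrow\;
  \alpha_m = \ln\frac{1-\epsilon_m}{\epsilon_m}.
\end{align*}
```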

Updating the Weighting Coefficients
The weights on the data points are updated as
$w_n^{(m+1)} = w_n^{(m)} \exp\{-\tfrac{1}{2} t_n \alpha_m y_m(x_n)\}$.
Remember the identity $t_n y_m(x_n) = 1 - 2 I(y_m(x_n) \neq t_n)$. Using it, we obtain
$w_n^{(m+1)} = w_n^{(m)} \exp(-\alpha_m/2) \exp\{\alpha_m I(y_m(x_n) \neq t_n)\}$.
The factor $\exp(-\alpha_m/2)$ does not depend on $n$; it just rescales all the weights, so it can be dropped. This is the update performed by AdaBoost.

Example: base learners are simple thresholds applied to one or another axis.

An impressive example of boosting
- A very fast face detector (Viola & Jones) that can be scanned across a large image to find the faces.
- The base classifier (weak learner) just compares the total intensity in two rectangular pieces of the image.
  - There is a neat trick for computing the total intensity in a rectangle in a few operations, so the base classifiers are cheap to evaluate and they are very fast at run time.
  - Classifiers are added based on their quality on the weighted training cases.

AdaBoost in face detection
A famous application of boosting: detecting faces in images. Two twists on the standard algorithm:
1) Pre-define the weak classifiers, so optimization = selection
2) Change the loss function for the weak learners: false positives are less costly than misses

AdaBoost face detection results (figure).

What are the base classifiers?
Popular choices of base classifier for boosting and other ensemble methods:
– Linear classifiers
– Decision trees

Random/Decision Forests
Definition: ensemble of decision trees
Algorithm:
– Divide training examples into multiple training sets (bagging)
– Train a decision tree on each set (can randomly select a subset of variables to consider)
– Aggregate the predictions of each tree to make the classification decision (e.g., can choose the mode vote)
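As a concrete, hedged illustration of this recipe, scikit-learn's implementation follows the same steps; the parameter values below are arbitrary examples, not values from the slides, and X_train/y_train/X_test are placeholders.

```python
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=100,     # number of bagged trees
    max_features="sqrt",  # random subset of variables considered at each split
    random_state=0,
)
# forest.fit(X_train, y_train)
# y_pred = forest.predict(X_test)   # aggregates the trees' votes (mode)
```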

Ensemble learning: Boosting and Bagging
Experts cooperate to predict the output.
[Diagram: an input x is fed to Expert 1, ..., Expert M, producing y_1(x), ..., y_M(x); these are combined with fixed weights g_1, ..., g_M into the output $y(x) = \sum_m g_m y_m(x)$.]

Mixture of Experts
Competition (among experts) instead of cooperation.
[Diagram: an input x is fed to Expert 1, ..., Expert M and to a gating network; the gating network produces input-dependent weights g_1(x), ..., g_M(x), and the output is $y(x) = \sum_m g_m(x) y_m(x)$.]

Mixture of Experts: Summary
1. Cost function designed to make each expert estimate the desired output independently
2. Gating network softmax over experts: estimates the probability of who is the true expert for a given input
3. Allow each expert to produce a distribution over outputs

Cooperation vs. Specialization
Consider a regression problem. To encourage cooperation, we could train to reduce the discrepancy between the average of the predictors and the target:
$E = \left(t - \tfrac{1}{M}\sum_m y_m\right)^2$
This can overfit badly: it makes the model much more powerful than training each predictor separately, since each expert is pushed to raise or lower its prediction just to correct the ensemble average.

Cooperation vs. Specialization
To encourage specialization, compare each predictor's discrepancy with the target:
$E = \tfrac{1}{M}\sum_m (t - y_m)^2$
Better: weight each discrepancy by the probability of picking that "expert" for the particular training case:
$E = \tfrac{1}{M}\sum_m g_m(x)\,(t - y_m(x))^2$
The gating output is a softmax of $z = Ux$:
$g_m(x) = \exp(z_m(x)) / \sum_{m'} \exp(z_{m'}(x))$

Derivatives of simple cost function
Look at the derivatives to see what the cost function will do:
$E = \tfrac{1}{M}\sum_m g_m(x)\,(t - y_m(x))^2$
For the gating network, the weight on an expert increases whenever its error is less than the average error of the experts:
$\frac{\partial E}{\partial y_m} = -\tfrac{2}{M}\, g_m(x)\,(t - y_m(x))$
$\frac{\partial E}{\partial z_m} = \tfrac{2}{M}\, g_m(x)\left[(t - y_m(x))^2 - E\right]$

Mixture of Experts: Final cost function
Each expert can produce not just a single value estimate, but a distribution. The result is a mixture model:
$p(y \mid \mathrm{MoE}) = \sum_m g_m(x)\, \mathcal{N}(y \mid y_m(x), \Sigma)$
$E = -\log p(t \mid \mathrm{MoE}) = -\log \sum_m g_m(x) \exp\left(-\lVert t - y_m(x)\rVert^2 / 2\right)$
Gradient with respect to the output of an expert:
$\frac{\partial E}{\partial y_m} = -\frac{g_m(x) \exp\left(-\lVert t - y_m(x)\rVert^2/2\right)}{\sum_{m'} g_{m'}(x) \exp\left(-\lVert t - y_{m'}(x)\rVert^2/2\right)}\,(t - y_m(x))$
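A numpy sketch of these quantities for a single input x (my own illustration, not the slides' code): U and W are hypothetical gating and expert parameter matrices, each expert is linear, the target t is a scalar, and unit-variance Gaussians are assumed as in the cost above.

```python
import numpy as np

def moe_cost_and_grad(x, t, U, W):
    """E = -log sum_m g_m(x) exp(-(t - y_m(x))^2 / 2) and dE/dy_m for a scalar target t."""
    z = U @ x                                   # gating pre-activations z = U x
    g = np.exp(z - z.max()); g /= g.sum()       # g_m(x): softmax over experts
    y = W @ x                                   # y_m(x): each (linear) expert's prediction
    lik = g * np.exp(-0.5 * (t - y) ** 2)       # g_m(x) exp(-|t - y_m(x)|^2 / 2)
    E = -np.log(lik.sum())                      # negative log-likelihood of the mixture
    r = lik / lik.sum()                         # posterior responsibility of each expert
    dE_dy = -r * (t - y)                        # gradient w.r.t. each expert's output
    return E, dE_dy
```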

Mixture of Experts: Summary
[Figure: a picture of two imaginary vowels (axes F1, F2) and a mixture of two linear experts after learning, showing the decision boundary of expert 1, the decision boundary of expert 2, and the decision boundary of the gating net; expert 1 is used on one side, expert 2 on the other.]

Ensemble methods differ in training strategy and combination method:
- Parallel training with different (overlapping) training sets, averaging their predictions: bagging
- Sequential training, iteratively re-weighting the training examples so the current classifier focuses on hard examples: boosting
- Parallel training with an objective that encourages division of labor: mixture of experts
Notes: they also differ in the weighting of the components in the final classifier.

1. Parallel training with different training sets: bagging
2. Sequential training, iteratively re-weighting hard examples: boosting