Transcription
Machine Learning and Applied Econometrics
Tree-Based Models
4/23/2019
Machine Learning and Econometrics
Machine Learning and Econometrics

This introductory lecture is based on:
– Kevin P. Murphy, Machine Learning: A Probabilistic Perspective, The MIT Press, 2012.
– Darren Cook, Practical Machine Learning with H2O, O'Reilly Media, Inc., 2017.
– Scott Burger, Introduction to Machine Learning with R: Rigorous Mathematical Analysis, O'Reilly Media, Inc., 2018.
Supervised Machine Learning

Regression-based Methods
– Generalized Linear Models
  - Linear Regression
  - Logistic Regression
– Deep Learning (Neural Nets)

Tree-based Ensemble Methods
– Random Forest (Bagging: Bootstrap Aggregation)
  - Parallel ensemble to reduce variance
– Gradient Boosting Machine (Boosting)
  - Sequential ensemble to reduce bias
Tree-Based Models

Random Forest (Bagging: Bootstrap Aggregation)
– Parallel ensemble to reduce variance

Gradient Boosting Machine (Boosting)
– Sequential ensemble to reduce bias
Trees

[Figure: Classification Tree (left) and Regression Tree (right)]
Random Forest

Random Forest is a bagging (bootstrap aggregation) of trees.
Given a data set, each tree in the forest is a weak learner built on a subset of rows (data observations) and columns (features or variables).
Adding more trees reduces the variance, and the trees can be built in parallel.
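The bagging idea above can be sketched in a few lines of R. This is a toy illustration, not H2O's implementation; it assumes the rpart package (shipped with standard R installations) and uses the built-in mtcars data:

```r
# Toy sketch of bagging (bootstrap aggregation) of regression trees.
library(rpart)
set.seed(1)

n_trees <- 25
trees <- vector("list", n_trees)
for (b in 1:n_trees) {
  rows <- sample(nrow(mtcars), replace = TRUE)   # bootstrap the rows
  trees[[b]] <- rpart(mpg ~ ., data = mtcars[rows, ])
}

# Average the trees' predictions: the bagged estimate has lower variance
# than any single tree. Each iteration is independent of the others, so
# the loop above could run in parallel.
preds <- sapply(trees, predict, newdata = mtcars)
bagged <- rowMeans(preds)
```

A random forest additionally samples a subset of columns when growing each tree (the mtries option in H2O), which further decorrelates the trees.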
Random Forest

[Figure: Random Forest ensemble diagram]
Random Forest Modeling with H2O

Basic Model
– h2o.randomForest(x, y, training_frame, model_id = NULL, seed = -1, ...)

Model Specification Options
– ntrees = 50, max_depth = 20, mtries = -1,
– sample_rate = 0.632,
– sample_rate_per_class = NULL,
– col_sample_rate_change_per_level = 1,
– col_sample_rate_per_tree = 1,
– min_rows = 1, nbins = 20,
– nbins_top_level = 1024, nbins_cats = 1024,
Random Forest Modeling with H2O

Model Specification Options (Continued)
– distribution = c("AUTO", "bernoulli", "multinomial", "gaussian", "poisson", "gamma", "tweedie", "laplace", "quantile", "huber"),
– histogram_type = c("AUTO", "UniformAdaptive", "Random", "QuantilesGlobal", "RoundRobin"),
– checkpoint = NULL,
Random Forest Modeling with H2O

Cross-Validation Parameters
– validation_frame = NULL,
– nfolds = 0, seed = -1,
– keep_cross_validation_models = TRUE,
– keep_cross_validation_predictions = FALSE,
– keep_cross_validation_fold_assignment = FALSE,
– fold_assignment = c("AUTO", "Random", "Modulo", "Stratified"),
– fold_column = NULL,
Random Forest Modeling with H2O

Early Stopping
– stopping_rounds = 0,
– stopping_metric = c("AUTO", "deviance", "logloss", "MSE", "RMSE", "MAE", "RMSLE", "AUC", "lift_top_group", "misclassification", "mean_per_class_error", "custom", "custom_increasing"),
– stopping_tolerance = 0.001,
– max_runtime_secs = 0,
Random Forest Modeling with H2O

Other Important Control Parameters
– balance_classes = FALSE,
– class_sampling_factors = NULL,
– max_after_balance_size = 5,
– max_hit_ratio_k = 0,
– min_split_improvement = 1e-05,
– binomial_double_trees = FALSE,
– col_sample_rate_change_per_level = 1,
– col_sample_rate_per_tree = 1,
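Putting the options above together, a minimal end-to-end sketch (it assumes a local H2O cluster with Java available; the iris data set and the nfolds choice are illustrative, not part of the slides):

```r
# Sketch: random forest with 5-fold cross-validation in H2O.
library(h2o)
h2o.init()  # start a local H2O cluster

train <- as.h2o(iris)  # illustrative data; Species is the response

rf <- h2o.randomForest(
  x = setdiff(names(train), "Species"),  # predictor columns
  y = "Species",                         # response column
  training_frame = train,
  ntrees = 50,     # default; more trees lower the variance
  max_depth = 20,  # default maximum tree depth
  nfolds = 5,      # 5-fold cross-validation
  seed = 1         # reproducibility
)

h2o.performance(rf, xval = TRUE)  # cross-validated performance
```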
Gradient Boosting Machine

Gradient Boosting Machine (GBM) is a forward learning ensemble method. It combines gradient-based optimization and boosting.
– Gradient-based optimization uses gradient computations to minimize a model's loss function in terms of the training data.
– Boosting additively collects an ensemble of weak models to create a robust learning system for predictive tasks.
Boosting

[Figure: Boosting diagram]
Gradient Boosting Machine

[Figure: Gradient Boosting Machine diagram]
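The additive idea can be sketched in base R. This is a deliberately crude toy (a single fixed-split "stump" fit to residuals at each stage), not H2O's implementation; all names and values are illustrative:

```r
# Minimal sketch of boosting for regression: each stage fits a weak
# learner to the current residuals and adds a shrunken copy of its
# predictions to the running ensemble.
set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.2)

learn_rate <- 0.1                  # shrinkage, as in H2O's learn_rate
pred <- rep(mean(y), length(y))    # start from the constant model
for (m in 1:100) {
  resid <- y - pred                # pseudo-residuals for squared loss
  split <- median(x)               # crude stump: one split at the median
  left  <- x < split
  fit   <- ifelse(left, mean(resid[left]), mean(resid[!left]))
  pred  <- pred + learn_rate * fit # additive (sequential) update
}
mean((y - pred)^2)                 # training MSE after 100 stages
```

Because each stage depends on the residuals of the previous ones, boosting is inherently sequential, which is why it reduces bias rather than variance.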
Gradient Boosting with H2O

Basic Model
– h2o.gbm(x, y, training_frame, model_id = NULL, seed = -1, ...)

Model Specification Options
– ntrees = 50, max_depth = 5, min_rows = 10,
– nbins = 20, nbins_top_level = 1024, nbins_cats = 1024,
– learn_rate = 0.1, learn_rate_annealing = 1,
– sample_rate = 1, sample_rate_per_class = NULL,
– col_sample_rate = 1,
– col_sample_rate_change_per_level = 1,
– col_sample_rate_per_tree = 1,
– max_abs_leafnode_pred = Inf,
Gradient Boosting with H2O

Model Specification Options (Continued)
– distribution = c("AUTO", "bernoulli", "quasibinomial", "multinomial", "gaussian", "poisson", "gamma", "tweedie", "laplace", "quantile", "huber"),
– quantile_alpha = 0.5,
– tweedie_power = 1.5,
– huber_alpha = 0.9,
– checkpoint = NULL,
Gradient Boosting with H2O

Cross-Validation Parameters
– validation_frame = NULL,
– nfolds = 0, seed = -1,
– keep_cross_validation_models = TRUE,
– keep_cross_validation_predictions = FALSE,
– keep_cross_validation_fold_assignment = FALSE,
– fold_assignment = c("AUTO", "Random", "Modulo", "Stratified"),
– fold_column = NULL,
Gradient Boosting with H2O

Early Stopping
– stopping_rounds = 0,
– stopping_metric = c("AUTO", "deviance", "logloss", "MSE", "RMSE", "MAE", "RMSLE", "AUC", "lift_top_group", "misclassification", "mean_per_class_error", "custom", "custom_increasing"),
– stopping_tolerance = 0.001,
– max_runtime_secs = 0,
Gradient Boosting with H2O

Other Important Control Parameters
– min_split_improvement = 1e-05,
– histogram_type = c("AUTO", "UniformAdaptive", "Random", "QuantilesGlobal", "RoundRobin")
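A minimal end-to-end sketch using the options above (it assumes a local H2O cluster with Java available; the mtcars data set and the specific parameter values are illustrative, not part of the slides):

```r
# Sketch: H2O gradient boosting with cross-validation and early stopping.
library(h2o)
h2o.init()  # start a local H2O cluster

train <- as.h2o(mtcars)  # illustrative data; mpg is the response

gbm <- h2o.gbm(
  x = setdiff(names(train), "mpg"),
  y = "mpg",
  training_frame = train,
  ntrees = 500,               # upper bound; early stopping may use fewer
  max_depth = 5,              # default
  learn_rate = 0.1,           # default shrinkage
  nfolds = 5,                 # cross-validation
  stopping_rounds = 3,        # stop when the metric stops improving
  stopping_metric = "RMSE",
  stopping_tolerance = 0.001, # default
  seed = 1
)

h2o.performance(gbm, xval = TRUE)  # cross-validated performance
```

A common design choice is to pair a smaller learn_rate with a larger ntrees and rely on early stopping; learn_rate_annealing can additionally decay the shrinkage from tree to tree.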