Transcription
Machine Learning and Applied Econometrics
Tree-Based Models
4/23/2019
Machine Learning and Econometrics
Machine Learning and Econometrics

This introductory lecture is based on:
– Kevin P. Murphy, Machine Learning: A Probabilistic Perspective, The MIT Press, 2012.
– Darren Cook, Practical Machine Learning with H2O, O'Reilly Media, Inc., 2017.
– Scott Burger, Introduction to Machine Learning with R: Rigorous Mathematical Analysis, O'Reilly Media, Inc., 2018.
Supervised Machine Learning

Regression-based Methods
– Generalized Linear Models
  - Linear Regression
  - Logistic Regression
– Deep Learning (Neural Nets)

Tree-based Ensemble Methods
– Random Forest (Bagging: Bootstrap Aggregation)
  - Parallel ensemble to reduce variance
– Gradient Boosting Machine (Boosting)
  - Sequential ensemble to reduce bias
Tree-Based Models

Random Forest (Bagging: Bootstrap Aggregation)
– Parallel ensemble to reduce variance

Gradient Boosting Machine (Boosting)
– Sequential ensemble to reduce bias
Trees

[Figure: Classification Tree (left) and Regression Tree (right)]
Random Forest

Random Forest is a bagging (bootstrap aggregation) of trees.
Given a data set, each tree in the forest is a weak learner built on a subset of rows (data observations) and columns (features or variables).
Adding more trees reduces the variance, and the trees can be built in parallel.
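The bagging idea above can be sketched in a few lines of R. This is a toy illustration, not H2O's implementation; it assumes the rpart package (shipped with standard R installations) and uses the built-in mtcars data:

```r
# Toy sketch of bagging (bootstrap aggregation) of regression trees.
library(rpart)
set.seed(1)

n_trees <- 25
trees <- vector("list", n_trees)
for (b in 1:n_trees) {
  rows <- sample(nrow(mtcars), replace = TRUE)   # bootstrap the rows
  trees[[b]] <- rpart(mpg ~ ., data = mtcars[rows, ])
}

# Average the trees' predictions: the bagged estimate has lower variance
# than any single tree. Each iteration is independent of the others, so
# the loop above could run in parallel.
preds <- sapply(trees, predict, newdata = mtcars)
bagged <- rowMeans(preds)
```

A random forest additionally samples a subset of columns when growing each tree (the mtries option in H2O), which further decorrelates the trees.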
Random Forest

[Figure: Random Forest ensemble diagram]
Random Forest Modeling with H2O

Basic Model
– h2o.randomForest(x, y, training_frame, model_id = NULL, seed = -1, ...)

Model Specification Options
– ntrees = 50, max_depth = 20, mtries = -1,
– sample_rate = 0.632,
– sample_rate_per_class = NULL,
– col_sample_rate_change_per_level = 1,
– col_sample_rate_per_tree = 1,
– min_rows = 1, nbins = 20,
– nbins_top_level = 1024, nbins_cats = 1024,
Random Forest Modeling with H2O

Model Specification Options (Continued)
– distribution = c("AUTO", "bernoulli", "multinomial", "gaussian", "poisson", "gamma", "tweedie", "laplace", "quantile", "huber"),
– histogram_type = c("AUTO", "UniformAdaptive", "Random", "QuantilesGlobal", "RoundRobin"),
– checkpoint = NULL,
Random Forest Modeling with H2O

Cross-Validation Parameters
– validation_frame = NULL,
– nfolds = 0, seed = -1,
– keep_cross_validation_models = TRUE,
– keep_cross_validation_predictions = FALSE,
– keep_cross_validation_fold_assignment = FALSE,
– fold_assignment = c("AUTO", "Random", "Modulo", "Stratified"),
– fold_column = NULL,
Random Forest Modeling with H2O

Early Stopping
– stopping_rounds = 0,
– stopping_metric = c("AUTO", "deviance", "logloss", "MSE", "RMSE", "MAE", "RMSLE", "AUC", "lift_top_group", "misclassification", "mean_per_class_error", "custom", "custom_increasing"),
– stopping_tolerance = 0.001,
– max_runtime_secs = 0,
Random Forest Modeling with H2O

Other Important Control Parameters
– balance_classes = FALSE,
– class_sampling_factors = NULL,
– max_after_balance_size = 5,
– max_hit_ratio_k = 0,
– min_split_improvement = 1e-05,
– binomial_double_trees = FALSE,
– col_sample_rate_change_per_level = 1,
– col_sample_rate_per_tree = 1,
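Putting the options above together, a minimal end-to-end sketch (it assumes a local H2O cluster with Java available; the iris data set and the nfolds choice are illustrative, not part of the slides):

```r
# Sketch: random forest with 5-fold cross-validation in H2O.
library(h2o)
h2o.init()  # start a local H2O cluster

train <- as.h2o(iris)  # illustrative data; Species is the response

rf <- h2o.randomForest(
  x = setdiff(names(train), "Species"),  # predictor columns
  y = "Species",                         # response column
  training_frame = train,
  ntrees = 50,     # default; more trees lower the variance
  max_depth = 20,  # default maximum tree depth
  nfolds = 5,      # 5-fold cross-validation
  seed = 1         # reproducibility
)

h2o.performance(rf, xval = TRUE)  # cross-validated performance
```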
Gradient Boosting Machine

Gradient Boosting Machine (GBM) is a forward learning ensemble method. It combines gradient-based optimization and boosting.
– Gradient-based optimization uses gradient computations to minimize a model's loss function in terms of the training data.
– Boosting additively collects an ensemble of weak models to create a robust learning system for predictive tasks.
Boosting

[Figure: Boosting diagram]
Gradient Boosting Machine

[Figure: Gradient Boosting Machine diagram]
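The additive idea can be sketched in base R. This is a deliberately crude toy (a single fixed-split "stump" fit to residuals at each stage), not H2O's implementation; all names and values are illustrative:

```r
# Minimal sketch of boosting for regression: each stage fits a weak
# learner to the current residuals and adds a shrunken copy of its
# predictions to the running ensemble.
set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.2)

learn_rate <- 0.1                  # shrinkage, as in H2O's learn_rate
pred <- rep(mean(y), length(y))    # start from the constant model
for (m in 1:100) {
  resid <- y - pred                # pseudo-residuals for squared loss
  split <- median(x)               # crude stump: one split at the median
  left  <- x < split
  fit   <- ifelse(left, mean(resid[left]), mean(resid[!left]))
  pred  <- pred + learn_rate * fit # additive (sequential) update
}
mean((y - pred)^2)                 # training MSE after 100 stages
```

Because each stage depends on the residuals of the previous ones, boosting is inherently sequential, which is why it reduces bias rather than variance.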
Gradient Boosting with H2O

Basic Model
– h2o.gbm(x, y, training_frame, model_id = NULL, seed = -1, ...)

Model Specification Options
– ntrees = 50, max_depth = 5, min_rows = 10,
– nbins = 20, nbins_top_level = 1024, nbins_cats = 1024,
– learn_rate = 0.1, learn_rate_annealing = 1,
– sample_rate = 1, sample_rate_per_class = NULL,
– col_sample_rate = 1,
– col_sample_rate_change_per_level = 1,
– col_sample_rate_per_tree = 1,
– max_abs_leafnode_pred = Inf,
Gradient Boosting with H2O

Model Specification Options (Continued)
– distribution = c("AUTO", "bernoulli", "quasibinomial", "multinomial", "gaussian", "poisson", "gamma", "tweedie", "laplace", "quantile", "huber"),
– quantile_alpha = 0.5,
– tweedie_power = 1.5,
– huber_alpha = 0.9,
– checkpoint = NULL,
Gradient Boosting with H2O

Cross-Validation Parameters
– validation_frame = NULL,
– nfolds = 0, seed = -1,
– keep_cross_validation_models = TRUE,
– keep_cross_validation_predictions = FALSE,
– keep_cross_validation_fold_assignment = FALSE,
– fold_assignment = c("AUTO", "Random", "Modulo", "Stratified"),
– fold_column = NULL,
Gradient Boosting with H2O

Early Stopping
– stopping_rounds = 0,
– stopping_metric = c("AUTO", "deviance", "logloss", "MSE", "RMSE", "MAE", "RMSLE", "AUC", "lift_top_group", "misclassification", "mean_per_class_error", "custom", "custom_increasing"),
– stopping_tolerance = 0.001,
– max_runtime_secs = 0,
Gradient Boosting with H2O

Other Important Control Parameters
– min_split_improvement = 1e-05,
– histogram_type = c("AUTO", "UniformAdaptive", "Random", "QuantilesGlobal", "RoundRobin")
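A minimal end-to-end sketch using the options above (it assumes a local H2O cluster with Java available; the mtcars data set and the specific parameter values are illustrative, not part of the slides):

```r
# Sketch: H2O gradient boosting with cross-validation and early stopping.
library(h2o)
h2o.init()  # start a local H2O cluster

train <- as.h2o(mtcars)  # illustrative data; mpg is the response

gbm <- h2o.gbm(
  x = setdiff(names(train), "mpg"),
  y = "mpg",
  training_frame = train,
  ntrees = 500,               # upper bound; early stopping may use fewer
  max_depth = 5,              # default
  learn_rate = 0.1,           # default shrinkage
  nfolds = 5,                 # cross-validation
  stopping_rounds = 3,        # stop when the metric stops improving
  stopping_metric = "RMSE",
  stopping_tolerance = 0.001, # default
  seed = 1
)

h2o.performance(gbm, xval = TRUE)  # cross-validated performance
```

A common design choice is to pair a smaller learn_rate with a larger ntrees and rely on early stopping; learn_rate_annealing can additionally decay the shrinkage from tree to tree.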