AN INTRODUCTION TO MACHINE LEARNING


MICHAEL CLARK
CENTER FOR SOCIAL RESEARCH
UNIVERSITY OF NOTRE DAME

AN INTRODUCTION TO MACHINE LEARNING
WITH APPLICATIONS IN R

Contents

Preface

Introduction: Explanation & Prediction
    Some Terminology

Tools You Already Have
    The Standard Linear Model
    Logistic Regression
    Expansions of Those Tools
        Generalized Linear Models
        Generalized Additive Models

The Loss Function
    Continuous Outcomes
        Squared Error
        Absolute Error
        Negative Log-likelihood
        R Example
    Categorical Outcomes
        Misclassification
        Binomial log-likelihood
        Exponential
        Hinge Loss

Regularization
    R Example

Bias-Variance Tradeoff
    Bias & Variance
    The Tradeoff
    Diagnosing Bias-Variance Issues & Possible Solutions
        Worst Case Scenario
        High Variance
        High Bias

Cross-Validation
    Adding Another Validation Set
    K-fold Cross-Validation
    Leave-one-out Cross-Validation
    Bootstrap
    Other Stuff

Model Assessment & Selection
    Beyond Classification Accuracy: Other Measures of Performance

Process Overview
    Data Preparation
        Define Data and Data Partitions
        Feature Scaling
        Feature Engineering
        Discretization
    Model Selection
    Model Assessment

Opening the Black Box
    The Dataset
    R Implementation
    Feature Selection & The Data Partition
    k-nearest Neighbors
        Strengths & Weaknesses
    Neural Nets
        Strengths & Weaknesses
    Trees & Forests
        Strengths & Weaknesses
    Support Vector Machines
        Strengths & Weaknesses

Other
    Unsupervised Learning
        Clustering
        Latent Variable Models
        Graphical Structures
    Stacking
    Feature Selection & Importance
    Textual Analysis
    Bayesian Approaches
    More Stuff

Summary
    Cautionary Notes
    Some Guidelines
    Conclusion

Brief Glossary of Common Terms

Preface

The purpose of this document is to provide a conceptual introduction to statistical or machine learning (ML) techniques for those that might not normally be exposed to such approaches during their typical required statistical training[1]. Machine learning[2] can be described as a form of statistics, often utilizing well-known and familiar techniques, that has a bit of a different focus than traditional analytical practice in the social sciences and other disciplines. The key notion is that flexible, automatic approaches are used to detect patterns within the data, with a primary focus on making predictions on future data.

If one surveys the number of techniques available in ML without context, it will surely be overwhelming in terms of the sheer number of those approaches, and also the various tweaks and variations of them. However, the specifics of the techniques are not as important as the more general concepts that would be applicable in most every ML setting, and indeed in many traditional ones as well. While there will be examples using the R statistical environment and descriptions of a few specific approaches, the focus here is more on ideas than application[3], and kept at the conceptual level as much as possible. However, some applied examples of more common techniques will be provided in detail.

As for prerequisite knowledge, I will assume a basic familiarity with regression analyses typically presented to those in applied disciplines, particularly those of the social sciences. Regarding programming, one should be at least somewhat familiar with using R and RStudio, and either of my introductions here and here will be plenty. Note that I won't do as much explaining of the R code as in those introductions, and in some cases I will be more concerned with getting to a result than clearly detailing the path to it. Armed with such introductory knowledge as can be found in those documents, if there are parts of R code that are unclear, one would have the tools to investigate and discover the details for oneself, which results in more learning anyway.

[1] I generally have in mind social science researchers, but hopefully keep things general enough for other disciplines.
[2] Also referred to as applied statistical learning, statistical engineering, data science, or data mining in other contexts.
[3] Indeed, there is evidence that with large enough samples many techniques converge to similar performance.

The latest version of this document is dated May 2, 2013 (original March 2013).

Introduction: Explanation & Prediction

For any particular analysis conducted, emphasis can be placed on understanding the underlying mechanisms, which have specific theoretical underpinnings, versus a focus that dwells more on performance and, more to the point, future performance. These are not mutually exclusive goals in the least, and probably most studies contain a little of both in some form or fashion. I will refer to the former emphasis as that of explanation, and the latter as that of prediction.

In studies with a more explanatory focus, analysis traditionally concerns a single data set. For example, one assumes a data generating distribution for the response, and one evaluates the overall fit of a single model to the data at hand, e.g. in terms of R-squared, and statistical significance for the various predictors in the model. One assesses how well the model lines up with the theory that led to the analysis, and modifies it accordingly, if need be, for future studies to consider. Some studies may look at predictions for specific, possibly hypothetical values of the predictors, or examine the particular nature of individual predictor effects. In many cases, only a single model is considered. In general though, little attempt is made to explicitly understand how well the model will do with future data, but we hope to have gained greater insight as to the underlying mechanisms guiding the response of interest. Following Breiman (2001), this would be more akin to the data modeling culture.

For the other type of study, focused on prediction, newer techniques are available that are far more focused on performance, not only for the current data under examination but for future data the selected model might be applied to. While still possible, relative predictor importance is less of an issue, and oftentimes there may be no particular theory to drive the analysis. There may be thousands of input variables, such that no simple summary would likely be possible anyway. However, many of the techniques applied in such analyses are quite powerful, and steps are taken to ensure better results for new data. Again referencing Breiman (2001), this perspective is more of the algorithmic modeling culture.

While the two approaches are not exclusive, I present two extreme views of the situation:

    To paraphrase provocatively, 'machine learning is statistics minus any checking of models and assumptions'. (Brian Ripley, 2004)

    the focus in the statistical community on data models has:
    - Led to irrelevant theory and questionable scientific conclusions.

    - Kept statisticians from using more suitable algorithmic models.
    - Prevented statisticians from working on exciting new problems. (Leo Breiman, 2001)

Respective departments of computer science and statistics now overlap more than ever, as more relaxed views seem to prevail today, but there are potential drawbacks to placing too much emphasis on either approach historically associated with them. Models that 'just work' have the potential to be dangerous if they are little understood. Situations in which much time is spent sorting out the details of an ill-fitting model suffer the converse problem: some (though often perhaps very little actual) understanding, with little pragmatism. While this paper will focus on more algorithmic approaches, guidance will be provided with an eye toward their use in situations where the typical data modeling approach would be applied, thereby hopefully shedding some light on a path toward obtaining the best of both worlds.

Some Terminology

For those used to statistical concepts such as dependent variables, clustering, and predictors, you will have to get used to some differences in terminology[4], such as targets, unsupervised learning, and inputs. This doesn't take too much, even if it is somewhat annoying when one is first starting out. I won't be too beholden to either set of terms in this paper, and it should be clear from the context what's being referred to. Initially I will start off mostly with non-ML terms and note the ML version in brackets to help the orientation along.

[4] See this for a comparison.

Tools You Already Have

One thing that is important to keep in mind as you begin is that standard techniques are still available, although we might tweak them or do more with them. So having a basic background in statistics is all that is required to get started with machine learning.

The Standard Linear Model

All introductory statistics courses will cover linear regression in great detail, and it certainly can serve as a starting point here. We can describe it as follows in matrix notation:

    y ∼ N(µ, σ²)
    µ = Xβ

Where y is a normally distributed vector of responses [target] with mean µ and constant variance σ². X is a typical model matrix, i.e. a matrix of predictor variables in which the first column is a vector of 1s for the intercept [bias[5]], and β is the vector of coefficients [weights] corresponding to the intercept and predictors in the model.

[5] Yes, you will see 'bias' refer to an intercept, and also mean something entirely different in our discussion of bias vs. variance.

What might be given less focus in applied courses, however, is how often it won't be the best tool for the job, or even applicable in the form it is presented. Because of this, many applied researchers are still hammering screws with it, even as the explosion of statistical techniques of the past quarter century has rendered obsolete many current introductory statistical texts written for specific disciplines. Even so, the concepts one gains in learning the standard linear model are generalizable, and even a few modifications of it, while still maintaining the basic design, can render it very effective in situations where it is appropriate.

Typically in fitting [learning] a model we tend to talk about R-squared and statistical significance of the coefficients for a small number of predictors. For our purposes, let the focus instead be on the residual sum of squares[6], with an eye toward its reduction and model comparison. We will not have a situation in which we are only considering one model fit, and so must find one that reduces the sum of the squared errors, but without unnecessary complexity and overfitting, concepts we'll return to later. Furthermore, we will be much more concerned with the model fit on new data [generalization].
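As a concrete sketch of the above, the standard linear model can be fit in R with the built-in lm() function. The mtcars data set that ships with R, and the particular variables used, are arbitrary choices for illustration only; they are not examples from this document:

```r
# Fit a standard linear model: mpg as the response [target],
# wt and hp as predictors [inputs]
# (mtcars and the variable choices are illustrative only)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# The estimated coefficients [weights], including the intercept [bias]
coef(fit)

# The residual sum of squares, the measure of fit emphasized here
rss <- sum(residuals(fit)^2)
rss
```

With several candidate models, the one with the lower residual sum of squares fits the current data better, though as discussed later, that reduction has to be balanced against complexity and overfitting.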
[6] ∑(y − f(x))², where f(x) is a function of the model predictors, and in this context a linear combination of them (Xβ).

Logistic Regression

Logistic regression is often used where the response is categorical in nature, usually with a binary outcome in which some event occurs or does not occur [label]. One could still use the standard linear model here, but you could end up with nonsensical predictions that fall outside the 0-1 range for the probability of the event occurring, to go along with other shortcomings. Furthermore, no more effort is required, nor is any understanding lost, in using a logistic regression over the linear probability model. It is also good to keep logistic regression in mind as we discuss other classification approaches later on.

Logistic regression is also typically covered in an introduction to statistics for applied disciplines because of the pervasiveness of binary responses, or responses that have been made as such[7]. Like the standard linear model, just a few modifications can enable one to use it to provide better performance, particularly with new data. The gist is, it is not the case that we have to abandon familiar tools in the move toward a machine learning perspective.

[7] It is generally a bad idea to discretize continuous variables, especially the dependent variable. However, contextual issues, e.g. disease diagnosis, might warrant it.
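As a minimal sketch of the point about predictions, base R's glm() function fits a logistic regression via family = binomial. Again, the mtcars data and the variables chosen are arbitrary illustrations, not examples from this document:

```r
# Logistic regression: am (0/1 transmission type) as the binary outcome
# [label], wt as the predictor (variable choices are illustrative only)
fit_logit <- glm(am ~ wt, data = mtcars, family = binomial)

# Predicted probabilities; unlike the linear probability model,
# these necessarily fall within the 0-1 range
probs <- predict(fit_logit, type = "response")
range(probs)
```

Fitting a linear probability model to the same data with lm() and comparing the two sets of predictions makes the difference plain: only the logistic fit keeps every prediction in the unit interval.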

Expansions of Those Tools

Generalized Linear Models

To begin, logistic regression is a generalized linear model, assuming a binomial distribution for the response and with a logit link function, as follows:

    y ∼ Bin(µ, size = 1)
    η = g(µ)
    η = Xβ

This is the same presentation format as seen with the standard linear model presented before, except now we have a link function g(.) and so are dealing with a transformed response. In the case of the standard linear model, the distribution assumed is the gaussian and the link function is the identity link, i.e. no transformation is made. The link function used will depend on the analysis performed, and while there is choice in the matter, the distributions used have a typical, or canonical, link function[8].

Generalized linear models expand the standard linear model, which is a special case of the generalized linear model, beyond the gaussian distribution for the response, and allow for better fitting models of categorical, count, and skewed response variables. We also have a counterpart to the residual sum of squares, though we'll now refer to it as the deviance.

Generalized Additive Models

Additive models extend the generalized linear model to incorporate nonlinear relationships of predictors to the response. We might note it as follows:

    y ∼ family(µ, .)
    η = g(µ)
    η = Xβ + f(X)

So we have the generalized linear model, but also smooth functions f(X) of one or more predictors. More detail can be found in Wood (2006), and I provide an introduction here.

Things do start to get fuzzy with GAMs. It becomes more difficult to obtain statistical inference for the smoothed terms in the model, and the nonlinearity does not always lend itself to easy interpretation. However, really this just means that we have a little more work to do to get the desired level of understanding. GAMs can be seen as a segue toward more black box/algorithmic techniques. Compared to some of those techniques in machine
