Introduction To Machine Learning - UMIACS

Transcription

Introduction to Machine Learning
Computational Linguistics: Jordan Boyd-Graber, University of Maryland
LOGISTIC REGRESSION FROM TEXT
Slides adapted from Emily Fox

Reminder: Logistic Regression

  P(Y = 0 | X) = \frac{1}{1 + \exp\left(\beta_0 + \sum_i \beta_i X_i\right)}    (1)

  P(Y = 1 | X) = \frac{\exp\left(\beta_0 + \sum_i \beta_i X_i\right)}{1 + \exp\left(\beta_0 + \sum_i \beta_i X_i\right)}    (2)

- Discriminative prediction: p(y | x)
- Classification uses: ad placement, spam detection
- What we didn't talk about is how to learn \beta from data
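
A minimal sketch (an editorial addition, not from the slides) of equation (2) in Python with NumPy; the name predict_prob and the parameter and feature values are made up for illustration.

```python
import numpy as np

def predict_prob(beta0, beta, x):
    """P(Y = 1 | x) from equation (2); P(Y = 0 | x) in equation (1) is one minus this."""
    z = beta0 + np.dot(beta, x)
    return 1.0 / (1.0 + np.exp(-z))   # algebraically equal to exp(z) / (1 + exp(z))

beta0, beta = -1.0, np.array([2.0, -0.5])   # made-up parameters
x = np.array([1.0, 3.0])                    # made-up feature vector
print(predict_prob(beta0, beta, x), 1.0 - predict_prob(beta0, beta, x))
```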

Logistic Regression: Objective Function

  \mathcal{L} = \ln p(Y | X, \beta) = \sum_j \ln p(y^{(j)} | x^{(j)}, \beta)    (3)

    = \sum_j \left[ y^{(j)} \left( \beta_0 + \sum_i \beta_i x_i^{(j)} \right) - \ln\left( 1 + \exp\left( \beta_0 + \sum_i \beta_i x_i^{(j)} \right) \right) \right]    (4)

Training data (y, x) are fixed. Objective function is a function of \beta . . . what values of \beta give a good value?
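
As a sketch (not from the slides), equation (4) computed over a toy dataset; the function name, the toy data, and the parameter values are assumptions for illustration.

```python
import numpy as np

def log_likelihood(beta0, beta, X, y):
    """Equation (4): sum_j [ y_j * z_j - ln(1 + exp(z_j)) ] with z_j = beta0 + beta . x_j."""
    z = beta0 + X @ beta
    return np.sum(y * z - np.log1p(np.exp(z)))

X = np.array([[1.0, 0.0], [0.5, 1.5], [2.0, 1.0]])   # three toy examples, two features
y = np.array([1, 0, 1])
print(log_likelihood(-0.5, np.array([1.0, -1.0]), X, y))
```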

Convexity

- Convex function
- Doesn't matter where you start, if you "walk up" the objective
- Gradient!

Gradient Descent (non-convex)

Goal: Optimize log likelihood with respect to variables \beta

  \beta_{ij}^{(l+1)} = \beta_{ij}^{(l)} + \alpha \frac{\partial J}{\partial \beta_{ij}^{(l)}}

[Figure: objective plotted against parameter value, with successive gradient steps climbing toward an optimum]

Luckily, (vanilla) logistic regression is convex.

Gradient for Logistic Regression

To ease notation, let's define

  \pi_i = \frac{\exp(\beta^T x_i)}{1 + \exp(\beta^T x_i)}    (5)

Our objective function is

  \mathcal{L} = \sum_i \log p(y_i | x_i) = \sum_i \mathcal{L}_i = \sum_i \begin{cases} \log \pi_i & \text{if } y_i = 1 \\ \log(1 - \pi_i) & \text{if } y_i = 0 \end{cases}    (6)

Taking the Derivative

Apply the chain rule:

  \frac{\partial \mathcal{L}}{\partial \beta_j} = \sum_i \frac{\partial \mathcal{L}_i(\beta)}{\partial \beta_j} = \sum_i \begin{cases} \frac{1}{\pi_i} \frac{\partial \pi_i}{\partial \beta_j} & \text{if } y_i = 1 \\ \frac{-1}{1 - \pi_i} \frac{\partial \pi_i}{\partial \beta_j} & \text{if } y_i = 0 \end{cases}    (7)

If we plug in the derivative

  \frac{\partial \pi_i}{\partial \beta_j} = \pi_i (1 - \pi_i) x_j,    (8)

we can merge these two cases:

  \frac{\partial \mathcal{L}_i}{\partial \beta_j} = (y_i - \pi_i) x_j    (9)
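
A small sketch (an editorial addition, not from the slides) that checks the merged form in equation (9) against a finite-difference approximation of equation (7) for one example; all names and values here are made up, and the intercept is folded into \beta.

```python
import numpy as np

def pi(beta, x):
    """Equation (5), with the intercept folded into beta."""
    return 1.0 / (1.0 + np.exp(-np.dot(beta, x)))

def loglik_i(beta, x, y):
    """The two cases of equation (6) for a single example."""
    p = pi(beta, x)
    return np.log(p) if y == 1 else np.log(1.0 - p)

beta, x, y, j = np.array([0.3, -0.7]), np.array([1.0, 2.0]), 1, 0
analytic = (y - pi(beta, x)) * x[j]          # equation (9)
eps = 1e-6
bumped = beta.copy()
bumped[j] += eps
numeric = (loglik_i(bumped, x, y) - loglik_i(beta, x, y)) / eps   # finite-difference check
print(analytic, numeric)    # the two values should agree to several decimal places
```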

Gradient for Logistic Regression

Gradient

  \nabla_\beta \mathcal{L}(\beta) = \left[ \frac{\partial \mathcal{L}(\beta)}{\partial \beta_0}, \ldots, \frac{\partial \mathcal{L}(\beta)}{\partial \beta_n} \right]    (10)

Update

  \Delta\beta \equiv \eta \nabla_\beta \mathcal{L}(\beta)    (11)

  \beta_i' = \beta_i + \eta \frac{\partial \mathcal{L}(\beta)}{\partial \beta_i}    (12)

- Why are we adding? What would we do if we wanted to do descent?
- \eta: step size, must be greater than zero
- NB: Conjugate gradient is usually better, but harder to implement
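
A minimal sketch (not from the slides) of the full-batch ascent update in equation (12), using the per-example gradient from equation (9); the step size eta = 0.1, the toy data, and the function name are assumptions.

```python
import numpy as np

def gradient_ascent_step(beta, X, y, eta=0.1):
    """One application of equation (12): beta_i' = beta_i + eta * dL/dbeta_i."""
    pi = 1.0 / (1.0 + np.exp(-(X @ beta)))   # pi_i for every training example
    grad = X.T @ (y - pi)                    # gradient from equation (9), summed over examples
    return beta + eta * grad                 # adding: we are maximizing the log likelihood

X = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]])   # first column serves as an intercept
y = np.array([1, 0, 1])
beta = np.zeros(2)
for _ in range(100):                         # here: a fixed number of steps for illustration
    beta = gradient_ascent_step(beta, X, y)
print(beta)
```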

Choosing Step Size

[Figure: objective plotted against parameter value, illustrating how different step sizes affect the path toward the optimum]

Remaining issues

- When to stop?
- What if \beta keeps getting bigger?

Regularized Conditional Log Likelihood

Unregularized

  \beta^* = \arg\max_\beta \sum_j \ln p(y^{(j)} | x^{(j)}, \beta)    (13)

Regularized

  \beta^* = \arg\max_\beta \sum_j \ln p(y^{(j)} | x^{(j)}, \beta) - \mu \sum_i \beta_i^2    (14)

\mu is a "regularization" parameter that trades off between likelihood and having small parameters.
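
A sketch (not from the slides) of the regularized objective in equation (14); the function name, the value mu = 0.1, and the toy data are assumptions.

```python
import numpy as np

def regularized_objective(beta0, beta, X, y, mu=0.1):
    """Equation (14): log likelihood minus mu * sum_i beta_i^2."""
    z = beta0 + X @ beta
    return np.sum(y * z - np.log1p(np.exp(z))) - mu * np.sum(beta ** 2)

X = np.array([[1.0, 0.0], [0.5, 1.5], [2.0, 1.0]])
y = np.array([1, 0, 1])
print(regularized_objective(-0.5, np.array([1.0, -1.0]), X, y))
```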

Approximating the Gradient

- Our datasets are too big (to fit into memory)
- . . . or data are changing / streaming
- Hard to compute the true gradient

  \nabla \mathcal{L}(\beta) = \mathbb{E}_x\left[ \nabla \mathcal{L}(\beta, x) \right]    (15)

- Average over all observations
- What if we compute an update just from one observation?

Getting to Union Station

Pretend it's a pre-smartphone world and you want to get to Union Station.

Stochastic Gradient for Logistic Regression

Given a single observation x_i chosen at random from the dataset,

  \beta_j \leftarrow \beta_j' + \eta \left( -\mu \beta_j' + x_{ij} \left[ y_i - \pi_i \right] \right)    (16)

Examples in class.

Stochastic Gradient for Regularized Regression

  \mathcal{L} = \log p(y | x; \beta) - \mu \sum_j \beta_j^2    (17)

Taking the derivative (with respect to example x_i):

  \frac{\partial \mathcal{L}}{\partial \beta_j} = (y_i - \pi_i) x_j - 2\mu\beta_j    (18)

Algorithm

1. Initialize a vector \beta to be all zeros
2. For t = 1, . . . , T
   - For each example (x_i, y_i) and feature j:
     - Compute \pi_i \equiv \Pr(y_i = 1 | x_i)
     - Set \beta[j] = \beta[j]' + \lambda (y_i - \pi_i) x_i
3. Output the parameters \beta_1, . . . , \beta_d.
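
A sketch (an editorial addition, not from the slides) of the algorithm above combined with the regularized gradient of equation (18); the step size lam, the regularization weight mu, the number of passes T, and the toy data are all assumptions, and the intercept is omitted for brevity.

```python
import numpy as np

def sgd_logistic(X, y, T=50, lam=0.1, mu=0.01):
    """Stochastic gradient ascent for regularized logistic regression."""
    n, d = X.shape
    beta = np.zeros(d)                            # step 1: initialize beta to all zeros
    for t in range(T):                            # step 2: T passes over the data
        for i in np.random.permutation(n):        # visit examples in random order
            pi_i = 1.0 / (1.0 + np.exp(-np.dot(beta, X[i])))        # pi_i = Pr(y_i = 1 | x_i)
            beta += lam * ((y[i] - pi_i) * X[i] - 2.0 * mu * beta)  # gradient from equation (18)
    return beta                                   # step 3: output the parameters

X = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0], [1.0, -0.3]])
y = np.array([1, 0, 1, 0])
print(sgd_logistic(X, y))
```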

Proofs about Stochastic Gradient

- Depend on the convexity of the objective and how close \epsilon you want to get to the actual answer
- Best bounds depend on changing \eta over time and per dimension (not all features created equal)
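
The slides do not give a schedule for \eta; one common choice (an assumption on my part, not from the deck) is a step size that decays over time:

```python
def step_size(eta0, t, decay=1.0):
    """One common (assumed) schedule: eta_t = eta0 / (1 + decay * t), decreasing over time."""
    return eta0 / (1.0 + decay * t)

print([round(step_size(0.5, t), 3) for t in range(5)])   # 0.5, 0.25, 0.167, 0.125, 0.1
```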

In class

- Your questions!
- Working through simple example
- Prepared for logistic regression homework
