ML On-Line Learning - New York University

Transcription

Foundations of Machine Learning
On-Line Learning

Mehryar Mohri
Courant Institute and Google Research
mohri@cims.nyu.edu

Motivation

PAC learning:
- distribution fixed over time (training and test).
- IID assumption.

On-line learning:
- no distributional assumption.
- worst-case analysis (adversarial).
- mixed training and test.

Performance measure: mistake model, regret.

This Lecture

- Prediction with expert advice
- Linear classification

General On-Line Setting

For t = 1 to T do
- receive instance $x_t \in X$.
- predict $\hat{y}_t \in Y$.
- receive label $y_t \in Y$.
- incur loss $L(\hat{y}_t, y_t)$.

Classification: $Y = \{0, 1\}$, $L(y, y') = |y' - y|$.
Regression: $Y \subseteq \mathbb{R}$, $L(y, y') = (y' - y)^2$.

Objective: minimize the total loss $\sum_{t=1}^{T} L(\hat{y}_t, y_t)$.
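
As an illustration of the protocol above, here is a minimal Python sketch of the generic on-line loop; the `learner` interface (`predict`/`update`) and the helper names are assumptions made for this example, not part of the slides.

```python
# A minimal sketch of the general on-line protocol (illustrative only).

def online_loop(learner, stream, loss):
    """Run the generic protocol: predict, observe label, incur loss, update."""
    total_loss = 0.0
    for x_t, y_t in stream:           # receive instance, then (later) its label
        y_hat = learner.predict(x_t)  # predict before seeing y_t
        total_loss += loss(y_hat, y_t)
        learner.update(x_t, y_t)      # label revealed; learner adapts
    return total_loss

# Example losses for the two settings mentioned above.
zero_one_loss = lambda y_hat, y: float(y_hat != y)   # classification
squared_loss = lambda y_hat, y: (y_hat - y) ** 2     # regression
```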

Prediction with Expert Advice

For t = 1 to T do
- receive instance $x_t \in X$ and advice $y_{t,i} \in Y$, $i \in [1, N]$.
- predict $\hat{y}_t \in Y$.
- receive label $y_t \in Y$.
- incur loss $L(\hat{y}_t, y_t)$.

Objective: minimize the regret, i.e., the difference between the total loss incurred and that of the best expert:
$\mathrm{Regret}(T) = \sum_{t=1}^{T} L(\hat{y}_t, y_t) - \min_{1 \le i \le N} \sum_{t=1}^{T} L(y_{t,i}, y_t)$.

Mistake Bound Model

Definition: the maximum number of mistakes a learning algorithm L makes to learn a concept c is defined by
$M_L(c) = \max_{x_1, \ldots, x_T} |\mathrm{mistakes}(L, c)|$.

Definition: for any concept class C, the maximum number of mistakes a learning algorithm L makes is
$M_L(C) = \max_{c \in C} M_L(c)$.

A mistake bound is a bound M on $M_L(C)$.

Halving Algorithm
(see Mitchell, 1997)

Halving(H)
  H_1 ← H
  for t ← 1 to T do
      Receive(x_t)
      ŷ_t ← MajorityVote(H_t, x_t)
      Receive(y_t)
      if ŷ_t ≠ y_t then
          H_{t+1} ← {c ∈ H_t : c(x_t) = y_t}
      else H_{t+1} ← H_t
  return H_{T+1}
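
A small Python sketch of the Halving algorithm above, assuming hypotheses are plain callables and that some hypothesis in H is consistent with the labels; the function and variable names are illustrative.

```python
# Halving: keep a version space and predict with its majority vote.

def halving(H, stream):
    """H: list of hypotheses h(x) -> label; stream: iterable of (x_t, y_t)."""
    version_space = list(H)
    mistakes = 0
    for x_t, y_t in stream:
        # Majority vote of the current version space (ties broken arbitrarily).
        votes = [h(x_t) for h in version_space]
        y_hat = max(set(votes), key=votes.count)
        if y_hat != y_t:
            mistakes += 1
            # Keep only the hypotheses consistent with the revealed label.
            version_space = [h for h in version_space if h(x_t) == y_t]
    return version_space, mistakes
```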

Halving Algorithm - Bound
(Littlestone, 1988)

Theorem: let H be a finite hypothesis set. Then
$M_{\mathrm{Halving}}(H) \le \log_2 |H|$.

Proof: at each mistake, the hypothesis set is reduced at least by half.

VC Dimension Lower Bound
(Littlestone, 1988)

Theorem: let opt(H) be the optimal mistake bound for H. Then,
$\mathrm{VCdim}(H) \le \mathrm{opt}(H) \le M_{\mathrm{Halving}}(H) \le \log_2 |H|$.

Proof: for a fully shattered set, form a complete binary tree of the mistakes with height $\mathrm{VCdim}(H)$.

Weighted Majority Algorithm
(Littlestone and Warmuth, 1988)

Here $y_t, y_{t,i} \in \{0, 1\}$ and $\beta \in [0, 1)$.

Weighted-Majority(N experts)
  for i ← 1 to N do
      w_{1,i} ← 1
  for t ← 1 to T do
      Receive(x_t)
      ŷ_t ← 1 if Σ_{i : y_{t,i}=1} w_{t,i} ≥ Σ_{i : y_{t,i}=0} w_{t,i} else 0      ▷ weighted majority vote
      Receive(y_t)
      if ŷ_t ≠ y_t then
          for i ← 1 to N do
              if y_{t,i} ≠ y_t then
                  w_{t+1,i} ← β w_{t,i}
              else w_{t+1,i} ← w_{t,i}
  return w_{T+1}
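
A Python sketch of the Weighted Majority listing above, with binary expert advice in {0, 1} and penalty factor beta in [0, 1); the function and variable names are mine, not from the slides.

```python
# Weighted Majority: penalize wrong experts multiplicatively after each mistake.

def weighted_majority(advice, labels, beta=0.5):
    """advice: T x N lists of expert predictions in {0, 1}; labels: length-T list."""
    N = len(advice[0])
    w = [1.0] * N
    mistakes = 0
    for y_experts, y_t in zip(advice, labels):
        vote_1 = sum(w[i] for i in range(N) if y_experts[i] == 1)
        vote_0 = sum(w[i] for i in range(N) if y_experts[i] == 0)
        y_hat = 1 if vote_1 >= vote_0 else 0
        if y_hat != y_t:
            mistakes += 1
            # Penalize the experts that were wrong on this round.
            w = [w[i] * beta if y_experts[i] != y_t else w[i] for i in range(N)]
    return w, mistakes
```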

Weighted Majority - Bound

Theorem: let $m_t$ be the number of mistakes made by the WM algorithm up to time $t$ and let $m_t^*$ be that of the best expert. Then, for all $t$,
$m_t \le \frac{\log N + m_t^* \log\frac{1}{\beta}}{\log\frac{2}{1+\beta}}$.

- Thus, $m_t \le O(\log N) + \text{constant} \times \text{(mistakes of best expert)}$.
- Realizable case: $m_t \le O(\log N)$.
- Halving algorithm: $\beta = 0$.

Weighted Majority - Proof

Potential: $\Phi_t = \sum_{i=1}^{N} w_{t,i}$.

Upper bound: after each error,
$\Phi_{t+1} \le \big[\tfrac{1}{2} + \tfrac{1}{2}\beta\big] \Phi_t = \tfrac{1+\beta}{2}\, \Phi_t$.
Thus, $\Phi_t \le \big(\tfrac{1+\beta}{2}\big)^{m_t} N$.

Lower bound: for any expert $i$, $\Phi_t \ge w_{t,i} = \beta^{m_{t,i}}$.

Comparison: applying both bounds to the best expert gives $\beta^{m_t^*} \le \big(\tfrac{1+\beta}{2}\big)^{m_t} N$, hence
$m_t^* \log \beta \le \log N + m_t \log \tfrac{1+\beta}{2}$
$\Rightarrow\ m_t \log \tfrac{2}{1+\beta} \le \log N + m_t^* \log \tfrac{1}{\beta}$.

Weighted Majority - Notes

Advantage: remarkable bound requiring no assumption.

Disadvantage: no deterministic algorithm can achieve a regret $R_T = o(T)$ with the binary loss.
- better guarantee with randomized WM.
- better guarantee for WM with convex losses.

Exponential Weighted Average

Algorithm:
- weight update: $w_{t+1,i} \leftarrow w_{t,i}\, e^{-\eta L(y_{t,i}, y_t)} = e^{-\eta L_{t,i}}$, where $L_{t,i}$ is the total loss incurred by expert $i$ up to time $t$.
- prediction: $\hat{y}_t = \frac{\sum_{i=1}^{N} w_{t,i}\, y_{t,i}}{\sum_{i=1}^{N} w_{t,i}}$.

Theorem: assume that $L$ is convex in its first argument and takes values in $[0, 1]$. Then, for any $\eta > 0$ and any sequence $y_1, \ldots, y_T \in Y$, the regret at $T$ satisfies
$\mathrm{Regret}(T) \le \frac{\log N}{\eta} + \frac{\eta T}{8}$.
For $\eta = \sqrt{8 \log N / T}$,
$\mathrm{Regret}(T) \le \sqrt{(T/2) \log N}$.
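
A Python sketch of the exponentially weighted average forecaster above, for losses in [0, 1]; setting eta from a known horizon T follows the theorem, and the argument names are assumptions for the example.

```python
# Exponentially weighted average forecaster over N experts.
import math

def exp_weighted_average(advice, labels, loss, T=None, eta=None):
    """advice: T x N real-valued expert predictions; loss(y_hat, y) in [0, 1]."""
    N = len(advice[0])
    if eta is None:
        eta = math.sqrt(8.0 * math.log(N) / (T or len(advice)))
    cum_loss = [0.0] * N                  # L_{t,i}: cumulative loss of expert i
    total = 0.0
    for y_experts, y_t in zip(advice, labels):
        w = [math.exp(-eta * L) for L in cum_loss]   # weights from past losses
        y_hat = sum(wi * yi for wi, yi in zip(w, y_experts)) / sum(w)
        total += loss(y_hat, y_t)
        for i in range(N):
            cum_loss[i] += loss(y_experts[i], y_t)
    return total - min(cum_loss)          # regret against the best expert
```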

Exponential Weighted Avg - Proof

Potential: $\Phi_t = \log \sum_{i=1}^{N} w_{t,i}$.

Upper bound:
$\Phi_{t+1} - \Phi_t = \log \frac{\sum_{i=1}^{N} w_{t,i}\, e^{-\eta L(y_{t,i}, y_t)}}{\sum_{i=1}^{N} w_{t,i}} = \log \operatorname*{E}_{w_t}\big[e^{-\eta L(y_{t,i}, y_t)}\big]$
$= \log \operatorname*{E}_{w_t}\Big[\exp\Big(-\eta \big(L(y_{t,i}, y_t) - \operatorname*{E}_{w_t}[L(y_{t,i}, y_t)]\big) - \eta \operatorname*{E}_{w_t}[L(y_{t,i}, y_t)]\Big)\Big]$
$\le \frac{\eta^2}{8} - \eta \operatorname*{E}_{w_t}[L(y_{t,i}, y_t)]$   (Hoeffding's inequality)
$\le \frac{\eta^2}{8} - \eta\, L\big(\operatorname*{E}_{w_t}[y_{t,i}], y_t\big)$   (convexity of $L$ in its first argument)
$= \frac{\eta^2}{8} - \eta\, L(\hat{y}_t, y_t)$.

Exponential Weighted Avg - Proof

Upper bound: summing up the inequalities yields
$\Phi_{T+1} - \Phi_1 \le \frac{\eta^2 T}{8} - \eta \sum_{t=1}^{T} L(\hat{y}_t, y_t)$.

Lower bound:
$\Phi_{T+1} - \Phi_1 = \log \sum_{i=1}^{N} e^{-\eta L_{T,i}} - \log N \ge \log \max_{i=1}^{N} e^{-\eta L_{T,i}} - \log N = -\eta \min_{i=1}^{N} L_{T,i} - \log N$.

Comparison: combining the two bounds gives
$-\eta \min_{i=1}^{N} L_{T,i} - \log N \le \frac{\eta^2 T}{8} - \eta \sum_{t=1}^{T} L(\hat{y}_t, y_t)$
$\Rightarrow\ \sum_{t=1}^{T} L(\hat{y}_t, y_t) - \min_{i=1}^{N} L_{T,i} \le \frac{\log N}{\eta} + \frac{\eta T}{8}$.

Exponential Weighted Avg - Notes

Advantage: the bound on the regret per round is of the form $\frac{R_T}{T} = O\Big(\sqrt{\frac{\log N}{T}}\Big)$.

Disadvantage: the choice of $\eta$ requires knowledge of the horizon $T$.

Doubling Trick

Idea: divide time into periods $I_k = [2^k, 2^{k+1} - 1]$ of length $2^k$ with $k = 0, \ldots, n$, $T \ge 2^n - 1$, and choose $\eta_k = \sqrt{\frac{8 \log N}{2^k}}$ in each period.

Theorem: with the same assumptions as before, for any $T$, the following holds:
$\mathrm{Regret}(T) \le \frac{\sqrt{2}}{\sqrt{2} - 1} \sqrt{(T/2) \log N} + \sqrt{\log N / 2}$.
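
A Python sketch of the parameter schedule used by the doubling trick: one value eta_k per period of length 2^k, with the forecaster restarted at the start of each period; the helper name is illustrative.

```python
# Doubling trick schedule: periods [2^k, 2^{k+1} - 1], each with its own eta_k.
import math

def doubling_trick_schedule(T, N):
    """Return (period_start, period_length, eta_k) tuples covering rounds 1..T."""
    schedule = []
    k, start = 0, 1
    while start <= T:
        length = 2 ** k
        eta_k = math.sqrt(8.0 * math.log(N) / length)
        schedule.append((start, length, eta_k))
        start += length
        k += 1
    return schedule
```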

Doubling Trick - Proof

By the previous theorem, for any $I_k = [2^k, 2^{k+1} - 1]$,
$L_{I_k} - \min_{i=1}^{N} L_{I_k,i} \le \sqrt{2^k (\log N)/2}$.

Thus,
$L_T = \sum_{k=0}^{n} L_{I_k} \le \sum_{k=0}^{n} \min_{i=1}^{N} L_{I_k,i} + \sum_{k=0}^{n} \sqrt{2^k (\log N)/2} \le \min_{i=1}^{N} L_{T,i} + \sqrt{(\log N)/2} \sum_{k=0}^{n} 2^{k/2}$,

with
$\sum_{k=0}^{n} 2^{k/2} = \frac{2^{(n+1)/2} - 1}{\sqrt{2} - 1} \le \frac{\sqrt{2}\,\sqrt{T+1} - 1}{\sqrt{2} - 1} \le \frac{\sqrt{2}\,(\sqrt{T} + 1) - 1}{\sqrt{2} - 1} = \frac{\sqrt{2}\,\sqrt{T}}{\sqrt{2} - 1} + 1$.

Notes

The doubling trick is used in a variety of other contexts and proofs.

More general method, with the learning parameter a function of time: $\eta_t = \sqrt{(8 \log N)/t}$. Constant factor improvement:
$\mathrm{Regret}(T) \le 2\sqrt{(T/2) \log N} + \sqrt{(1/8) \log N}$.

This Lecture

- Prediction with expert advice
- Linear classification

Perceptron Algorithm
(Rosenblatt, 1958)

Perceptron(w_0)
  w_1 ← w_0      ▷ typically w_0 = 0
  for t ← 1 to T do
      Receive(x_t)
      ŷ_t ← sgn(w_t · x_t)
      Receive(y_t)
      if ŷ_t ≠ y_t then
          w_{t+1} ← w_t + y_t x_t      ▷ more generally + η y_t x_t, η > 0
      else w_{t+1} ← w_t
  return w_{T+1}
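
A Python sketch of the Perceptron listing above for labels in {-1, +1}, with the optional learning rate eta from the comment in the pseudocode; the sign convention at 0 is a choice made for this example.

```python
# Perceptron: additive update on mistakes only.

def perceptron(stream, dim, eta=1.0):
    """stream: iterable of (x_t, y_t) with x_t a length-dim list, y_t in {-1, +1}."""
    w = [0.0] * dim                       # w_0 = 0
    mistakes = 0
    for x_t, y_t in stream:
        score = sum(wi * xi for wi, xi in zip(w, x_t))
        y_hat = 1 if score >= 0 else -1   # sgn convention: sgn(0) = +1 here
        if y_hat != y_t:
            mistakes += 1
            w = [wi + eta * y_t * xi for wi, xi in zip(w, x_t)]
    return w, mistakes
```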

Separating Hyperplane

[Figure: margin and errors for a separating hyperplane w · x = 0, with margin ρ and signed distances y_i (w · x_i)/‖w‖.]

Perceptron = Stochastic Gradient Descent

Objective function: convex but not differentiable,
$F(w) = \frac{1}{T} \sum_{t=1}^{T} \max\big(0, -y_t (w \cdot x_t)\big) = \operatorname*{E}_{x \sim \widehat{D}}[f(w, x)]$
with $f(w, x) = \max\big(0, -y (w \cdot x)\big)$.

Stochastic gradient: for each $x_t$, the update is
$w_{t+1} \leftarrow \begin{cases} w_t - \eta \nabla_w f(w_t, x_t) & \text{if differentiable} \\ w_t & \text{otherwise,} \end{cases}$
where $\eta > 0$ is a learning rate parameter.

Here:
$w_{t+1} \leftarrow \begin{cases} w_t + \eta\, y_t x_t & \text{if } y_t (w_t \cdot x_t) < 0 \\ w_t & \text{otherwise.} \end{cases}$
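
The same update written as a single stochastic (sub)gradient step on f(w, x) = max(0, -y (w · x)), matching the case analysis above; this is a sketch only, with names chosen for the example.

```python
# One SGD step on f(w, x) = max(0, -y (w . x)); subgradient is -y x when y (w . x) < 0.

def perceptron_sgd_step(w, x_t, y_t, eta=1.0):
    margin = y_t * sum(wi * xi for wi, xi in zip(w, x_t))
    if margin < 0:                        # f differentiable here, gradient -y x
        return [wi + eta * y_t * xi for wi, xi in zip(w, x_t)]
    return list(w)                        # zero (sub)gradient: no change
```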

Perceptron Algorithm - Bound
(Novikoff, 1962)

Theorem: assume that $\|x_t\| \le R$ for all $t \in [1, T]$ and that for some $\rho > 0$ and $v \in \mathbb{R}^N$, for all $t \in [1, T]$,
$\rho \le \frac{y_t (v \cdot x_t)}{\|v\|}$.
Then, the number of mistakes made by the Perceptron algorithm is bounded by $R^2/\rho^2$.

Proof: let $I$ be the set of rounds $t$ at which there is an update and let $M$ be the total number of updates.

Summing up the assumption inequalities gives:
$M \rho \le \frac{v \cdot \sum_{t \in I} y_t x_t}{\|v\|}$
$= \frac{v \cdot \sum_{t \in I} (w_{t+1} - w_t)}{\|v\|}$   (definition of the updates)
$= \frac{v \cdot w_{T+1}}{\|v\|}$
$\le \|w_{T+1}\|$   (Cauchy-Schwarz inequality)
$= \|w_{t_m} + y_{t_m} x_{t_m}\|$   ($t_m$ largest $t$ in $I$)
$= \big[\|w_{t_m}\|^2 + \|x_{t_m}\|^2 + \underbrace{2\, y_{t_m} (w_{t_m} \cdot x_{t_m})}_{\le 0}\big]^{1/2}$
$\le \big[\|w_{t_m}\|^2 + R^2\big]^{1/2}$
$\le \big[M R^2\big]^{1/2} = \sqrt{M}\, R$   (applying the same argument to the previous rounds in $I$).

Thus $\sqrt{M} \le R/\rho$, that is $M \le R^2/\rho^2$.

Notes

- The bound is independent of the dimension and is tight.
- Convergence can be slow for a small margin: it can be in $\Omega(2^N)$.
- Among the many variants: the voted Perceptron algorithm. Predict according to $\mathrm{sgn}\big(\big(\sum_{t \in I} c_t w_t\big) \cdot x\big)$, where $c_t$ is the number of iterations $w_t$ survives.
- $\{x_t : t \in I\}$ are the support vectors for the Perceptron algorithm.
- Non-separable case: the algorithm does not converge.

Perceptron - Leave-One-Out Analysis

Theorem: let $h_S$ be the hypothesis returned by the Perceptron algorithm for a linearly separable sample $S$ drawn according to $D$, and let $M(S)$ be the number of updates defining $h_S$. Then,
$\operatorname*{E}_{S \sim D^m}[R(h_S)] \le \operatorname*{E}_{S \sim D^{m+1}}\Big[\frac{\min\big(M(S),\, R_{m+1}^2/\rho_{m+1}^2\big)}{m+1}\Big]$.

Proof: let $S \sim D^{m+1}$ be a linearly separable sample and let $x \in S$. If $h_{S - \{x\}}$ misclassifies $x$, then $x$ must be a 'support vector' for $h_S$ (an update occurs at $x$). Thus,
$\widehat{R}_{\mathrm{loo}}(\text{Perceptron}) \le \frac{M(S)}{m+1}$.

Perceptron - Non-Separable Bound
(MM and Rostamizadeh, 2013)

Theorem: let $I$ denote the set of rounds at which the Perceptron algorithm makes an update when processing $x_1, \ldots, x_T$ and let $M_T = |I|$. Then,
$M_T \le \inf_{\rho > 0,\, \|u\|_2 \le 1} \Big(\frac{R}{\rho} + \sqrt{L_\rho(u)}\Big)^2$,
where $R = \max_{t \in I} \|x_t\|$ and $L_\rho(u) = \sum_{t \in I} \Big(1 - \frac{y_t (u \cdot x_t)}{\rho}\Big)_+$.

Proof: for any $t$, $1 - \frac{y_t (u \cdot x_t)}{\rho} \le \big(1 - \frac{y_t (u \cdot x_t)}{\rho}\big)_+$. Summing up these inequalities for $t \in I$ yields:
$M_T \le \sum_{t \in I} \Big(1 - \frac{y_t (u \cdot x_t)}{\rho}\Big)_+ + \sum_{t \in I} \frac{y_t (u \cdot x_t)}{\rho} \le L_\rho(u) + \frac{\sqrt{M_T}\, R}{\rho}$,
by upper-bounding $\sum_{t \in I} y_t (u \cdot x_t)$ as in the proof for the separable case.

Solving the second-degree inequality $M_T - \sqrt{M_T}\, \frac{R}{\rho} - L_\rho(u) \le 0$ gives
$\sqrt{M_T} \le \frac{\frac{R}{\rho} + \sqrt{\frac{R^2}{\rho^2} + 4 L_\rho(u)}}{2} \le \frac{R}{\rho} + \sqrt{L_\rho(u)}$.

Non-Separable Case - L2 Bound
(Freund and Schapire, 1998; MM and Rostamizadeh, 2013)

Theorem: let $I$ denote the set of rounds at which the Perceptron algorithm makes an update when processing $x_1, \ldots, x_T$ and let $M_T = |I|$. Then,
$M_T \le \inf_{\rho > 0,\, \|u\|_2 \le 1} \bigg(\frac{\|L_\rho(u)\|_2}{2} + \sqrt{\frac{\|L_\rho(u)\|_2^2}{4} + \frac{\sqrt{\sum_{t \in I} \|x_t\|_2^2}}{\rho}}\bigg)^2$;
when $\|x_t\| \le R$ for all $t \in I$, this implies
$M_T \le \inf_{\rho > 0,\, \|u\|_2 \le 1} \Big(\frac{R}{\rho} + \|L_\rho(u)\|_2\Big)^2$,
where $L_\rho(u) = \Big(\big(1 - \tfrac{y_t (u \cdot x_t)}{\rho}\big)_+\Big)_{t \in I}$ is the vector of hinge losses over the rounds in $I$.

Proof: reduce the problem to the separable case in a higher dimension. Let $\ell_t = \big(1 - \frac{y_t (u \cdot x_t)}{\rho}\big)_+$ for $t \in [1, T]$. Mapping (similar to the trivial mapping): map each $x_t \in \mathbb{R}^N$ to $x_t' \in \mathbb{R}^{N+T}$ whose first $N$ components coincide with those of $x_t$ and whose $(N+t)$-th component equals $\Delta$; map $u$ to $u' \in \mathbb{R}^{N+T}$ whose first $N$ components are $u_i/Z$ and whose $(N+t)$-th component is $\frac{y_t \ell_t \rho}{\Delta Z}$, where
$Z = \sqrt{1 + \frac{\rho^2 \|L_\rho(u)\|_2^2}{\Delta^2}}$,
so that $\|u'\| \le 1$.

Observe that the Perceptron algorithm makes the same predictions and makes updates at the same rounds when processing $x_1', \ldots, x_T'$.

For any $t \in I$,
$y_t (u' \cdot x_t') = \frac{y_t (u \cdot x_t)}{Z} + \frac{\rho \ell_t}{Z} \ge \frac{y_t (u \cdot x_t)}{Z} + \frac{\rho - y_t (u \cdot x_t)}{Z} = \frac{\rho}{Z}$.

Summing up and using the proof in the separable case yields:
$\frac{M_T\, \rho}{Z} \le \sum_{t \in I} y_t (u' \cdot x_t') \le \sqrt{\sum_{t \in I} \|x_t'\|^2} = \sqrt{\sum_{t \in I} \|x_t\|^2 + M_T \Delta^2}$.

The inequality can be rewritten as
$\frac{M_T^2 \rho^2}{1 + \frac{\rho^2 \|L_\rho(u)\|_2^2}{\Delta^2}} \le r^2 + M_T \Delta^2$,
that is
$M_T^2 \rho^2 \le r^2 + M_T \Delta^2 + \frac{\rho^2 \|L_\rho(u)\|_2^2\, r^2}{\Delta^2} + M_T\, \rho^2 \|L_\rho(u)\|_2^2$,
where $r^2 = \sum_{t \in I} \|x_t\|^2$.

Selecting $\Delta$ to minimize the bound gives $\Delta^2 = \frac{\rho \|L_\rho(u)\|_2\, r}{\sqrt{M_T}}$ and leads to
$M_T^2 \rho^2 \le r^2 + 2\sqrt{M_T}\, \rho \|L_\rho(u)\|_2\, r + M_T\, \rho^2 \|L_\rho(u)\|_2^2 = \big(r + \sqrt{M_T}\, \rho \|L_\rho(u)\|_2\big)^2$.

Taking square roots, $M_T \rho \le r + \sqrt{M_T}\, \rho \|L_\rho(u)\|_2$. Solving the second-degree inequality
$M_T - \sqrt{M_T}\, \|L_\rho(u)\|_2 - \frac{r}{\rho} \le 0$
yields directly the first statement. The second one results from replacing $r$ with $\sqrt{M_T}\, R$.

Dual Perceptron Algorithm

Dual-Perceptron(α_0)
  α ← α_0      ▷ typically α_0 = 0
  for t ← 1 to T do
      Receive(x_t)
      ŷ_t ← sgn(Σ_{s=1}^{T} α_s y_s (x_s · x_t))
      Receive(y_t)
      if ŷ_t ≠ y_t then
          α_t ← α_t + 1
  return α

Kernel Perceptron Algorithm
(Aizerman et al., 1964)

K is a PDS kernel.

Kernel-Perceptron(α_0)
  α ← α_0      ▷ typically α_0 = 0
  for t ← 1 to T do
      Receive(x_t)
      ŷ_t ← sgn(Σ_{s=1}^{T} α_s y_s K(x_s, x_t))
      Receive(y_t)
      if ŷ_t ≠ y_t then
          α_t ← α_t + 1
  return α
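
A Python sketch of the kernel Perceptron above; the dual coefficient alpha_t counts the updates at round t. The Gaussian kernel at the end is one example of a PDS kernel, assumed here for illustration.

```python
# Kernel Perceptron in dual form.
import math

def kernel_perceptron(X, Y, K):
    """X: list of points, Y: labels in {-1, +1}, K: PDS kernel function K(x, x')."""
    T = len(X)
    alpha = [0.0] * T
    for t in range(T):
        score = sum(alpha[s] * Y[s] * K(X[s], X[t]) for s in range(T))
        y_hat = 1 if score >= 0 else -1
        if y_hat != Y[t]:
            alpha[t] += 1.0               # dual update: one more mistake at round t
    return alpha

# Example PDS kernel (illustrative): Gaussian kernel on lists of floats.
def gaussian_kernel(x, xp, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, xp))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))
```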

Winnow Algorithm
(Littlestone, 1988)

Here $y_t \in \{-1, +1\}$.

Winnow(η)
  w_1 ← 1/N
  for t ← 1 to T do
      Receive(x_t)
      ŷ_t ← sgn(w_t · x_t)
      Receive(y_t)
      if ŷ_t ≠ y_t then
          Z_t ← Σ_{i=1}^{N} w_{t,i} exp(η y_t x_{t,i})
          for i ← 1 to N do
              w_{t+1,i} ← w_{t,i} exp(η y_t x_{t,i}) / Z_t
      else w_{t+1} ← w_t
  return w_{T+1}
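
A Python sketch of Winnow's multiplicative, normalized update for labels in {-1, +1}; variable names are mine.

```python
# Winnow: multiplicative weight update, renormalized to the simplex by Z_t.
import math

def winnow(stream, N, eta=0.1):
    """stream: iterable of (x_t, y_t) with x_t a length-N list, y_t in {-1, +1}."""
    w = [1.0 / N] * N
    mistakes = 0
    for x_t, y_t in stream:
        score = sum(wi * xi for wi, xi in zip(w, x_t))
        y_hat = 1 if score >= 0 else -1
        if y_hat != y_t:
            mistakes += 1
            scaled = [wi * math.exp(eta * y_t * xi) for wi, xi in zip(w, x_t)]
            Z_t = sum(scaled)
            w = [s / Z_t for s in scaled]
    return w, mistakes
```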

Notes

Winnow vs. weighted majority: for $y_{t,i} = x_{t,i} \in \{-1, +1\}$, $\mathrm{sgn}(w_t \cdot x_t)$ coincides with the majority vote. Multiplying the weights of correct or incorrect experts by $e^{\eta}$ or $e^{-\eta}$ is equivalent to multiplying the weights of incorrect ones by $\beta = e^{-2\eta}$.

Relationships with other algorithms: e.g., boosting and Perceptron (Winnow and Perceptron can be viewed as special instances of a general family).

Winnow Algorithm - Bound

Theorem: assume that $\|x_t\|_\infty \le R_\infty$ for all $t \in [1, T]$ and that for some $\rho_\infty > 0$ and $v \in \mathbb{R}^N$, $v \ge 0$, for all $t \in [1, T]$,
$\rho_\infty \le \frac{y_t (v \cdot x_t)}{\|v\|_1}$.
Then, the number of mistakes made by the Winnow algorithm is bounded by $2 (R_\infty^2/\rho_\infty^2) \log N$.

Proof: let $I$ be the set of rounds $t$ at which there is an update and let $M$ be the total number of updates.

Notes

Comparison with the Perceptron bound:
- dual norms: $\|\cdot\|_\infty$ norm for the $x_t$ and $\|\cdot\|_1$ norm for $v$.
- similar bounds with different norms.
- each is advantageous in different cases:
  - the Winnow bound is favorable when a sparse set of experts can predict well. For example, if $v = e_1$ and $x_t \in \{-1, +1\}^N$: $\log N$ vs $N$.
  - the Perceptron bound is favorable in the opposite situation.

Winnow Algorithm - Bound

Potential: $\Phi_t = \sum_{i=1}^{N} \frac{v_i}{\|v\|_1} \log \frac{v_i/\|v\|_1}{w_{t,i}}$   (relative entropy).

Upper bound: for each $t$ in $I$,
$\Phi_{t+1} - \Phi_t = \sum_{i=1}^{N} \frac{v_i}{\|v\|_1} \log \frac{w_{t,i}}{w_{t+1,i}} = \sum_{i=1}^{N} \frac{v_i}{\|v\|_1} \log \frac{Z_t}{\exp(\eta\, y_t x_{t,i})}$
$= \log Z_t - \eta\, \frac{y_t (v \cdot x_t)}{\|v\|_1} \le \log \Big[\sum_{i=1}^{N} w_{t,i} \exp(\eta\, y_t x_{t,i})\Big] - \eta \rho_\infty$
$= \log \operatorname*{E}_{w_t}\big[\exp(\eta\, y_t x_{t,i})\big] - \eta \rho_\infty$
$\le \log \exp\big(\eta^2 (2 R_\infty)^2 / 8\big) + \underbrace{\eta\, y_t (w_t \cdot x_t)}_{\le 0} - \eta \rho_\infty$   (Hoeffding)
$\le \eta^2 R_\infty^2 / 2 - \eta \rho_\infty$.

Winnow Algorithm - Bound

Upper bound: summing up the inequalities yields
$\Phi_{T+1} - \Phi_1 \le M (\eta^2 R_\infty^2 / 2 - \eta \rho_\infty)$.

Lower bound: note that
$\Phi_1 = \sum_{i=1}^{N} \frac{v_i}{\|v\|_1} \log \frac{v_i/\|v\|_1}{1/N} = \log N + \sum_{i=1}^{N} \frac{v_i}{\|v\|_1} \log \frac{v_i}{\|v\|_1} \le \log N$
and, for all $t$, $\Phi_t \ge 0$ (property of the relative entropy). Thus,
$\Phi_{T+1} - \Phi_1 \ge -\log N$.

Comparison: we obtain
$-\log N \le M (\eta^2 R_\infty^2 / 2 - \eta \rho_\infty)$.
For $\eta = \rho_\infty / R_\infty^2$, this gives
$M \le \frac{2 R_\infty^2 \log N}{\rho_\infty^2}$.

Conclusion

On-line learning:
- wide and fast-growing literature.
- many related topics, e.g., game theory, text compression, convex optimization.
- on-line to batch bounds and techniques.
- on-line versions of batch algorithms, e.g., regression algorithms (see the regression lecture).

References

- Aizerman, M. A., Braverman, E. M., & Rozonoer, L. I. (1964). Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25, 821-837.
- Nicolò Cesa-Bianchi, Alex Conconi, Claudio Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9): 2050-2057, 2004.
- Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
- Yoav Freund and Robert Schapire. Large margin classification using the perceptron algorithm. In Proceedings of COLT 1998. ACM Press, 1998.
- Nick Littlestone. From on-line to batch learning. COLT 1989: 269-284.
- Nick Littlestone. Learning quickly when irrelevant attributes abound: a new linear-threshold algorithm. Machine Learning, 2: 285-318, 1988.

References

- Nick Littlestone, Manfred K. Warmuth. The weighted majority algorithm. FOCS 1989: 256-261.
- Tom Mitchell. Machine Learning. McGraw Hill, 1997.
- Novikoff, A. B. (1962). On convergence proofs on perceptrons. Symposium on the Mathematical Theory of Automata, 12, 615-622. Polytechnic Institute of Brooklyn.
- Rosenblatt, Frank. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6): 386-408, 1958.
- Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, New York, 1998.
- Mehryar Mohri and Afshin Rostamizadeh. Perceptron Mistake Bounds. arXiv:1305.0208, 2013.

Appendix

SVMs - Leave-One-Out Analysis
(Vapnik, 1995)

Theorem: let $h_S$ be the optimal hyperplane for a sample $S$ and let $N_{SV}(S)$ be the number of support vectors defining $h_S$. Then,
$\operatorname*{E}_{S \sim D^m}[R(h_S)] \le \operatorname*{E}_{S \sim D^{m+1}}\Big[\frac{\min\big(N_{SV}(S),\, R_{m+1}^2/\rho_{m+1}^2\big)}{m+1}\Big]$.

Proof: one part was proven in lecture 4. The other part is due to $\alpha_i \ge 1/R_{m+1}^2$ for any $x_i$ misclassified by the SVM solution.

Comparison

Bounds on the expected error, not high-probability statements.

Leave-one-out bounds are not sufficient to distinguish SVMs from the Perceptron algorithm. Note however:
- the same maximum margin $\rho_{m+1}$ can be used in both.
- but the radius $R_{m+1}$ of the support vectors differs.

Difference: margin distribution.
