ML Reinforcement Learning - New York University

Transcription

Foundations of Machine Learning
Reinforcement Learning
Mehryar Mohri
Courant Institute and Google Research
mohri@cims.nyu.edu

Reinforcement Learning
Agent exploring an environment. Interactions with the environment: find an action policy that maximizes cumulative reward over the course of the interactions.

Key Features
Contrast with supervised learning:
- no explicit labeled training data.
- distribution defined by the actions taken.
- delayed rewards or penalties.
RL trade-off:
- exploration (of unknown states and actions) to gain more reward information; vs.
- exploitation (of known information) to optimize reward.

Applications
- Robot control, e.g., RoboCup Soccer Teams (Stone et al., 1999).
- Board games, e.g., TD-Gammon (Tesauro, 1995).
- Elevator scheduling (Crites and Barto, 1996).
- Ads placement.
- Telecommunications.
- Inventory management.
- Dynamic radio channel assignment.

This Lecture
- Markov Decision Processes (MDPs)
- Planning
- Learning
- Multi-armed bandit problem

Markov Decision Process (MDP)
Definition: a Markov Decision Process is defined by:
- a set of decision epochs {0, ..., T}.
- a set of states S, possibly infinite.
- a start or initial state s_0.
- a set of actions A, possibly infinite.
- a transition probability Pr[s' | s, a]: distribution over destination states s' = δ(s, a).
- a reward probability Pr[r' | s, a]: distribution over rewards returned r' = r(s, a).

Model
- State observed at time t: s_t ∈ S.
- Action taken at time t: a_t ∈ A.
- State reached: s_{t+1} = δ(s_t, a_t).
- Reward received: r_{t+1} = r(s_t, a_t).
[Figure: agent-environment loop. At each step the agent, in state s_t, takes action a_t, receives reward r_{t+1} from the environment, and moves to state s_{t+1}: s_t --a_t/r_{t+1}--> s_{t+1} --a_{t+1}/r_{t+2}--> s_{t+2} ...]

MDPs - Properties
- Finite MDPs: A and S finite sets.
- Finite horizon when T < ∞.
- Reward r(s, a): often a deterministic function.

Example - Robot Picking up Balls
[Figure: two-state MDP with states "start" and "other"; transitions labeled action/[probability, reward]: search/[.1, R1], search/[.9, R1], carry/[.5, R3], carry/[.5, -1], pickup/[1, R2].]

Policy
Definition: a policy is a mapping π: S → A.
Objective: find a policy π maximizing the expected return:
- finite horizon return: Σ_{t=0}^{T-1} r(s_t, π(s_t)).
- infinite horizon return: Σ_{t≥0} γ^t r(s_t, π(s_t)).
Theorem: for any finite MDP, there exists an optimal policy (for any start state).

Policy Value
Definition: the value of a policy π at state s is
- finite horizon: V_π(s) = E[ Σ_{t=0}^{T-1} r(s_t, π(s_t)) | s_0 = s ].
- infinite horizon: with discount factor γ ∈ [0, 1), V_π(s) = E[ Σ_{t≥0} γ^t r(s_t, π(s_t)) | s_0 = s ].
Problem: find a policy π with maximum value for all states.

Policy Evaluation
Analysis of the policy value:
V_π(s) = E[ Σ_{t≥0} γ^t r(s_t, π(s_t)) | s_0 = s ]
       = E[r(s, π(s))] + γ E[ Σ_{t≥0} γ^t r(s_{t+1}, π(s_{t+1})) | s_0 = s ]
       = E[r(s, π(s))] + γ E[V_π(δ(s, π(s)))].
Bellman equations (system of linear equations):
V_π(s) = E[r(s, π(s))] + γ Σ_{s'} Pr[s' | s, π(s)] V_π(s').

Bellman Equation - Existence and Uniqueness
Notation:
- transition probability matrix P: P_{s,s'} = Pr[s' | s, π(s)].
- value column vector V: V_s = V_π(s).
- expected reward column vector R: R_s = E[r(s, π(s))].
Theorem: for a finite MDP, Bellman's equation admits a unique solution, given by
V_π = (I − γP)^{-1} R.

Bellman Equation - Existence and Uniqueness
Proof: Bellman's equation can be rewritten as V = R + γPV.
P is a stochastic matrix, thus
‖P‖_∞ = max_s Σ_{s'} |P_{s,s'}| = max_s Σ_{s'} Pr[s' | s, π(s)] = 1.
This implies that ‖γP‖_∞ = γ < 1. The eigenvalues of γP are all less than one in modulus, and (I − γP) is invertible.
Notes: general shortest-distance problem (MM, 2002).
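
As a quick illustration (not part of the original slides), the closed-form solution V_π = (I − γP)^{-1} R can be computed with a linear solver. A minimal sketch for a tabular MDP, with hypothetical inputs P (the |S| x |S| transition matrix under the fixed policy π) and R (the expected-reward vector):

```python
import numpy as np

def evaluate_policy(P, R, gamma):
    """Exact policy evaluation: solve (I - gamma * P) V = R.

    P: (n_states, n_states) transition matrix under the fixed policy.
    R: (n_states,) expected immediate rewards under the policy.
    gamma: discount factor in [0, 1).
    """
    n = P.shape[0]
    # Solving the linear system is preferable to forming the inverse explicitly.
    return np.linalg.solve(np.eye(n) - gamma * P, R)

# Tiny example: two states that deterministically alternate, reward 1 everywhere.
P = np.array([[0.0, 1.0], [1.0, 0.0]])
R = np.array([1.0, 1.0])
print(evaluate_policy(P, R, gamma=0.5))  # [2. 2.]
```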

Optimal Policy
Definition: a policy π* with maximal value for all states s ∈ S.
- value of π* (optimal value): ∀s ∈ S, V_{π*}(s) = max_π V_π(s).
- optimal state-action value function: the expected return for taking action a at state s and then following the optimal policy:
Q*(s, a) = E[r(s, a)] + γ E[V*(δ(s, a))]
         = E[r(s, a)] + γ Σ_{s'∈S} Pr[s' | s, a] V*(s').

Optimal Values - Bellman Equations
Property: the following equalities hold: ∀s ∈ S, V*(s) = max_{a∈A} Q*(s, a).
Proof: by definition, for all s, V*(s) ≤ max_{a∈A} Q*(s, a). If for some s we had V*(s) < max_{a∈A} Q*(s, a), then the maximizing action would define a better policy. Thus,
V*(s) = max_{a∈A} { E[r(s, a)] + γ Σ_{s'∈S} Pr[s' | s, a] V*(s') }.

This Lecture
- Markov Decision Processes (MDPs)
- Planning
- Learning
- Multi-armed bandit problem

Known Model
Setting: the environment model is known.
Problem: find the optimal policy.
Algorithms:
- value iteration.
- policy iteration.
- linear programming.

Value Iteration Algorithm
Define the operator Φ by
Φ(V)(s) = max_{a∈A} { E[r(s, a)] + γ Σ_{s'∈S} Pr[s' | s, a] V(s') }, i.e., Φ(V) = max_π {R_π + γ P_π V}.

ValueIteration(V_0)
1  V ← V_0                         (V_0 arbitrary value)
2  while ‖V − Φ(V)‖ ≥ ((1 − γ)/γ) ε do
3      V ← Φ(V)
4  return Φ(V)
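
For concreteness, a minimal Python sketch of the value iteration loop above for a tabular MDP; the arrays P[s, a, s'] (transition probabilities) and R[s, a] (expected rewards) are hypothetical inputs, and the stopping test mirrors the (1 − γ)ε/γ criterion of the pseudocode:

```python
import numpy as np

def value_iteration(P, R, gamma, eps=1e-6):
    """P: (S, A, S) transition probabilities, R: (S, A) expected rewards."""
    n_states = P.shape[0]
    V = np.zeros(n_states)                         # arbitrary initial value V0

    def phi(V):
        # Phi(V)(s) = max_a { R[s, a] + gamma * sum_s' P[s, a, s'] V(s') }
        return np.max(R + gamma * P @ V, axis=1)

    while np.max(np.abs(V - phi(V))) >= (1 - gamma) * eps / gamma:
        V = phi(V)
    V = phi(V)                                     # return Phi(V), as in the pseudocode
    policy = np.argmax(R + gamma * P @ V, axis=1)  # greedy policy w.r.t. V
    return V, policy
```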

VI Algorithm - Convergence
Theorem: for any initial value V_0, the sequence defined by V_{n+1} = Φ(V_n) converges to V*.
Proof: we show that Φ is γ-contracting for ‖·‖_∞; the existence and uniqueness of the fixed point follow. For any s ∈ S, let a*(s) be the maximizing action defining Φ(V)(s). Then, for s ∈ S and any U,
Φ(V)(s) − Φ(U)(s) ≤ Φ(V)(s) − ( E[r(s, a*(s))] + γ Σ_{s'∈S} Pr[s' | s, a*(s)] U(s') )
                  = γ Σ_{s'∈S} Pr[s' | s, a*(s)] [V(s') − U(s')]
                  ≤ γ Σ_{s'∈S} Pr[s' | s, a*(s)] ‖V − U‖_∞ = γ ‖V − U‖_∞.

Complexity and Optimality
Complexity: convergence in O(log(1/ε)) iterations. Observe that
‖V_{n+1} − V_n‖ = ‖Φ(V_n) − Φ(V_{n−1})‖ ≤ γ^n ‖Φ(V_0) − V_0‖.
Thus, γ^n ‖Φ(V_0) − V_0‖ ≤ ((1 − γ)/γ) ε gives n = O(log(1/ε)).
ε-Optimality: let V_{n+1} be the value returned. Then,
‖V* − V_{n+1}‖ ≤ ‖V* − Φ(V_{n+1})‖ + ‖Φ(V_{n+1}) − V_{n+1}‖
             ≤ γ ‖V* − V_{n+1}‖ + γ ‖V_{n+1} − V_n‖.
Thus, ‖V* − V_{n+1}‖ ≤ (γ/(1 − γ)) ‖V_{n+1} − V_n‖ ≤ ε.

VI Algorithm - Example
[Figure: two-state MDP; transitions labeled action/[probability, reward]: from state 1, a/[3/4, 2] (self-loop), a/[1/4, 2] (to state 2), b/[1, 2] (to state 2); from state 2, c/[1, 2] (self-loop), d/[1, 3] (to state 1).]
V_{n+1}(1) = max{ 2 + γ( (3/4) V_n(1) + (1/4) V_n(2) ), 2 + γ V_n(2) }
V_{n+1}(2) = max{ 3 + γ V_n(1), 2 + γ V_n(2) }.
For V_0(1) = −1, V_0(2) = 1, γ = 1/2: V_1(1) = V_1(2) = 5/2. But V*(1) = 14/3, V*(2) = 16/3.

Policy Iteration Algorithm
PolicyIteration(π_0)
1  π ← π_0                         (π_0 arbitrary policy)
2  π' ← nil
3  while (π ≠ π') do
4      V ← V_π                      (policy evaluation: solve (I − γP_π)V = R_π)
5      π' ← π
6      π ← argmax_π {R_π + γ P_π V} (greedy policy improvement)
7  return π
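
A corresponding sketch of policy iteration under the same assumptions (hypothetical arrays P[s, a, s'] and R[s, a]); each iteration solves the linear system (I − γ P_π)V = R_π and then improves the policy greedily:

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """P: (S, A, S) transition probabilities, R: (S, A) expected rewards."""
    n_states, n_actions = R.shape
    pi = np.zeros(n_states, dtype=int)                 # arbitrary initial policy
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = R_pi.
        P_pi = P[np.arange(n_states), pi]              # (S, S)
        R_pi = R[np.arange(n_states), pi]              # (S,)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Greedy policy improvement.
        new_pi = np.argmax(R + gamma * P @ V, axis=1)
        if np.array_equal(new_pi, pi):                 # no change: optimal policy
            return V, pi
        pi = new_pi
```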

PI Algorithm - Convergence
Theorem: let (V_n)_{n∈N} be the sequence of policy values computed by the algorithm. Then,
V_n ≤ V_{n+1} ≤ V*.
Proof: let π_{n+1} be the policy improvement at the n-th iteration. Then, by definition,
R_{π_{n+1}} + γ P_{π_{n+1}} V_n ≥ R_{π_n} + γ P_{π_n} V_n = V_n;
therefore, R_{π_{n+1}} ≥ (I − γ P_{π_{n+1}}) V_n.
Note that (I − γ P_{π_{n+1}})^{-1} preserves ordering:
X ≥ 0 ⇒ (I − γ P_{π_{n+1}})^{-1} X = Σ_{k≥0} (γ P_{π_{n+1}})^k X ≥ 0.
Thus, V_{n+1} = (I − γ P_{π_{n+1}})^{-1} R_{π_{n+1}} ≥ V_n.

Notes
Two consecutive policy values can be equal only at the last iteration.
The total number of possible policies is |A|^|S|; thus, this is the maximal possible number of iterations.
Best upper bound known: O(|A|^|S| / |S|).

PI Algorithm - Example
[Figure: same two-state MDP as in the VI example.]
Initial policy: π_0(1) = b, π_0(2) = c.
Evaluation:
V_{π_0}(1) = 2 + γ V_{π_0}(2)
V_{π_0}(2) = 2 + γ V_{π_0}(2).
Thus, V_{π_0}(1) = V_{π_0}(2) = 2 / (1 − γ).

VI and PI Algorithms - Comparison
Theorem: let (U_n)_{n∈N} be the sequence of policy values generated by the VI algorithm and (V_n)_{n∈N} the one generated by the PI algorithm. If U_0 ≤ V_0, then
∀n ∈ N, U_n ≤ V_n ≤ V*.
Proof: we first show that Φ is monotonic. Let U and V be such that U ≤ V and let π be the policy such that Φ(U) = R_π + γ P_π U. Then,
Φ(U) ≤ R_π + γ P_π V ≤ max_{π'} {R_{π'} + γ P_{π'} V} = Φ(V).

VI and PI Algorithms - Comparison
The proof is by induction on n. Assume U_n ≤ V_n; then, by the monotonicity of Φ,
U_{n+1} = Φ(U_n) ≤ Φ(V_n) = max_π {R_π + γ P_π V_n}.
Let π_{n+1} be the maximizing policy: π_{n+1} = argmax_π {R_π + γ P_π V_n}. Then,
Φ(V_n) = R_{π_{n+1}} + γ P_{π_{n+1}} V_n ≤ R_{π_{n+1}} + γ P_{π_{n+1}} V_{n+1} = V_{n+1}.

Notes
The PI algorithm converges in a smaller number of iterations than the VI algorithm due to the optimal policy. But each iteration of the PI algorithm requires computing a policy value, i.e., solving a system of linear equations, which is more expensive to compute than an iteration of the VI algorithm.

Primal Linear Program
LP formulation: choose α(s) > 0, with Σ_s α(s) = 1:
min_V Σ_{s∈S} α(s) V(s)
subject to ∀s ∈ S, ∀a ∈ A, V(s) ≥ E[r(s, a)] + γ Σ_{s'∈S} Pr[s' | s, a] V(s').
Parameters:
- number of rows: |S| |A|.
- number of columns: |S|.
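
As an illustration, the primal LP can be handed to an off-the-shelf solver. A sketch using scipy.optimize.linprog, again with hypothetical arrays P[s, a, s'] and R[s, a]; the ≥ constraints are negated to fit linprog's A_ub x ≤ b_ub form:

```python
import numpy as np
from scipy.optimize import linprog

def lp_optimal_values(P, R, gamma, alpha=None):
    """Primal LP: min_V sum_s alpha(s) V(s)
    s.t. V(s) >= R[s, a] + gamma * sum_s' P[s, a, s'] V(s') for all (s, a)."""
    n_states, n_actions = R.shape
    if alpha is None:
        alpha = np.full(n_states, 1.0 / n_states)       # any alpha > 0 summing to 1
    A_ub = np.zeros((n_states * n_actions, n_states))   # one row per (s, a) pair
    b_ub = np.zeros(n_states * n_actions)
    for s in range(n_states):
        for a in range(n_actions):
            row = s * n_actions + a
            # -(V(s) - gamma * sum_s' P[s, a, s'] V(s')) <= -R[s, a]
            A_ub[row] = gamma * P[s, a]
            A_ub[row, s] -= 1.0
            b_ub[row] = -R[s, a]
    res = linprog(c=alpha, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * n_states)     # V is unconstrained in sign
    return res.x                                        # optimal values V*(s)
```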

Dual Linear Program
LP formulation:
max_x Σ_{s∈S, a∈A} E[r(s, a)] x(s, a)
subject to ∀s' ∈ S, Σ_{a∈A} x(s', a) = α(s') + γ Σ_{s∈S, a∈A} Pr[s' | s, a] x(s, a)
           ∀s ∈ S, ∀a ∈ A, x(s, a) ≥ 0.
Parameters (more favorable number of rows):
- number of rows: |S|.
- number of columns: |S| |A|.

This Lecture
- Markov Decision Processes (MDPs)
- Planning
- Learning
- Multi-armed bandit problem

Problem
Unknown model:
- transition and reward probabilities not known.
- realistic scenario in many practical problems, e.g., robot control.
Training information: sequence of immediate rewards based on the actions taken.
Learning approaches:
- model-free: learn a policy directly.
- model-based: learn a model, use it to learn a policy.

Problem
How do we estimate the reward and transition probabilities?
- use the equations derived for the policy value and Q functions.
- but those equations are given in terms of expectations.
- instance of a stochastic approximation problem.

Stochastic Approximation
Problem: find the solution of x = H(x) with x ∈ R^N, while:
- H(x) cannot be computed, e.g., H is not accessible;
- an i.i.d. sample of noisy observations H(x_i) + w_i is available, i ∈ [1, m], with E[w] = 0.
Idea: algorithm based on an iterative technique:
x_{t+1} = (1 − α_t) x_t + α_t [H(x_t) + w_t]
        = x_t + α_t [H(x_t) + w_t − x_t],
or, more generally, x_{t+1} = x_t + α_t D(x_t, w_t).

Mean Estimation
Theorem: let X be a random variable taking values in [0, 1] and let x_0, ..., x_m be i.i.d. values of X. Define the sequence (μ_m)_{m∈N} by
μ_{m+1} = (1 − α_m) μ_m + α_m x_m, with μ_0 = x_0, α_m ∈ [0, 1], Σ_{m≥0} α_m = +∞, and Σ_{m≥0} α_m² < +∞.
Then, μ_m → E[X] almost surely and in L².

Proof
Proof: by the independence assumption, for m ≥ 0,
Var[μ_{m+1}] = (1 − α_m)² Var[μ_m] + α_m² Var[x_m] ≤ (1 − α_m) Var[μ_m] + α_m².
We have α_m → 0 since Σ_{m≥0} α_m² < +∞.
Let ε > 0 and suppose there exists N ∈ N such that for all m ≥ N, Var[μ_m] ≥ ε. Then, for m ≥ N,
Var[μ_{m+1}] ≤ Var[μ_m] − ε α_m + α_m²,
which implies
Var[μ_{m+N}] ≤ Var[μ_N] − ε Σ_{n=N}^{m+N} α_n + Σ_{n=N}^{m+N} α_n² → −∞ when m → +∞,
contradicting Var[μ_{m+N}] ≥ 0.

Mean Estimation
Thus, for all N ∈ N, there exists m_0 ≥ N such that Var[μ_{m_0}] < ε. Choose N large enough so that ∀m ≥ N, α_m ≤ ε. Then,
Var[μ_{m_0+1}] ≤ (1 − α_{m_0}) ε + α_{m_0} ε = ε.
Therefore, Var[μ_m] ≤ ε for all m ≥ m_0 (L² convergence).

Notes
- Special case: α_m = 1/m (strong law of large numbers).
- Connection with stochastic approximation.
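
A small numerical illustration of the theorem and of the special case above (not part of the slides): any step-size schedule satisfying the two summability conditions converges to E[X], and steps of the form 1/m reduce the iterate to the ordinary sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_mean(samples, step):
    """Iteratively average samples: mu <- (1 - alpha) * mu + alpha * x."""
    mu = samples[0]                                   # mu_0 = x_0
    for m, x in enumerate(samples[1:]):
        alpha = step(m)
        mu = (1 - alpha) * mu + alpha * x
    return mu

xs = rng.uniform(0, 1, size=100_000)                  # i.i.d. values in [0, 1], E[X] = 0.5
print(stochastic_mean(xs, step=lambda m: 1.0 / (m + 2)))    # reduces to the sample mean
print(stochastic_mean(xs, step=lambda m: (m + 1) ** -0.7))  # another valid schedule
```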

TD(0) Algorithm
Idea: recall Bellman's linear equations giving V_π:
V_π(s) = E[r(s, π(s))] + γ Σ_{s'} Pr[s' | s, π(s)] V_π(s') = E_{s'}[ r(s, π(s)) + γ V_π(s') | s ].
Algorithm: temporal difference (TD).
- sample a new state s'.
- update (α depends on the number of visits of s):
V(s) ← (1 − α) V(s) + α [r(s, π(s)) + γ V(s')]
     = V(s) + α [r(s, π(s)) + γ V(s') − V(s)],
where the bracketed term is the temporal difference of V values.

TD(0) Algorithm
TD(0)(π)
 1  V ← V_0                        (initialization)
 2  for t ← 0 to T do
 3      s ← SelectState()
 4      for each step of epoch t do
 5          r ← Reward(s, π(s))
 6          s' ← NextState(π, s)
 7          V(s) ← (1 − α) V(s) + α [r + γ V(s')]
 8          s ← s'
 9  return V
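
A minimal Python sketch of this pseudocode, assuming a hypothetical environment object env with reset() and step(action) methods (the latter returning the next state and the reward) and a fixed policy stored as an array; the step size decays with the number of visits to each state, as the previous slide indicates:

```python
import numpy as np

def td0(env, policy, n_states, gamma, n_epochs=1000, steps_per_epoch=100):
    """Tabular TD(0) evaluation of a fixed policy."""
    V = np.zeros(n_states)                      # initialization V0
    visits = np.zeros(n_states)
    for _ in range(n_epochs):
        s = env.reset()                         # SelectState()
        for _ in range(steps_per_epoch):
            s_next, r = env.step(policy[s])     # NextState and Reward
            visits[s] += 1
            alpha = 1.0 / visits[s]             # step size decays with visits of s
            V[s] += alpha * (r + gamma * V[s_next] - V[s])   # temporal difference update
            s = s_next
    return V
```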

Q-Learning Algorithm
Idea: assume deterministic rewards.
Q*(s, a) = E[r(s, a)] + γ Σ_{s'∈S} Pr[s' | s, a] V*(s') = E_{s'}[ r(s, a) + γ max_{a'∈A} Q*(s', a') ].
Algorithm: α ∈ [0, 1] depends on the number of visits.
- sample a new state s'.
- update: Q(s, a) ← (1 − α) Q(s, a) + α [r(s, a) + γ max_{a'∈A} Q(s', a')].

Q-Learning Algorithm
(Watkins, 1989; Watkins and Dayan, 1992)

Q-Learning(π)
 1  Q ← Q_0                        (initialization, e.g., Q_0 = 0)
 2  for t ← 0 to T do
 3      s ← SelectState()
 4      for each step of epoch t do
 5          a ← SelectAction(π, s)       (policy π derived from Q, e.g., ε-greedy)
 6          r ← Reward(s, a)
 7          s' ← NextState(s, a)
 8          Q(s, a) ← Q(s, a) + α [r + γ max_{a'} Q(s', a') − Q(s, a)]
 9          s ← s'
10  return Q
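
A corresponding Python sketch of Q-learning with the same hypothetical env interface and an ε-greedy behavior policy; the step size again decays with the number of visits of each (s, a) pair:

```python
import numpy as np

def q_learning(env, n_states, n_actions, gamma,
               epsilon=0.1, n_epochs=1000, steps_per_epoch=100, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))          # Q0 = 0
    visits = np.zeros((n_states, n_actions))
    for _ in range(n_epochs):
        s = env.reset()                          # SelectState()
        for _ in range(steps_per_epoch):
            # epsilon-greedy action derived from the current Q.
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax(Q[s]))
            s_next, r = env.step(a)
            visits[s, a] += 1
            alpha = 1.0 / visits[s, a]
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q
```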

Notes
Can be viewed as a stochastic formulation of the value iteration algorithm.
Convergence for any policy, so long as states and actions are visited infinitely often.
How to choose the action at each iteration? Maximize reward? Explore other actions? Q-learning is an off-policy method: no control over the policy.

Policies
ε-greedy strategy:
- with probability 1 − ε, greedy action from s;
- with probability ε, random action.
Epoch-dependent strategy (Boltzmann exploration):
p_t(a | s, Q) = exp(Q(s, a) / τ_t) / Σ_{a'∈A} exp(Q(s, a') / τ_t),
where τ_t → 0 gives greedy selection and larger τ_t gives more random actions.
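
Both exploration strategies take only a few lines. A sketch with illustrative names (Q is the tabular state-action value array):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, s, epsilon):
    """Greedy action with probability 1 - epsilon, random action otherwise."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def boltzmann(Q, s, tau):
    """Sample an action with probability proportional to exp(Q(s, a) / tau)."""
    logits = Q[s] / tau
    logits = logits - logits.max()        # shift for numerical stability
    p = np.exp(logits)
    p /= p.sum()
    return int(rng.choice(len(p), p=p))
```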

Convergence of Q-Learning
Theorem: consider a finite MDP. Assume that for all s ∈ S and a ∈ A,
Σ_{t≥0} α_t(s, a) = +∞ and Σ_{t≥0} α_t²(s, a) < +∞, with α_t(s, a) ∈ [0, 1].
Then, the Q-learning algorithm converges to the optimal value Q* (with probability one).
Note: the conditions on α_t(s, a) impose that each state-action pair is visited infinitely many times.

SARSA: On-Policy Algorithm
SARSA(π)
 1  Q ← Q_0                        (initialization, e.g., Q_0 = 0)
 2  for t ← 0 to T do
 3      s ← SelectState()
 4      a ← SelectAction(π(Q), s)        (policy derived from Q, e.g., ε-greedy)
 5      for each step of epoch t do
 6          r ← Reward(s, a)
 7          s' ← NextState(s, a)
 8          a' ← SelectAction(π(Q), s')  (policy derived from Q, e.g., ε-greedy)
 9          Q(s, a) ← Q(s, a) + α_t(s, a) [r + γ Q(s', a') − Q(s, a)]
10          s ← s'
11          a ← a'
12  return Q
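
For comparison with Q-learning, a sketch of SARSA under the same hypothetical env interface; the bootstrap target uses the action actually selected at the next state rather than the max:

```python
import numpy as np

def sarsa(env, n_states, n_actions, gamma,
          epsilon=0.1, n_epochs=1000, steps_per_epoch=100, seed=0):
    rng = np.random.default_rng(seed)

    def select_action(Q, s):
        # epsilon-greedy policy derived from the current Q.
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[s]))

    Q = np.zeros((n_states, n_actions))              # Q0 = 0
    visits = np.zeros((n_states, n_actions))
    for _ in range(n_epochs):
        s = env.reset()
        a = select_action(Q, s)
        for _ in range(steps_per_epoch):
            s_next, r = env.step(a)
            a_next = select_action(Q, s_next)        # on-policy: next action, same policy
            visits[s, a] += 1
            alpha = 1.0 / visits[s, a]
            Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
            s, a = s_next, a_next
    return Q
```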

Notes
Differences with Q-learning:
- two states involved: the current and the next state.
- the maximum reward over actions at the next state is not used; instead, the value of the new action actually selected.
SARSA: the name is derived from the sequence of updates: state, action, reward, state, action.

TD(λ) Algorithm
Idea:
- TD(0) or Q-learning only use the immediate reward.
- use multiple steps ahead instead; for n steps:
R_t^n = r_{t+1} + γ r_{t+2} + ... + γ^{n−1} r_{t+n} + γ^n V(s_{t+n}),
V(s) ← V(s) + α (R_t^n − V(s)).
TD(λ) uses R_t^λ = (1 − λ) Σ_{n≥0} λ^n R_t^n.
Algorithm: V(s) ← V(s) + α (R_t^λ − V(s)).

TD(λ) Algorithm
TD(λ)(π)
 1  V ← V_0                        (initialization)
 2  e ← 0
 3  for t ← 0 to T do
 4      s ← SelectState()
 5      for each step of epoch t do
 6          s' ← NextState(π, s)
 7          δ ← r(s, π(s)) + γ V(s') − V(s)
 8          e(s) ← e(s) + 1
 9          for u ∈ S do
10              if u ≠ s then
11                  e(u) ← γλ e(u)
12              V(u) ← V(u) + α δ e(u)
13          s ← s'
14  return V
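
A sketch of TD(λ) with eligibility traces under the same hypothetical env interface. It uses the common accumulating-trace form (all traces decay by γλ, then the visited state's trace is incremented), which differs slightly in bookkeeping from the pseudocode above but implements the same idea:

```python
import numpy as np

def td_lambda(env, policy, n_states, gamma, lam,
              alpha=0.05, n_epochs=1000, steps_per_epoch=100):
    V = np.zeros(n_states)                   # initialization V0
    for _ in range(n_epochs):
        e = np.zeros(n_states)               # eligibility traces
        s = env.reset()
        for _ in range(steps_per_epoch):
            s_next, r = env.step(policy[s])
            delta = r + gamma * V[s_next] - V[s]   # temporal difference
            e *= gamma * lam                       # decay all traces
            e[s] += 1.0                            # accumulate trace of the visited state
            V += alpha * delta * e                 # every state updated by its trace
            s = s_next
    return V
```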

TD-Gammon
(Tesauro, 1995)
Large state space or costly actions: use a regression algorithm to estimate Q for unseen values.
Backgammon: large number of positions (30 pieces, 24-26 locations), large number of moves.
TD-Gammon:
- non-linear form of TD(λ), using neural networks.
- 1.5M games played.
- almost as good as world-class humans (master level).

This Lecture
- Markov Decision Processes (MDPs)
- Planning
- Learning
- Multi-armed bandit problem

Multi-Armed Bandit Problem
(Robbins, 1952)
Problem: a gambler must decide which arm of an N-slot machine to pull to maximize the total reward in a series of trials.
- stochastic setting: N lever reward distributions.
- adversarial setting: reward selected by an adversary aware of all the past.

Applications
- Clinical trials.
- Adaptive routing.
- Ads placement on pages.
- Games.

Multi-Armed Bandit Game
For t = 1 to T do
- the adversary determines the outcome y_t ∈ Y.
- the player selects a probability distribution p_t and pulls lever I_t ∈ {1, ..., N}, I_t ~ p_t.
- the player incurs loss L(I_t, y_t) (the adversary is informed of p_t and I_t).
Objective: minimize the regret
Regret(T) = Σ_{t=1}^T L(I_t, y_t) − min_{i=1,...,N} Σ_{t=1}^T L(i, y_t).

Notes
The player is informed only of the loss (or reward) corresponding to his own action.
The adversary knows the past but not the action selected.
Stochastic setting: the loss vector (L(1, y_t), ..., L(N, y_t)) is drawn according to some distribution D = D_1 × ··· × D_N. The regret definition is modified by taking expectations.
Exploration/exploitation trade-off: playing the best arm found so far versus seeking to find an arm with a better payoff.

Notes
Equivalent views:
- special case of learning with partial information.
- one-state MDP learning problem.
Simple strategy (ε-greedy): play the arm with the best empirical reward with probability 1 − ε_t, a random arm with probability ε_t.

Exponentially Weighted Average
Algorithm: Exp3, defined for γ, η > 0 by
p_{i,t} = (1 − γ) exp(−η Σ_{s=1}^{t−1} l̃_{i,s}) / Σ_{j=1}^N exp(−η Σ_{s=1}^{t−1} l̃_{j,s}) + γ/N,
with, for all i ∈ [1, N], l̃_{i,t} = (L(I_t, y_t) / p_{I_t,t}) 1_{I_t = i}.
Guarantee: expected regret of O(√(N T log N)).

Exponentially Weighted Average
Proof: similar to the one for the Exponentially Weighted Average algorithm, with the additional observation that
E_{I_t ~ p_t}[ l̃_{i,t} ] = Σ_{j=1}^N p_{j,t} (L(j, y_t) / p_{j,t}) 1_{j = i} = L(i, y_t).
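
A sketch of the Exp3 update above in Python; the loss oracle loss(i, t), returning L(i, y_t) ∈ [0, 1], is a hypothetical stand-in for the environment, and gamma and eta are the mixing and learning-rate parameters of the algorithm. Only the loss of the pulled arm is observed; the importance-weighted estimate divides by the probability of that arm:

```python
import numpy as np

def exp3(loss, N, T, gamma, eta, seed=0):
    rng = np.random.default_rng(seed)
    cum_est = np.zeros(N)                       # cumulative estimated losses
    total_loss = 0.0
    for t in range(T):
        shifted = cum_est - cum_est.min()       # shift for numerical stability
        weights = np.exp(-eta * shifted)
        p = (1 - gamma) * weights / weights.sum() + gamma / N
        i = int(rng.choice(N, p=p))             # pull lever I_t ~ p_t
        l = loss(i, t)                          # only this loss is revealed
        total_loss += l
        cum_est[i] += l / p[i]                  # importance-weighted loss estimate
    return total_loss
```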

References
- Dimitri P. Bertsekas. Dynamic Programming and Optimal Control. 2 vols. Belmont, MA: Athena Scientific, 2007.
- Mehryar Mohri. Semiring Frameworks and Algorithms for Shortest-Distance Problems. Journal of Automata, Languages and Combinatorics, 7(3):321-350, 2002.
- Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley-Interscience, New York, 1994.
- Herbert Robbins. Some Aspects of the Sequential Design of Experiments. Bulletin of the American Mathematical Society, 58(5):527-535, 1952.
- Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

References
- Gerald Tesauro. Temporal Difference Learning and TD-Gammon. Communications of the ACM, 38(3), 1995.
- Christopher J. C. H. Watkins. Learning from Delayed Rewards. Ph.D. thesis, Cambridge University, 1989.
- Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4), 1992.

Appendix

Stochastic Approximation
Problem: find the solution of x = H(x) with x ∈ R^N, while:
- H(x) cannot be computed, e.g., H is not accessible;
- an i.i.d. sample of noisy observations H(x_i) + w_i is available, i ∈ [1, m], with E[w] = 0.
Idea: algorithm based on an iterative technique:
x_{t+1} = (1 − α_t) x_t + α_t [H(x_t) + w_t]
        = x_t + α_t [H(x_t) + w_t − x_t],
or, more generally, x_{t+1} = x_t + α_t D(x_t, w_t).

Supermartingale Convergence
Theorem: let X_t, Y_t, Z_t be non-negative random variables such that Σ_{t≥0} Z_t < +∞. If the following condition holds:
E[X_{t+1} | F_t] ≤ X_t − Y_t + Z_t,
then X_t converges to a limit (with probability one), and Σ_{t≥0} Y_t < +∞.

Convergence Analysis
Convergence of x_{t+1} = x_t + α_t D(x_t, w_t), with history F_t defined by
F_t = { (x_τ)_{τ≤t}, (α_τ)_{τ≤t}, (w_τ)_{τ<t} }.
Theorem: let Ψ: x ↦ (1/2) ‖x − x*‖²_2 for some x*, and assume that:
- ∃K_1, K_2: E[ ‖D(x_t, w_t)‖²_2 | F_t ] ≤ K_1 + K_2 Ψ(x_t);
- ∃c > 0: ∇Ψ(x_t)ᵀ E[ D(x_t, w_t) | F_t ] ≤ −c Ψ(x_t);
- α_t > 0, Σ_{t≥0} α_t = +∞, Σ_{t≥0} α_t² < +∞.
Then, x_t → x* a.s.

Convergence Analysis
Proof: since Ψ is a quadratic function,
Ψ(x_{t+1}) = Ψ(x_t) + ∇Ψ(x_t)ᵀ (x_{t+1} − x_t) + (1/2) ‖x_{t+1} − x_t‖²_2.
Thus,
E[ Ψ(x_{t+1}) | F_t ] = Ψ(x_t) + α_t ∇Ψ(x_t)ᵀ E[ D(x_t, w_t) | F_t ] + (α_t²/2) E[ ‖D(x_t, w_t)‖²_2 | F_t ]
                     ≤ Ψ(x_t) − c α_t Ψ(x_t) + (α_t²/2) (K_1 + K_2 Ψ(x_t)).
Applying the supermartingale convergence theorem then shows that Ψ(x_t) converges and that Σ_t α_t Ψ(x_t) < +∞; since Σ_t α_t = +∞, this implies Ψ(x_t) → 0, i.e., x_t → x* a.s.
