Algorithmic Trading: Reinforcement Learning

Transcription

Algorithmic Trading: Reinforcement Learning
Sebastian Jaimungal, University of Toronto
Many thanks to Álvaro Cartea, Oxford
April 2017
© Cartea & Jaimungal, 2016

Reinforcement Learning – Intro

- Reinforcement learning is unsupervised – based only on the rewards from actions and how the system reacts.
- As in continuous-time stochastic control, the actions affect both the reward and the system.
- It can be model-free.
- The goal is to maximize the performance criterion
  $$ H^a(s) = \mathbb{E}\!\left[ \sum_{k=0}^{\infty} \gamma^k \, R(a_k;\, S_k^a,\, S_{k+1}^a) \,\Big|\, S_t^a = s \right]. $$
- $S_t \in \mathcal{S}$ is the state of the system at time $t$.
- $a \in \mathcal{A}$, the admissible set of actions, which depend only on the state of the system.
- The system evolves in an action-dependent manner: $S_{t+1}^a = F(S_t;\, a_t)$.
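To make the performance criterion concrete, here is a minimal Monte Carlo sketch of estimating $H^a(s)$ by simulation. The helpers `policy` (a state-feedback action rule) and `step` (standing in for the dynamics $F$ plus the reward) are hypothetical, not objects from the slides, and the horizon truncates the infinite discounted sum.

```python
import numpy as np

# Minimal sketch (not from the slides): Monte Carlo estimate of
# H^a(s) = E[ sum_{k>=0} gamma^k R(a_k; S_k, S_{k+1}) | S_0 = s ].
# `policy` and `step` are hypothetical stand-ins for the action rule and the
# system dynamics F plus reward; the horizon truncates the infinite sum.

def discounted_return(s0, policy, step, gamma=0.99, horizon=1_000, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    s, total, discount = s0, 0.0, 1.0
    for _ in range(horizon):
        a = policy(s)                      # action depends only on the state
        s_next, reward = step(s, a, rng)   # system reacts to the action
        total += discount * reward
        discount *= gamma
        s = s_next
    return total

def estimate_H(s0, policy, step, n_paths=1_000, **kwargs):
    return float(np.mean([discounted_return(s0, policy, step, **kwargs)
                          for _ in range(n_paths)]))
```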

[Figure: directed graphical representation of the process ($X_k$, $Y_k$, $S_k$, $a_k$, $R_{k+1}$). When $Y_t = X_t$ the environment is fully observed.]

- Reinforcement learning aims to discover the best policy through:
  - Exploration – try an action, see how the system responds, and update the action choices (learn from the environment).
  - Exploitation – use what is already known to make the best action choice.
- Both are important!

- A Bellman principle applies, and one can show that $H \equiv H^{a^*}$ satisfies
  $$ H(s) = \max_{a \in \mathcal{A}} \mathbb{E}\!\left[ R(a;\, s,\, S_1^{s,a}) + \gamma\, H(S_1^{s,a}) \right]. $$
- With known transition probabilities of the states, this can be applied recursively to find $H$, and hence $a^*$:
  $$ H(s) = \max_{a \in \mathcal{A}} \sum_{s'} P^a_{s s'} \left[ R(a;\, s,\, s') + \gamma\, H(s') \right]. $$
- Make an initial assumption on $H$ – e.g., zeros.
- Iterate until "converged" (see the sketch below).
- Choose the actions which maximize the expression.
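A minimal sketch of this recursion (value iteration) for a finite state and action space, assuming the transition probabilities and rewards are stored as arrays of shape (n_actions, n_states, n_states); the array layout, tolerance, and iteration cap are illustrative assumptions, not part of the slides.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-8, max_iter=10_000):
    """Solve H(s) = max_a sum_{s'} P[a,s,s'] * (R[a,s,s'] + gamma * H(s'))."""
    n_actions, n_states, _ = P.shape
    H = np.zeros(n_states)                       # initial assumption on H: zeros
    for _ in range(max_iter):
        # Q[a, s] = sum_{s'} P[a, s, s'] * (R[a, s, s'] + gamma * H[s'])
        Q = np.einsum("asn,asn->as", P, R + gamma * H[None, None, :])
        H_new = Q.max(axis=0)                    # maximize over actions
        done = np.max(np.abs(H_new - H)) < tol   # iterate until "converged"
        H = H_new
        if done:
            break
    a_star = Q.argmax(axis=0)                    # actions attaining the maximum
    return H, a_star
```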

Q-learning

- Q-learning is an "off-policy" learning method.
  [NB: "on-policy" means the algorithm estimates the value function of the policy which generates the data.]
- Define
  $$ Q(s, a) = \mathbb{E}\!\left[ R(a;\, s,\, S_1^{s,a}) + \gamma\, H(S_1^{s,a}) \right]
             = \mathbb{E}\!\left[ R(a;\, s,\, S_1^{s,a}) + \gamma \max_{a' \in \mathcal{A}} Q(S_1^{s,a}, a') \right], $$
  because $H(s) = \max_{a \in \mathcal{A}} Q(s, a)$.
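A tiny sketch of the relation $H(s) = \max_a Q(s,a)$ for a tabular Q; the (n_states, n_actions) array layout and the toy sizes are assumptions made here for illustration only.

```python
import numpy as np

# Tiny sketch of H(s) = max_a Q(s, a) for a tabular Q; the (n_states, n_actions)
# layout and the toy sizes are assumptions for illustration only.

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 3))          # hypothetical 5 states, 3 actions

H = Q.max(axis=1)                    # H(s) = max_a Q(s, a)
a_star = Q.argmax(axis=1)            # greedy policy a*(s) = argmax_a Q(s, a)
assert np.allclose(H, Q[np.arange(5), a_star])
```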

- We wish to approximate the expectation from actual observations while learning.
- The algorithm is ε-greedy: at iteration $k$,
  - select a random action $a_k$ with probability $\varepsilon_k$ (Explore);
  - otherwise select the action given by the current best policy (Exploit),
    $$ a^*(s) = \arg\max_{a \in \mathcal{A}} Q(s, a). $$
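A minimal sketch of the ε-greedy selection rule over a tabular Q; the helper name `epsilon_greedy` and the (n_states, n_actions) layout are assumptions introduced here for illustration.

```python
import numpy as np

# Minimal sketch of epsilon-greedy action selection over a tabular Q with shape
# (n_states, n_actions); the helper name and layout are assumptions made here.

def epsilon_greedy(Q, s, eps, rng):
    n_actions = Q.shape[1]
    if rng.random() < eps:               # Explore: random action with prob. eps_k
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[s]))          # Exploit: a*(s) = argmax_a Q(s, a)
```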

1. Initialize Q(s, a) (randomly, or often just to zero).
2. Repeat (for every run):
3.    Initialize state s.
4.    Repeat (for each step in the run):
5.       Select an ε-greedy action a.
6.       Take action a, observe s' and reward R.
7.       Update Q according to
         $$ Q(s, a) \leftarrow (1 - \alpha_k)\, Q(s, a) + \alpha_k \left[ R + \gamma \max_{a' \in \mathcal{A}} Q(s', a') \right]. $$
8.       Update $s \leftarrow s'$.
9.       Go to 5 until the run is done.
10. Go to 3 until all runs are done.
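Putting the pieces together, here is a minimal sketch of the tabular Q-learning loop above. The environment interface `env_reset(rng) -> s` and `env_step(s, a, rng) -> (s', R, done)`, the schedules, and the run/step counts are all assumptions for illustration; `epsilon_greedy` is the helper sketched earlier.

```python
import numpy as np

# Minimal sketch of the tabular Q-learning loop above. The environment interface
# env_reset(rng) -> s and env_step(s, a, rng) -> (s', R, done), the schedules,
# and the run/step counts are assumptions for illustration.

def q_learning(n_states, n_actions, env_reset, env_step,
               n_runs=500, max_steps=200, gamma=0.95, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))          # 1. initialize Q (here: zeros)
    visits = np.zeros((n_states, n_actions))     # visit counts used for alpha_k
    for _ in range(n_runs):                      # 2. for every run
        s = env_reset(rng)                       # 3. initialize state s
        for k in range(max_steps):               # 4. for each step in the run
            eps = 1.0 / (1.0 + k)                # assumed eps_k = A/(B + k)
            a = epsilon_greedy(Q, s, eps, rng)   # 5. epsilon-greedy action
            s_next, R, done = env_step(s, a, rng)  # 6. take a, observe s', R
            visits[s, a] += 1
            alpha = 1.0 / visits[s, a]           # assumed per-pair alpha_k = 1/N(s,a)
            # 7. Q(s,a) <- (1 - alpha) Q(s,a) + alpha [R + gamma max_a' Q(s',a')]
            Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (R + gamma * Q[s_next].max())
            s = s_next                           # 8. update s <- s'
            if done:                             # next run when this one is done
                break
    return Q
```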

- Require decreasing $\alpha_k \downarrow 0$ such that
  $$ \sum_k \alpha_k = \infty \qquad \text{and} \qquad \sum_k \alpha_k^2 < \infty. $$
- Often
  $$ \varepsilon_k = \frac{A}{B + k} \qquad \text{and} \qquad \alpha_k = \frac{C}{D + k}. $$
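A small sketch of these schedules; the constants A, B, C, D are tuning parameters and the default values below are arbitrary illustrative choices, not values from the slides.

```python
# Sketch of the schedules eps_k = A/(B + k) and alpha_k = C/(D + k); the
# constants A, B, C, D are tuning parameters and the defaults below are
# arbitrary illustrative values, not values from the slides.

def eps_schedule(k, A=1.0, B=10.0):
    return A / (B + k)

def alpha_schedule(k, C=1.0, D=10.0):
    # alpha_k ~ C/k gives sum_k alpha_k = infinity and sum_k alpha_k^2 < infinity,
    # the conditions required on the slide.
    return C / (D + k)
```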

- The updating rule is akin to the recursive update used to estimate the mean $\mu = \mathbb{E}[X]$ of a random variable $X$ from its samples $x = \{x_1, \ldots, x_n\}$:
  $$ \mu_k = \frac{1}{k} \sum_{i=1}^{k} x_i = \mu_{k-1} + \frac{1}{k}\left( x_k - \mu_{k-1} \right). $$
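A quick numerical check of this identity; the simulated samples below are arbitrary and only illustrate that the recursive form reproduces the batch mean.

```python
import numpy as np

# Quick numerical check of the running-mean identity
# mu_k = mu_{k-1} + (1/k)(x_k - mu_{k-1}); the simulated samples are arbitrary.

rng = np.random.default_rng(0)
x = rng.normal(size=1_000)

mu = 0.0
for k, xk in enumerate(x, start=1):
    mu += (xk - mu) / k        # same form as the Q update with alpha_k = 1/k

assert np.isclose(mu, x.mean())
```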

Q-learning convergence

- Here, we look at why Q-learning converges to the optimal solution.
- Note first that the operator $B$, which acts as follows,
  $$ (BQ)(s, a) = \mathbb{E}\!\left[ R(a;\, s,\, S_1^{s,a}) + \gamma \max_{a' \in \mathcal{A}} Q(S_1^{s,a}, a') \right], $$
  is a contraction operator in the $L^\infty$-norm, i.e.,
  $$ \| BQ_1 - BQ_2 \|_\infty \le \gamma\, \| Q_1 - Q_2 \|_\infty. $$
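A small numerical sketch of this contraction property on a randomly generated finite MDP; the sizes, distributions, and (n_actions, n_states, n_states) array layout are arbitrary assumptions used purely to illustrate the inequality.

```python
import numpy as np

# Numerical sketch of the contraction property on a randomly generated finite
# MDP; the sizes, distributions, and array layout are arbitrary assumptions.

rng = np.random.default_rng(1)
n_actions, n_states, gamma = 3, 6, 0.9
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)                 # make each row a distribution
R = rng.normal(size=(n_actions, n_states, n_states))

def bellman_B(Q):
    """(B Q)(s, a) = sum_{s'} P[a,s,s'] * (R[a,s,s'] + gamma * max_a' Q(s', a'))."""
    V = Q.max(axis=1)                             # max_a' Q(s', a')
    return np.einsum("asn,asn->sa", P, R + gamma * V[None, None, :])

Q1 = rng.normal(size=(n_states, n_actions))
Q2 = rng.normal(size=(n_states, n_actions))
lhs = np.abs(bellman_B(Q1) - bellman_B(Q2)).max()   # ||B Q1 - B Q2||_inf
rhs = gamma * np.abs(Q1 - Q2).max()                 # gamma ||Q1 - Q2||_inf
assert lhs <= rhs + 1e-12
```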

Proof:
$$
\begin{aligned}
\| BQ_1 - BQ_2 \|_\infty
&\le \gamma \max_{s,a}\, \mathbb{E}\!\left[ \left| \max_{a' \in \mathcal{A}} Q_1(S_1^{s,a}, a') - \max_{a' \in \mathcal{A}} Q_2(S_1^{s,a}, a') \right| \right] \\
&\le \gamma \max_{s,a}\, \mathbb{E}\!\left[ \max_{s' \in \mathcal{S},\, a' \in \mathcal{A}} \left| Q_1(s', a') - Q_2(s', a') \right| \right] \\
&\le \gamma\, \| Q_1 - Q_2 \|_\infty.
\end{aligned}
$$
Hence, there is a chance the procedure converges, but we need more.

To show that Q-learning converges to the optimum, we will need a general stochastic approximation result.

Theorem. An iterative process
$$ \zeta_{k+1}(x) = \left(1 - \alpha_k(x)\right) \zeta_k(x) + \beta_k(x)\, F_k(x) $$
converges to zero a.s. under the following assumptions:
1. $\sum_k \alpha_k = \infty$, $\sum_k \alpha_k^2 < \infty$, $\sum_k \beta_k^2 < \infty$, and $\mathbb{E}[\beta_k(x) \mid \mathcal{F}_k] \le \mathbb{E}[\alpha_k(x) \mid \mathcal{F}_k]$ uniformly a.s.;
2. $\left\| \mathbb{E}[F_k(x) \mid \mathcal{F}_k, \beta_k] \right\|_W \le \delta\, \| \zeta_k \|_W$, for some $\delta \in (0, 1)$;
3. $\mathbb{V}[F_k(x) \mid \mathcal{F}_k, \beta_k] \le C \left(1 + \| \zeta_k \|_W\right)^2$.

Here, $\| \cdot \|_W$ denotes a weighted norm.

Next, set
$$ \zeta_k(s, a) = Q_k(s, a) - Q^*(s, a), $$
where $Q_k$ is the $k$-th iteration, i.e., $Q_k = \bar{B}^k Q_0$, and
$$ (\bar{B} Q_k)(s, a) = \left(1 - \alpha_k\right) Q_k(s, a) + \alpha_k \left[ R_k + \gamma \max_{a' \in \mathcal{A}} Q_k(s', a') \right]. $$

For Q-learning, by definition, we then have
$$ \zeta_{k+1}(s_k, a_k) = \left(1 - \alpha_k\right) \zeta_k(s_k, a_k) + \alpha_k \underbrace{\left[ R_k + \gamma \max_{a' \in \mathcal{A}} Q_k(s_{k+1}, a') - Q^*(s_k, a_k) \right]}_{\Psi_k(s_k,\, a_k)}. $$

- Writing
  $$ \Psi_k(s, a) := R_k^{s,a} + \gamma \max_{a' \in \mathcal{A}} Q_k(s_{k+1}^{s,a}, a') - Q^*(s, a), $$
  then
  $$ \mathbb{E}[\Psi_k(s, a) \mid \mathcal{F}_k] = (B Q_k)(s, a) - Q^*(s, a), $$
  and since $Q^*$ is a fixed point of $B$,
  $$ \mathbb{E}[\Psi_k(s, a) \mid \mathcal{F}_k] = (B Q_k - B Q^*)(s, a), $$
  so that
  $$ \left\| \mathbb{E}[\Psi_k(s, a) \mid \mathcal{F}_k] \right\|_\infty \le \gamma\, \| Q_k - Q^* \|_\infty, $$
  so part 2) of the general stochastic approximation result holds.

Next, we need the variance to be bounded:
$$
\begin{aligned}
\mathbb{V}[\Psi_k(s, a) \mid \mathcal{F}_k]
&= \mathbb{V}\!\left[ R_k^{s,a} + \gamma \max_{a' \in \mathcal{A}} Q_k(s_{k+1}^{s,a}, a') \,\Big|\, \mathcal{F}_k \right] \\
&= \mathbb{V}\!\left[ R_k^{s,a} \mid \mathcal{F}_k \right]
  + 2\gamma\, \mathbb{C}\!\left[ R_k^{s,a},\, \max_{a' \in \mathcal{A}} Q_k(s_{k+1}^{s,a}, a') \,\Big|\, \mathcal{F}_k \right]
  + \gamma^2\, \mathbb{V}\!\left[ \max_{a' \in \mathcal{A}} Q_k(s_{k+1}^{s,a}, a') \,\Big|\, \mathcal{F}_k \right] \\
&\le C \left( 1 + \| \zeta_k \|_W^2 \right).
\end{aligned}
$$
Under the assumption of bounded rewards, the variance constraint (part 3) holds.

Dyna-Q Learning

- The idea is to combine experience and a model.
- Update Q from experience (ε-greedy).
- Learn a model from experience.
- Simulate from the model, and update Q.

1. Initialize Q(s, a) (randomly, or often just to zero).
2. Initialize state s.
3. Select an ε-greedy action a from Q.
4. Take action a, observe s' and reward R.
5. Update Q according to
   $$ Q(s, a) \leftarrow (1 - \alpha_k)\, Q(s, a) + \alpha_k \left[ R + \gamma \max_{a' \in \mathcal{A}} Q(s', a') \right]. $$
6. Update the model: $(s, a) \mapsto (s', R)$. Repeat $n$ times:
   - randomly select $\tilde{s}$ from the previously visited states;
   - randomly select $\tilde{a}$ from the previous actions taken at state $\tilde{s}$;
   - use the model to get $(\tilde{s}, \tilde{a}) \mapsto (\tilde{s}', \tilde{R})$;
   - update Q according to
     $$ Q(\tilde{s}, \tilde{a}) \leftarrow (1 - \alpha_k)\, Q(\tilde{s}, \tilde{a}) + \alpha_k \left[ \tilde{R} + \gamma \max_{a' \in \mathcal{A}} Q(\tilde{s}', a') \right]. $$
7. Update $s \leftarrow s'$.
8. Go to 3.
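A minimal sketch of this Dyna-Q loop with a deterministic tabular model. The environment interface (`env_reset`, `env_step`), the constant step size alpha (used in place of a decaying $\alpha_k$ for brevity), and the planning budget are assumptions for illustration; `epsilon_greedy` is the helper sketched earlier.

```python
import numpy as np

# Minimal sketch of the Dyna-Q loop above with a deterministic tabular model.
# The environment interface, the constant step size alpha (instead of a decaying
# alpha_k), and the planning budget n_planning are assumptions for illustration.

def dyna_q(n_states, n_actions, env_reset, env_step,
           n_steps=5_000, n_planning=10, gamma=0.95, alpha=0.1, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    model = {}                                      # (s, a) -> (s', R) from experience
    s = env_reset(rng)
    for _ in range(n_steps):
        a = epsilon_greedy(Q, s, eps, rng)          # epsilon-greedy action from Q
        s_next, R, done = env_step(s, a, rng)       # real experience
        Q[s, a] += alpha * (R + gamma * Q[s_next].max() - Q[s, a])
        model[(s, a)] = (s_next, R)                 # update the model
        seen = list(model.keys())
        for _ in range(n_planning):                 # repeat n times on simulated data
            s_m, a_m = seen[rng.integers(len(seen))]  # previously visited (s~, a~)
            s_m_next, R_m = model[(s_m, a_m)]       # model gives (s~', R~)
            Q[s_m, a_m] += alpha * (R_m + gamma * Q[s_m_next].max() - Q[s_m, a_m])
        s = env_reset(rng) if done else s_next
    return Q
```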

A mean-reverting asset

[Figure: Dyna-Q learning results for a mean-reverting asset; plots omitted from this transcription.]

An execution strategy

[Figure: Dyna-Q learning results for an execution problem; plots omitted from this transcription.]

An execution strategy

[Figure: the learned execution strategy; plots omitted from this transcription.]
