Introduction To Deep Q-network

Transcription

Introduction to Deep Q-network
Presenter: Yunshu Du
CptS 580 Deep Learning
10/10/2016

Deep Q-network (DQN)

Deep Q-network (DQN)
 An artificial agent for general Atari game playing
– Learns to master 49 different Atari games directly from game screens
– Beats the best-performing learner from the same domain in 43 games
– Exceeds human expert performance in 29 games

Deep Q-network (DQN)
 A demo of DQN playing Atari Breakout: https://www.youtube.com/watch?v=V1eYniJ0Rnk

DQN is reinforcement learning + CNN magic!
 “Q”: Q-learning, a reinforcement learning (RL) method in which the agent interacts with the environment to maximize future rewards
 “Deep”, “network”: deep artificial neural networks that learn general representations in complex environments

Q-Learning
 Action-value (Q) function: Q(s, a) is the expected future discounted reward from taking action a in state s
 The optimal Q function obeys the Bellman equation: Q*(s, a) = E[ r + γ max_a′ Q*(s′, a′) | s, a ]
 The Q-learning update: Q(s, a) ← Q(s, a) + α [ r + γ max_a′ Q(s′, a′) − Q(s, a) ]
(source: …demystifying-deep-reinforcement-learning/)
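The update rule can be made concrete with a minimal tabular sketch (not from the slides; the state/action counts, α, and γ here are assumed example values):

```python
import numpy as np

# Illustrative tabular setting: a small discrete MDP.
n_states, n_actions = 16, 4
alpha, gamma = 0.1, 0.99          # learning rate and discount factor (assumed values)
Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next):
    """One Q-learning step: Q(s,a) <- Q(s,a) + alpha*(r + gamma*max_a' Q(s',a') - Q(s,a))."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

# Example transition: from state 3, action 1 yields reward 1.0 and lands in state 7.
q_update(s=3, a=1, r=1.0, s_next=7)
```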

Q-Learning
 Exploration vs. exploitation: do I want to learn as much as possible, or do my best at things I already know?
– ε-greedy exploration to select actions
(source: …demystifying-deep-reinforcement-learning/)
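A minimal sketch of ε-greedy action selection (the example Q values are made up for illustration):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon take a random action (explore), otherwise the greedy action (exploit)."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))   # explore
    return int(np.argmax(q_values))               # exploit

# Example: Q values for 4 actions in the current state (illustrative numbers).
action = epsilon_greedy(np.array([0.1, 0.5, 0.2, 0.0]), epsilon=0.1)
```

In the DQN paper ε is annealed from 1.0 down to 0.1 over the first million frames, so the agent explores heavily early on and exploits more as learning progresses.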

Example: Q-Learning for Atari Breakout

Q-Learning
 But what if there are too many states/actions?
– Solution: a deep convolutional network as a function approximator for the Q function, with weights θ

Deep convolutional neural network (CNN)
 Extracts features directly from raw pixels
 Atari game image pre-processing (see the sketch below)
(source: …rks/)
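A hedged sketch of the usual DQN-style pre-processing (grayscale, resize to 84x84, stack the 4 most recent frames); OpenCV is an assumed dependency, and the exact crop/resize details vary by implementation:

```python
import numpy as np
import cv2  # assumed dependency for color conversion and resizing

def preprocess(frame_rgb):
    """Convert one raw 210x160x3 Atari frame to an 84x84 grayscale image."""
    gray = cv2.cvtColor(frame_rgb, cv2.COLOR_RGB2GRAY)
    return cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA).astype(np.uint8)

def stack_frames(last_four):
    """Stack the 4 most recent preprocessed frames into the 84x84x4 network input."""
    return np.stack(last_four, axis=-1)
```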

DQN Architecture
 Input image: 84x84x4
 Convolutional layer 1: 32 filters, 8x8, stride 4 → output size (84−8)/4 + 1 = 20, i.e. 20*20*32; #W0 = (8*8*4)*32 = 8192
 Convolutional layer 2: 64 filters, 4x4, stride 2 → output size (20−4)/2 + 1 = 9, i.e. 9*9*64; #W1 = (4*4*32)*64 = 32768
 Convolutional layer 3: 64 filters, 3x3, stride 1 → output size (9−3)/1 + 1 = 7, i.e. 7*7*64; #W2 = (3*3*64)*64 = 36864
 Any missing component? Fully connected layers: reshape to 7*7*64 = 3136 → 512 rectifier units → output Q values for each action
(source: …p-qlearning)
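The slides don't fix a framework; as one possible rendering, here is a PyTorch sketch of the same layer stack (shapes follow the sizes above):

```python
import torch.nn as nn

class DQN(nn.Module):
    """The architecture above: three conv layers followed by two fully connected layers."""
    def __init__(self, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4),   # 84x84x4 -> 20x20x32
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),  # 20x20x32 -> 9x9x64
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),  # 9x9x64 -> 7x7x64
            nn.ReLU(),
            nn.Flatten(),                                # 7*7*64 = 3136
            nn.Linear(3136, 512),                        # 512 rectifier units
            nn.ReLU(),
            nn.Linear(512, n_actions),                   # one Q value per action
        )

    def forward(self, x):                                # x: (batch, 4, 84, 84), pixels scaled to [0, 1]
        return self.net(x)
```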

Deep Q-Learning
 Problem: reinforcement learning is known to be unstable or even to diverge when using a nonlinear function approximator such as a neural network
– Correlation between samples
– Small updates to the Q value may significantly change the policy
Tsitsiklis, J. N., & Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5), 674-690.

Deep Q-Learning
 Solutions in DQN
– Experience replay: at each iteration store the experience e_t = (s_t, a_t, r_t, s_{t+1}) in D_t = {e_1, …, e_t}; randomly draw samples of experience (s, a, r, s′) ~ U(D) and apply the Q update in minibatch fashion
– Separate target network: clone Q(s, a; θ) to a separate target network Q̂(s, a; θ−) every C time steps; treat y as the target and hold θ− fixed while updating θ
– Reward clipping to {−1, 1}
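A minimal sketch of the experience-replay component (capacity and field layout are assumptions; the paper's buffer holds on the order of a million transitions):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores experiences e_t = (s_t, a_t, r_t, s_{t+1}, done); the oldest entries are discarded."""
    def __init__(self, capacity=1000000):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        """Uniform sampling (s,a,r,s') ~ U(D) breaks the correlation between consecutive samples."""
        return random.sample(list(self.buffer), batch_size)  # copy for simplicity in this sketch

    def __len__(self):
        return len(self.buffer)
```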

Deep Q-network (DQN)
 Minimize the squared error loss L(θ) = E[ (y − Q(s, a; θ))² ], where y = r + γ max_a′ Q̂(s′, a′; θ−) is the target and Q(s, a; θ) is the prediction
 Stochastic gradient descent w.r.t. the weights
– Minibatch of size 32
 Update weights using RMSprop: divide the gradient by a running average of its recent magnitude
http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
https://en.wikipedia.org/wiki/Stochastic_gradient_descent
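One way to write that loss down, assuming the PyTorch DQN sketch above and batched tensors (the learning rate is an assumed value):

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, s, a, r, s_next, done, gamma=0.99):
    """Squared error between target y = r + gamma*max_a' Q_hat(s',a'; theta^-) and prediction Q(s,a; theta).

    Shapes assumed: s, s_next (B,4,84,84); a int64 (B,); r, done float (B,).
    """
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)                      # prediction Q(s,a; theta)
    with torch.no_grad():                                                     # theta^- held fixed
        y = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values   # target
    return F.mse_loss(q_sa, y)

# optimizer = torch.optim.RMSprop(q_net.parameters(), lr=0.00025)  # RMSprop, minibatches of size 32
```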

DQN: Putting It Together
 Input → CNN → Q values for actions
 Store experience {s_t, a_t, r_t, s_{t+1}}, then sample a minibatch
 Calculate the target for each sample
 Calculate the gradient and update the weights
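To show how the pieces connect, here is a control-flow-only sketch with dummy stand-ins for the environment, network, and gradient step (every helper here is a placeholder, not a real API; the real pieces are sketched in the earlier blocks):

```python
import random
import numpy as np
from collections import deque

# Dummy stand-ins so the loop runs end-to-end.
def env_reset(): return np.zeros((84, 84, 4), dtype=np.uint8)
def env_step(a): return np.zeros((84, 84, 4), dtype=np.uint8), 1.0, random.random() < 0.01
def q_values(s): return np.random.rand(4)     # placeholder for the CNN forward pass
def train_step(batch): pass                   # placeholder: compute targets, gradient, weight update
def sync_target(): pass                       # placeholder: clone theta into theta^-

buffer, C, batch_size = deque(maxlen=100000), 10000, 32
state = env_reset()
for t in range(50000):
    eps = max(0.1, 1.0 - t / 10000.0)         # annealed epsilon (schedule assumed)
    a = random.randrange(4) if random.random() < eps else int(np.argmax(q_values(state)))
    next_state, reward, done = env_step(a)
    buffer.append((state, a, float(np.clip(reward, -1, 1)), next_state, done))  # store clipped reward
    if len(buffer) >= batch_size:
        train_step(random.sample(list(buffer), batch_size))  # sample minibatch, target per sample, update
    if t % C == 0:
        sync_target()                                        # refresh the separate target network
    state = env_reset() if done else next_state
```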

But it’s not perfect! (Andrej Karpathy’s blog)
 Reward clipping
– The agent can’t distinguish different scales of rewards (e.g., Pacman)
 Limited experience replay
– Might throw away important experiences
– ~10 GB to store experiences
 High computational complexity
– Almost 10 days to train one game on a single GPU! Even slower on physical robots

Beyond DQN (David Silver’s tutorial on Deep Reinforcement Learning, ICML 2016, http://icml.cc/2016/tutorials/deep_rl_tutorial.pdf)
 More stable learning
– Double DQN (van Hasselt, H. et al. (2015)): use two Q-networks, one to select the action, the other to evaluate the action
– Dueling network (Wang, Z. et al. (2015)): split DQN into two channels
 Limited experience replay
– Prioritized Experience Replay (Schaul, T. et al. (2016)): weight experiences according to surprise
 High computational time complexity
– Parallel/distributed computing (Nair, A. et al. (2015))
– Asynchronous RL (A3C) (Mnih, V. et al. (2016)): can be trained on CPU
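As an illustration of the Double DQN idea (assuming the PyTorch sketches above; tensor shapes are assumptions), the online network selects the action and the target network evaluates it:

```python
import torch

def double_dqn_target(q_net, target_net, r, s_next, done, gamma=0.99):
    """y = r + gamma * Q_hat(s', argmax_a' Q(s', a'; theta); theta^-)."""
    with torch.no_grad():
        a_star = q_net(s_next).argmax(dim=1, keepdim=True)         # select the action with the online network
        q_eval = target_net(s_next).gather(1, a_star).squeeze(1)   # evaluate it with the target network
    return r + gamma * (1.0 - done) * q_eval
```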

Beyond DQN (figures from David Silver’s tutorial on Deep Reinforcement Learning, ICML 2016, http://icml.cc/2016/tutorials/deep_rl_tutorial.pdf)

Beyond DQN (David Silver’s tutorial on Deep Reinforcement Learning, ICML 2016, http://icml.cc/2016/tutorials/deep_rl_tutorial.pdf)
 Deep Policy Networks for continuous control
– Simulated robots
– Physical robots

Beyond DQN
 Mastering the game of Go with deep neural networks and tree search
 Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., and Dieleman, S., 2016.

So DQN is not magic
 Q-learning + CNN as function approximator
 Experience replay + separate target network + reward clipping stabilize learning
 To be continued…
