Old Dog Learns New Tricks: Randomized UCB for Bandit Problems

Transcription

Old Dog Learns New Tricks: Randomized UCB for Bandit Problems (AISTATS 2020)

Motivating example: clinical trials
We do not have complete information about the effectiveness or side-effects of the drugs. Aim: infer the "best" drug by running a sequence of trials.
Abstraction to multi-armed bandits: each drug choice is mapped to an arm, and the drug's effectiveness is mapped to the arm's reward.
Administering a drug is an action equivalent to pulling the corresponding arm. The trial goes on for T rounds.

Bandits 101: problem setup
Initialize the expected rewards according to some prior knowledge.
for t = 1, ..., T do
  SELECT: Use a bandit algorithm to decide which arm to pull.
  ACT and OBSERVE: Pull the selected arm and observe the reward.
  UPDATE: Update the estimated reward for the arm(s).
end
Stochastic bandits: the reward for each arm is sampled i.i.d. from its underlying distribution.
Objective: minimize the expected cumulative regret $R(T)$:
$R(T) = \sum_{t=1}^{T} \big( \mathbb{E}[\text{reward of the best arm}] - \mathbb{E}[\text{reward of the arm pulled in round } t] \big).$
Minimizing $R(T)$ boils down to an exploration-exploitation trade-off.
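
To make the loop concrete, here is a minimal Python sketch of the SELECT / ACT-and-OBSERVE / UPDATE cycle and of the regret computation, assuming Bernoulli reward distributions; `run_bandit` and the `select_arm` callback are illustrative names, not from the talk.

```python
import numpy as np

def run_bandit(select_arm, mus, T, seed=0):
    """Generic bandit loop: SELECT, ACT and OBSERVE, UPDATE, for T rounds.

    `mus` are the (unknown to the algorithm) Bernoulli means of the K arms;
    `select_arm(t, pulls, reward_sums)` is any bandit selection rule."""
    rng = np.random.default_rng(seed)
    K = len(mus)
    pulls = np.zeros(K)          # s_i(t): number of pulls of arm i so far
    reward_sums = np.zeros(K)    # Y_i(t): total reward collected from arm i
    regret = 0.0
    for t in range(1, T + 1):
        i = select_arm(t, pulls, reward_sums)      # SELECT
        r = rng.binomial(1, mus[i])                # ACT and OBSERVE
        pulls[i] += 1                              # UPDATE
        reward_sums[i] += r
        regret += max(mus) - mus[i]                # expected per-round regret
    return regret
```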

Bandits 101: structured bandits
In problems with a large number of arms, learning about each arm separately is inefficient; instead, use a shared parameterization for the arms.
Structured bandits: each arm $i$ has a feature vector $x_i$, and there exists an unknown vector $\theta^*$ such that $\mathbb{E}[\text{reward for arm } i] = g(x_i, \theta^*)$.
Linear bandits: $g(x_i, \theta^*) = \langle x_i, \theta^* \rangle$.
Generalized linear bandits: $g$ is a strictly increasing, differentiable link function, e.g. $g(x_i, \theta^*) = 1/(1 + \exp(-\langle x_i, \theta^* \rangle))$ for logistic bandits.
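
As a small illustration of the shared parameterization, the sketch below evaluates the two link functions mentioned above; `expected_reward` and its `link` argument are placeholder names.

```python
import numpy as np

def expected_reward(x_i, theta_star, link="linear"):
    """Expected reward g(x_i, theta*) of arm i under a shared parameter theta*."""
    z = float(np.dot(x_i, theta_star))    # inner product <x_i, theta*>
    if link == "linear":
        return z                          # linear bandit: identity link
    if link == "logistic":
        return 1.0 / (1.0 + np.exp(-z))   # logistic bandit: sigmoid link
    raise ValueError(f"unknown link: {link}")
```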

Bandits 101: algorithms
Optimism in the Face of Uncertainty (OFU): uses closed-form high-probability confidence sets.
Pros: theoretically optimal; does not depend on the exact distribution of rewards.
Cons: poor empirical performance on typical problem instances.
Thompson Sampling (TS): a randomized strategy that samples from a posterior distribution.
Pros: good empirical performance on typical problem instances.
Cons: depends on the reward distributions; computationally expensive in the absence of closed-form posteriors; theoretically sub-optimal in the (generalized) linear bandit setting.
Can we obtain the best of OFU and TS?

The RandUCB meta-algorithm: theoretical study

RandUCB Meta-algorithm
Generic OFU algorithm: if $\hat{\mu}_i(t)$ is the estimated mean reward for arm $i$ at round $t$ and $C_i(t)$ is the corresponding confidence interval, pick the arm with the largest upper confidence bound:
$i_t = \arg\max_{i \in [K]} \{\hat{\mu}_i(t) + \beta\, C_i(t)\}.$
Here $\beta$ is deterministic and chosen to trade off exploration and exploitation optimally.
RandUCB: replace the deterministic $\beta$ by a random variable $Z_t$:
$i_t = \arg\max_{i \in [K]} \{\hat{\mu}_i(t) + Z_t\, C_i(t)\},$
where $Z_1, \ldots, Z_T$ are i.i.d. samples from the sampling distribution.
Uncoupled RandUCB: draw an independent sample $Z_{i,t}$ for each arm:
$i_t = \arg\max_{i \in [K]} \{\hat{\mu}_i(t) + Z_{i,t}\, C_i(t)\}.$
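
The selection rule itself is a one-liner; the sketch below shows both the coupled and the uncoupled variants, with `sample_z` standing in for the sampling distribution defined on the next slide (all names are illustrative).

```python
import numpy as np

def randucb_select(mu_hat, conf, sample_z, rng, coupled=True):
    """One RandUCB decision: argmax_i { mu_hat[i] + Z * conf[i] }.

    Coupled RandUCB shares a single sample Z_t across arms; the uncoupled
    variant draws an independent Z_{i,t} for every arm i.
    `sample_z(rng, size)` returns `size` i.i.d. draws from the sampling
    distribution."""
    K = len(mu_hat)
    z = sample_z(rng, 1)[0] if coupled else sample_z(rng, K)
    return int(np.argmax(np.asarray(mu_hat) + z * np.asarray(conf)))
```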

RandUCB Meta-algorithm
General sampling distribution: a discrete distribution on the interval $[L, U]$, supported on $M$ equally spaced points $\alpha_1 = L, \ldots, \alpha_M = U$. Define $p_m := P(Z = \alpha_m)$.
Default sampling distribution: a Gaussian truncated to the interval $[0, U]$, with tunable hyper-parameters $\varepsilon, \sigma > 0$ such that $p_M = \varepsilon$ and, for $1 \le m \le M - 1$, $p_m \propto \exp(-\alpha_m^2 / (2\sigma^2))$.
Default choice across bandit problems: coupled RandUCB with $U = O(\beta)$, $M = 10$, $\varepsilon = 10^{-8}$, $\sigma = 0.25$.
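
A sketch of the default sampling distribution follows. The slide does not spell out the normalization or whether σ is measured relative to U, so treating σ as a fraction of U and giving the first M−1 points total mass 1−ε are assumptions of this sketch, and both function names are illustrative.

```python
import numpy as np

def default_sampling_distribution(U, M=10, eps=1e-8, sigma=0.25):
    """Discrete distribution on M equally spaced points alpha_1=0, ..., alpha_M=U.

    p_M = eps, and for m < M, p_m is proportional to a Gaussian density
    exp(-alpha_m^2 / (2 sigma^2)); here sigma is interpreted as a fraction
    of U (assumption) and the first M-1 masses are normalized to 1 - eps."""
    alphas = np.linspace(0.0, U, M)
    weights = np.exp(-((alphas[:-1] / U) ** 2) / (2.0 * sigma ** 2))
    probs = np.empty(M)
    probs[:-1] = (1.0 - eps) * weights / weights.sum()
    probs[-1] = eps
    return alphas, probs

def sample_z(rng, size, alphas, probs):
    """Draw i.i.d. samples of Z from the discrete sampling distribution."""
    return rng.choice(alphas, size=size, p=probs)
```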

RandUCB for multi-armed bandits
Let $Y_i(t)$ be the sum of rewards obtained for arm $i$ until round $t$, and $s_i(t)$ the number of pulls of arm $i$ until round $t$.
Mean $\hat{\mu}_i(t) = Y_i(t)/s_i(t)$ and confidence interval $C_i(t) = \sqrt{1/s_i(t)}$.
OFU algorithm for MAB: pull each arm once, and for $t > K$ pull the arm
$i_t = \arg\max_i \big\{ \hat{\mu}_i(t) + \beta \sqrt{1/s_i(t)} \big\}.$
UCB1 [Auer, Cesa-Bianchi and Fischer 2002]: $\beta = \sqrt{2 \ln T}$.
RandUCB: $L = 0$, $U = \sqrt{2 \ln T}$.
We can also construct optimistic Thompson sampling and adaptive ε-greedy algorithms.
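
Putting the pieces together for the K-armed case: a minimal sketch of coupled RandUCB with Bernoulli rewards, reusing the `default_sampling_distribution` helper sketched above; all names are illustrative, not from the paper's code.

```python
import numpy as np

def randucb_mab(mus, T, M=10, eps=1e-8, sigma=0.25, seed=0):
    """Coupled RandUCB for the multi-armed bandit: mu_hat_i + Z * sqrt(1/s_i)."""
    rng = np.random.default_rng(seed)
    K = len(mus)
    U = np.sqrt(2.0 * np.log(T))                       # U = sqrt(2 ln T), L = 0
    alphas, probs = default_sampling_distribution(U, M, eps, sigma)
    pulls = np.ones(K)                                 # pull each arm once first
    sums = rng.binomial(1, mus).astype(float)          # rewards of those pulls
    for t in range(K + 1, T + 1):
        z = rng.choice(alphas, p=probs)                # one Z_t shared by all arms
        index = sums / pulls + z * np.sqrt(1.0 / pulls)
        i = int(np.argmax(index))
        r = rng.binomial(1, mus[i])
        pulls[i] += 1
        sums[i] += r
    return sums / pulls                                # final mean estimates
```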

Regret of RandUCB for multi-armed bandits
Theorem 1 (instance-dependent regret of uncoupled RandUCB for MAB). If $\Delta_i = \mu_1 - \mu_i$ is the gap for arm $i$, and $Z$ takes $M$ different values $0 \le \alpha_1 < \cdots < \alpha_M$ with probabilities $p_1, p_2, \ldots, p_M$, the regret $R(T)$ of uncoupled RandUCB can be bounded as
$R(T) = \tilde{O}\!\left( \sum_{i : \Delta_i > 0} \left[ \frac{\alpha_M^2}{\Delta_i} + \Delta_i\!\left(1 + \frac{2M}{p_M}\right) + 2\,\Delta_i\, T e^{-\alpha_M^2/2} \right] \right).$
Using $U = \alpha_M = \sqrt{2 \ln T}$ results in the problem-dependent $O\!\big(\ln T \sum_i 1/\Delta_i\big)$ regret.
A standard reduction implies a problem-independent $\tilde{O}(\sqrt{KT})$ regret, matching that of UCB1 and Thompson sampling [Agrawal and Goyal, 2012].
We also show the same problem-independent regret for the default coupled variant of RandUCB.
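
For the first bullet, the substitution can be spelled out explicitly; the short LaTeX derivation below uses only the leading $\alpha_M^2/\Delta_i$ term of the bound as reconstructed above, the remaining terms being constant in $T$ for this choice of $\alpha_M$.

```latex
% Substituting U = \alpha_M = \sqrt{2 \ln T} into the leading term of Theorem 1:
\alpha_M^2 = 2 \ln T
\quad\Longrightarrow\quad
\sum_{i : \Delta_i > 0} \frac{\alpha_M^2}{\Delta_i}
  = 2 \ln T \sum_{i : \Delta_i > 0} \frac{1}{\Delta_i}
  = O\!\Big( \ln T \sum_{i} \frac{1}{\Delta_i} \Big).
```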

RandUCB for linear bandits
Let $X_\ell = x_{i_\ell}$ be the feature vector of the arm pulled in round $\ell$ and $Y_\ell$ the observed reward. Define $M_t := \lambda I_d + \sum_{\ell=1}^{t-1} X_\ell X_\ell^{\top}$ and $\hat{\theta}_t := M_t^{-1} \sum_{\ell=1}^{t-1} Y_\ell X_\ell$.
Mean $\hat{\mu}_i(t) = \langle \hat{\theta}_t, x_i \rangle$ and confidence width $C_i(t) = \|x_i\|_{M_t^{-1}}$.
OFU algorithm for linear bandits: pull the arm
$i_t = \arg\max_{i \in [K]} \big\{ \langle \hat{\theta}_t, x_i \rangle + \beta \|x_i\|_{M_t^{-1}} \big\}.$
OFU [Abbasi-Yadkori, Pál and Szepesvári 2011]: $\beta = \lambda^{1/2} + \sqrt{\ln\!\big(T^2 \lambda^{-d} \det(M_t)\big)}$.
RandUCB: $L = 0$, $U = 3\big[\lambda^{1/2} + \sqrt{d \ln\!\big(T + T^2/(d\lambda)\big)}\big]$.
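
The per-round computation for linear RandUCB is just regularized least squares plus a Mahalanobis-norm width; a minimal sketch, assuming a finite arm set stored as a K×d matrix and a `sample_z` callable as above (names are illustrative):

```python
import numpy as np

def randucb_linear_step(X_hist, Y_hist, arms, sample_z, rng, lam=1.0):
    """One round of coupled RandUCB for linear bandits.

    X_hist: (n, d) features of previously pulled arms, Y_hist: (n,) rewards,
    arms:   (K, d) feature vectors x_i of the available arms.
    Index:  <theta_hat, x_i> + Z * ||x_i||_{M_t^{-1}}."""
    d = arms.shape[1]
    M = lam * np.eye(d) + X_hist.T @ X_hist            # M_t = lam I_d + sum X X^T
    M_inv = np.linalg.inv(M)
    theta_hat = M_inv @ (X_hist.T @ Y_hist)            # regularized least squares
    means = arms @ theta_hat                           # <theta_hat, x_i>
    widths = np.sqrt(np.einsum("kd,de,ke->k", arms, M_inv, arms))  # ||x_i||_{M^-1}
    z = sample_z(rng, 1)[0]                            # shared exploration sample
    return int(np.argmax(means + z * widths))
```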

Regret of RandUCB for linear bandits
Theorem 2. Let $c_1 = \lambda^{1/2} + \sqrt{d \ln\!\big(T + T^2/(d\lambda)\big)}$ and $c_3 := 2 d \ln\!\big(1 + \tfrac{T}{d\lambda}\big)$. For any $c_2 \ge c_1$, the regret of RandUCB for linear bandits is bounded by
$\frac{2\,(c_1 + c_2)\big(1 + \sqrt{c_3 T}\big)}{P(Z \ge c_1) - P(Z \ge c_2)} + T\, P(Z \ge c_2) + 1.$
Setting $c_2 = U = 3 c_1$ ensures that $P(Z \ge c_1)$ is a positive constant and $P(Z \ge c_2) \approx 0$, resulting in an $\tilde{O}(d\sqrt{T})$ regret bound.
The regret bound does not depend on $K$ and holds for infinitely many arms.
It matches the bound of OFU in [Abbasi-Yadkori et al., 2011] and is better than the $O(d^{3/2}\sqrt{T})$ bound for TS [Agrawal and Goyal, 2013].
We prove a similar $\tilde{O}(d\sqrt{T})$ bound for generalized linear bandits.
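
To see where $\tilde{O}(d\sqrt{T})$ comes from, here is the order-of-magnitude bookkeeping (logarithmic factors suppressed), based on the definitions of $c_1$ and $c_3$ above:

```latex
% Orders of magnitude, hiding logarithmic factors inside \tilde{O}(\cdot):
c_1,\, c_2 = \tilde{O}\big(\sqrt{d}\big), \qquad
c_3 = 2 d \ln\!\Big(1 + \tfrac{T}{d\lambda}\Big) = \tilde{O}(d), \qquad
\sqrt{c_3 T} = \tilde{O}\big(\sqrt{dT}\big),
% so the dominant term (c_1 + c_2)\sqrt{c_3 T} is
% \tilde{O}\big(\sqrt{d} \cdot \sqrt{dT}\big) = \tilde{O}\big(d\sqrt{T}\big).
```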

The RandUCB meta-algorithm: empirical study

Experiments: multi-armed bandits
B-TS: Thompson sampling with a Beta posterior.
KL-UCB [Garivier and Cappé, 2011]: UCB with tighter confidence intervals.
Randomized exploration baselines: Giro [Kveton et al., 2019c] and PHE [Kveton et al., 2019b].

Experiments: linear bandits
Lin-TS: Thompson sampling with a Gaussian posterior.
ε-greedy [Langford and Zhang, 2008].
Randomized exploration baseline: LinPHE [Kveton et al., 2019a].

Experiments: logistic bandits
GLM-TS [Kveton et al., 2019d]: TS with a Laplace approximation to the posterior.
GLM-UCB [Filippi et al., 2010] and UCB-GLM [Li et al., 2017].
ε-greedy [Langford and Zhang, 2008].
Randomized exploration baseline: LogPHE [Kveton et al., 2019d].

We proposed RandUCB, a generic meta-algorithm achieving the theoretical performance of UCB and the practical performance of Thompson sampling.
Paper: https://arxiv.org/abs/1910.04928
Code: https://github.com/vaswanis/randucb

References
Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In NIPS, 2011.
Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In COLT, 2012.
Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In ICML, 2013.
Sarah Filippi, Olivier Cappé, Aurélien Garivier, and Csaba Szepesvári. Parametric bandits: The generalized linear case. In NIPS, 2010.
Aurélien Garivier and Olivier Cappé. The KL-UCB algorithm for bounded stochastic bandits and beyond. In COLT, 2011.
Branislav Kveton, Csaba Szepesvári, Mohammad Ghavamzadeh, and Craig Boutilier. Perturbed-history exploration in stochastic linear bandits. In UAI, 2019a.
Branislav Kveton, Csaba Szepesvári, Mohammad Ghavamzadeh, and Craig Boutilier. Perturbed-history exploration in stochastic multi-armed bandits. In IJCAI, 2019b.
Branislav Kveton, Csaba Szepesvári, Sharan Vaswani, Zheng Wen, Tor Lattimore, and Mohammad Ghavamzadeh. Garbage in, reward out: Bootstrapping exploration in multi-armed bandits. In ICML, 2019c.
Branislav Kveton, Manzil Zaheer, Csaba Szepesvári, Lihong Li, Mohammad Ghavamzadeh, and Craig Boutilier. Randomized exploration in generalized linear bandits. arXiv:1906.08947, 2019d.
John Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In NIPS, 2008.
Lihong Li, Yu Lu, and Dengyong Zhou. Provably optimal algorithms for generalized linear contextual bandits. In ICML, 2017.
