Multi-Armed Bandit Problem

  • Equivalent to a single state MDP. One episode is just a single timestep - choosing one of the levers to pull.
  • Each lever has a different expected payout - we don’t know what it is.
  • Goal is to maximize cumulative reward over a long period of time.
  • Formally, a multi-armed bandit is a tuple $\langle \mathcal{A}, \mathcal{R} \rangle$
    • $\mathcal{A}$ is a known set of actions (or which arm to pull)
    • $\mathcal{R}^a(r) = \mathbb{P}[r \mid a]$ is an unknown distribution of rewards or payouts for each action
    • At each time step $t$, the agent picks an action $a_t \in \mathcal{A}$, and the environment generates a reward $r_t \sim \mathcal{R}^{a_t}$
  • The action-value is the mean reward for action $a$: $Q(a) = \mathbb{E}[r \mid a]$
  • The optimal value is $V^* = Q(a^*) = \max_{a \in \mathcal{A}} Q(a)$
  • The regret is how much worse we did compared to the optimal action - the opportunity loss for one step: $l_t = \mathbb{E}[V^* - Q(a_t)]$
  • Total regret is the total opportunity loss: $L_t = \mathbb{E}\left[ \sum_{\tau=1}^{t} V^* - Q(a_\tau) \right]$
  • Goal = maximize reward = minimize regret (a minimal environment sketch follows after this list)
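A minimal sketch of this setup in code, assuming unit-variance Gaussian payouts; the class name GaussianBandit, the choice of reward distribution, and the specific means are illustrative assumptions, not part of the notes.

```python
import numpy as np

class GaussianBandit:
    """Toy bandit: each arm's payout is Gaussian around its (unknown) true mean Q(a)."""
    def __init__(self, means, seed=0):
        self.means = np.asarray(means, dtype=float)   # true Q(a), hidden from the agent
        self.rng = np.random.default_rng(seed)

    def optimal_value(self):
        return self.means.max()                       # V* = max_a Q(a)

    def pull(self, a):
        return self.rng.normal(self.means[a], 1.0)    # r_t ~ R^{a_t}

bandit = GaussianBandit([0.1, 0.5, 0.9])
a_t = 1
r_t = bandit.pull(a_t)
regret_t = bandit.optimal_value() - bandit.means[a_t]  # one-step regret l_t = V* - Q(a_t)
print(r_t, regret_t)
```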

Counting Regret

  • Count = $N_t(a)$ = number of times we have pulled the lever for action $a$ up to time $t$
  • Gap = $\Delta_a$ = difference between the value of action $a$ and the optimal action: $\Delta_a = V^* - Q(a)$
  • And we have regret $L_t = \mathbb{E}\left[ \sum_{\tau=1}^{t} V^* - Q(a_\tau) \right]$
  • Then, we can write regret as a function of counts and gaps - and write it as a summation over actions: $L_t = \sum_{a \in \mathcal{A}} \mathbb{E}[N_t(a)] \, \Delta_a$
  • So, to have a small regret, we need small counts for large gaps. But we don’t know $Q(a)$, and therefore don’t know the gaps. (A numeric example of this decomposition follows after this list.)
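A small numeric illustration of the decomposition $L_t = \sum_a \mathbb{E}[N_t(a)] \Delta_a$; the true values and counts here are made up for the example.

```python
import numpy as np

Q = np.array([0.1, 0.5, 0.9])      # hypothetical true action values (unknown in practice)
N = np.array([40, 35, 925])        # hypothetical counts N_t(a) after t = 1000 pulls
gaps = Q.max() - Q                 # Delta_a = V* - Q(a)
total_regret = np.sum(N * gaps)    # L_t = sum_a N_t(a) * Delta_a
print(gaps, total_regret)          # small regret needs small counts on the large gaps
```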

Greedy Approach

  • Idea: Estimate $\hat{Q}_t(a)$ through Monte Carlo, by averaging the observed rewards: $\hat{Q}_t(a) = \frac{1}{N_t(a)} \sum_{\tau=1}^{t} r_\tau \mathbf{1}(a_\tau = a)$
  • Then, just select the highest value action: $a_t = \operatorname{argmax}_{a \in \mathcal{A}} \hat{Q}_t(a)$ (see the sketch after this list)
  • Problem: Can lock on to a suboptimal action forever.
  • Outcome: Linear total regret.
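A sketch of the pure greedy agent under the same illustrative Gaussian-payout assumption as above; function and variable names are mine, not from the notes.

```python
import numpy as np

def run_greedy(true_means, steps=1000, seed=0):
    """Pure greedy: always pick the arm with the highest Monte Carlo estimate."""
    rng = np.random.default_rng(seed)
    means = np.asarray(true_means, dtype=float)
    q_hat = np.zeros(len(means))             # Q_hat(a), Monte Carlo average per arm
    counts = np.zeros(len(means))            # N_t(a)
    total_regret = 0.0
    for _ in range(steps):
        a = int(np.argmax(q_hat))            # exploit only, never explore
        r = rng.normal(means[a], 1.0)
        counts[a] += 1
        q_hat[a] += (r - q_hat[a]) / counts[a]   # incremental average
        total_regret += means.max() - means[a]
    return total_regret

print(run_greedy([0.1, 0.5, 0.9]))  # a bad early sample can lock it onto a suboptimal arm
```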

Optimistic Initialization

  • Idea: Initialize all $\hat{Q}(a)$ to the maximum possible value (say $r^{\max}$). Everything is good until proven otherwise.
  • Update values through incremental Monte Carlo, $\hat{Q}_t(a_t) = \hat{Q}_{t-1}(a_t) + \frac{1}{N_t(a_t)} \left( r_t - \hat{Q}_{t-1}(a_t) \right)$, with the initial counts $N(a)$ also set to a high value
  • Then act greedily
  • Encourages exploration of unknown values until they are really proven to be bad.
  • But same problem: it can still get unlucky (though not as frequently as greedy) and lock onto a suboptimal action. So again, linear total regret. (A sketch follows after this list.)
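A sketch of optimistic initialization under the same illustrative assumptions; q_init stands in for the maximum possible value and n_init for the high initial count, and both numbers are arbitrary choices for the example.

```python
import numpy as np

def run_optimistic(true_means, steps=1000, q_init=5.0, n_init=5, seed=0):
    """Greedy over optimistically initialized estimates.

    q_init plays the role of the maximum possible value; a larger n_init
    makes the optimism decay more slowly.
    """
    rng = np.random.default_rng(seed)
    means = np.asarray(true_means, dtype=float)
    q_hat = np.full(len(means), q_init)       # everything looks good until proven otherwise
    counts = np.full(len(means), n_init, dtype=float)
    total_regret = 0.0
    for _ in range(steps):
        a = int(np.argmax(q_hat))
        r = rng.normal(means[a], 1.0)
        counts[a] += 1
        q_hat[a] += (r - q_hat[a]) / counts[a]   # incremental Monte Carlo update
        total_regret += means.max() - means[a]
    return total_regret

print(run_optimistic([0.1, 0.5, 0.9]))
```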

ε-Greedy

  • Be greedy with probability $1 - \epsilon$, and pick a random action with probability $\epsilon$
  • Constant $\epsilon$-greedy still has linear total regret
  • Decaying $\epsilon$-greedy is logarithmic! (A sketch of both variants follows after this list.)
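A sketch of $\epsilon$-greedy with a constant and a decaying $\epsilon$, again under the illustrative Gaussian-payout assumption; the $1/t$ decay schedule is just an example, since the schedules that provably give logarithmic regret depend on the (unknown) gaps.

```python
import numpy as np

def run_eps_greedy(true_means, steps=1000, eps=0.1, decay=False, seed=0):
    """eps-greedy: random arm with probability eps, greedy arm otherwise."""
    rng = np.random.default_rng(seed)
    means = np.asarray(true_means, dtype=float)
    k = len(means)
    q_hat = np.zeros(k)
    counts = np.zeros(k)
    total_regret = 0.0
    for t in range(1, steps + 1):
        eps_t = 1.0 / t if decay else eps    # illustrative 1/t decay schedule
        if rng.random() < eps_t:
            a = int(rng.integers(k))         # explore
        else:
            a = int(np.argmax(q_hat))        # exploit
        r = rng.normal(means[a], 1.0)
        counts[a] += 1
        q_hat[a] += (r - q_hat[a]) / counts[a]
        total_regret += means.max() - means[a]
    return total_regret

print(run_eps_greedy([0.1, 0.5, 0.9], eps=0.1))     # constant eps: regret keeps growing linearly
print(run_eps_greedy([0.1, 0.5, 0.9], decay=True))  # decaying eps: regret growth slows down
```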

Lower Bound