Multi-Armed Bandit Problem
- Equivalent to a single-state MDP. One episode is just a single timestep: choosing one of the m levers to pull.
- Each lever has a different expected payout - we don’t know what it is.
- Goal is to maximize cumulative reward over a long period of time.
- Formally, a multi-armed bandit is a tuple ⟨A, R⟩
- A is the action set (or which arm to pull)
- R^a(r) = P[R = r ∣ A = a] is an unknown distribution of rewards or payouts for each action
- At each time step t, the agent picks an action A_t ∈ A, and the environment generates a reward R_t ∼ R^{A_t}
- The action-value is the mean reward for action a
Q(a)=E[r∣a]
- The optimal value V∗ is
V∗ = Q(a∗) = max_{a∈A} Q(a)
- The regret is how much worse we did compared to the optimal action - the opportunity loss for one step
l_t = E[V∗ − Q(a_t)]
- Total regret is the total opportunity loss
L_t = E[ Σ_{τ=1}^{t} (V∗ − Q(a_τ)) ]
- Goal = maximise cumulative reward = minimise total regret (a minimal bandit sketch follows below)
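As a concrete illustration, here is a minimal sketch in Python of a bandit with Bernoulli payouts plus the regret bookkeeping defined above. The `BernoulliBandit` class, the arm payout probabilities, and the `pull` method are illustrative assumptions, not anything from the original notes.

```python
import numpy as np

class BernoulliBandit:
    """A multi-armed bandit ⟨A, R⟩ with Bernoulli payouts, plus regret bookkeeping."""

    def __init__(self, payout_probs):
        self.payout_probs = np.asarray(payout_probs, dtype=float)  # true means Q(a), unknown to the agent
        self.v_star = self.payout_probs.max()                      # V∗ = max_a Q(a)
        self.total_regret = 0.0                                    # L_t, accumulated opportunity loss

    def pull(self, a):
        """Agent picks A_t = a; environment draws a reward R_t from that arm's distribution."""
        reward = float(np.random.rand() < self.payout_probs[a])
        self.total_regret += self.v_star - self.payout_probs[a]    # one-step regret is the gap Δ_a
        return reward

# Usage: always pulling a bad arm accrues regret linearly.
bandit = BernoulliBandit([0.1, 0.5, 0.8])   # arm 2 is optimal, so V∗ = 0.8
for _ in range(10):
    bandit.pull(0)                          # gap Δ_0 = 0.7 per pull
print(bandit.total_regret)                  # -> 7.0 (exact, since regret uses true means, not samples)
```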
Counting Regret
- Count = N_t(a) = (expected) number of times action a has been selected up to time t
- Gap = Δ_a = difference between the optimal value and the value of action a: Δ_a = V∗ − Q(a)
- And, we have regret
L_t = E[ Σ_{τ=1}^{t} (V∗ − Q(a_τ)) ]
- Then, we can write regret as a function of counts and gaps, i.e. as a summation over actions
L_t = Σ_{a∈A} E[N_t(a)] (V∗ − Q(a)) = Σ_{a∈A} E[N_t(a)] Δ_a
- So, to have small regret, we need small counts for the actions with large gaps. But we don’t know V∗, and therefore don’t know the gaps. (The decomposition in code is sketched below.)
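The count-gap decomposition can be written directly in code. A small sketch, assuming we are handed the per-action counts N_t(a) and the true values Q(a) (which the agent does not actually know):

```python
import numpy as np

def total_regret(counts, q_true):
    """L_t = Σ_a E[N_t(a)] Δ_a, with gaps Δ_a = V∗ − Q(a)."""
    counts = np.asarray(counts, dtype=float)   # N_t(a) for each action
    q_true = np.asarray(q_true, dtype=float)   # true values Q(a) (unknown in practice)
    gaps = q_true.max() - q_true               # Δ_a = V∗ − Q(a)
    return float(np.dot(counts, gaps))         # small counts on large gaps => small regret

# Example: 10 pulls of the worst arm (gap 0.7), 90 of the optimal arm (gap 0).
print(total_regret([10, 0, 90], [0.1, 0.5, 0.8]))  # -> 7.0
```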
Greedy Approach
- Idea: Estimate Q̂_t(a) by Monte Carlo, i.e. by averaging the rewards observed for each action.
- Then, just select the highest value action
a_t∗ = argmax_{a∈A} Q̂_t(a)
- Problem: Can lock on to a suboptimal action forever.
- Outcome: Linear total regret. (A sketch of this agent follows below.)
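A sketch of the pure greedy agent, estimating Q̂_t(a) by a running average of observed rewards and always picking the current best arm. It assumes an environment like the `BernoulliBandit` above (anything exposing a `pull(a)` method works); the function and its signature are illustrative.

```python
import numpy as np

def run_greedy(bandit, n_actions, n_steps):
    """Pure greedy agent: always pick the arm with the highest current estimate."""
    q_hat = np.zeros(n_actions)    # Monte Carlo estimates Q̂_t(a)
    counts = np.zeros(n_actions)   # N_t(a)
    for _ in range(n_steps):
        a = int(np.argmax(q_hat))                # a_t∗ = argmax_a Q̂_t(a) (ties break toward arm 0)
        r = bandit.pull(a)
        counts[a] += 1
        q_hat[a] += (r - q_hat[a]) / counts[a]   # running average of observed rewards
    return q_hat, counts
```

If an inferior arm happens to pay out early, its estimate stays above the barely-tried optimal arm and the agent keeps pulling it, which is exactly the lock-on behaviour that makes total regret grow linearly in t.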
Optimistic Initialization
- Idea: Initialize all Q(a) to the maximum possible value (say r_max). Everything is good until proven otherwise.
- Update values through incremental Monte Carlo, with N_t(a) also initialized to a high value (say N_initial)
Q̂_t(a_t) = Q̂_{t−1}(a_t) + (1/N_t(a_t)) (r_t − Q̂_{t−1}(a_t))
- Then act greedily
A_t = argmax_{a∈A} Q̂_t(a)
- Encourages exploration of unknown values until they are really proven to be bad.
- But same problem: we can still be unlucky (though less often than plain greedy) and lock onto a suboptimal action. So again, linear total regret. (A sketch follows below.)
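A sketch of optimistic initialization; it differs from the greedy agent only in how Q̂ and N are initialized. Here r_max = 1 matches Bernoulli rewards, and the initial count `n_initial` is an illustrative choice.

```python
import numpy as np

def run_optimistic_greedy(bandit, n_actions, n_steps, r_max=1.0, n_initial=5):
    """Greedy agent with optimistic initial values (r_max and n_initial are illustrative)."""
    q_hat = np.full(n_actions, r_max)              # every arm looks maximally good until proven otherwise
    counts = np.full(n_actions, float(n_initial))  # pretend we already have n_initial optimistic samples
    for _ in range(n_steps):
        a = int(np.argmax(q_hat))                  # still act greedily w.r.t. Q̂_t
        r = bandit.pull(a)
        counts[a] += 1
        q_hat[a] += (r - q_hat[a]) / counts[a]     # incremental Monte Carlo update
    return q_hat, counts
```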
ϵ-greedy
- With probability ϵ select a random action; with probability 1 − ϵ select the greedy action A_t = argmax_{a∈A} Q̂_t(a)
- Constant ϵ-greedy still has linear total regret, since we keep selecting suboptimal actions at a constant rate forever
- Decaying ϵ-greedy achieves logarithmic total regret! (The schedule that guarantees this requires advance knowledge of the gaps; a sketch with a simple 1/t decay follows below.)
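A sketch of ϵ-greedy with a decaying schedule, again assuming a `BernoulliBandit`-style environment. The 1/t decay used as the default is an illustrative choice, not the gap-dependent schedule that the logarithmic-regret result relies on.

```python
import numpy as np

def run_eps_greedy(bandit, n_actions, n_steps, eps_schedule=lambda t: 1.0 / (t + 1)):
    """ϵ-greedy agent; the 1/t decay schedule here is illustrative only."""
    q_hat = np.zeros(n_actions)
    counts = np.zeros(n_actions)
    for t in range(n_steps):
        if np.random.rand() < eps_schedule(t):
            a = np.random.randint(n_actions)     # explore: random action with probability ϵ_t
        else:
            a = int(np.argmax(q_hat))            # exploit: greedy action with probability 1 − ϵ_t
        r = bandit.pull(a)
        counts[a] += 1
        q_hat[a] += (r - q_hat[a]) / counts[a]   # running average, as before
    return q_hat, counts
```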
Lower Bound