Multi-Armed Bandit Problem
- Equivalent to a single-state MDP. One episode is just a single timestep: choosing one of the m levers to pull.
- Each lever has a different expected payout - we don’t know what it is.
- Goal is to maximize cumulative reward over a long period of time.
- Formally, a multi-armed bandit is a tuple ⟨A, R⟩
- A is the action set (or which arm to pull)
- R^a(r) = P[R = r ∣ A = a] is an unknown distribution of rewards or payouts for each action
- At each time step t, the agent picks an action A_t ∈ A, and the environment generates a reward R_t ∼ R^{A_t}
- The action-value is the mean reward for action a
Q(a)=E[r∣a]
- The optimal value V∗ is
V∗ = Q(a∗) = max_{a∈A} Q(a)
- The regret is how much worse we did compared to the optimal action - the opportunity loss for one step
l_t = E[V∗ − Q(a_t)]
- Total regret is the total opportunity loss
L_t = E[ Σ_{τ=1}^{t} (V∗ − Q(a_τ)) ]
- Goal = maximise cumulative reward = minimise total regret (a minimal bandit sketch follows below)
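As a concrete illustration, here is a minimal sketch in Python of a bandit with Bernoulli payouts plus the regret bookkeeping defined above. The `BernoulliBandit` class, the arm payout probabilities, and the `pull` method are illustrative assumptions, not anything from the original notes.

```python
import numpy as np

class BernoulliBandit:
    """A multi-armed bandit ⟨A, R⟩ with Bernoulli payouts, plus regret bookkeeping."""

    def __init__(self, payout_probs):
        self.payout_probs = np.asarray(payout_probs, dtype=float)  # true means Q(a), unknown to the agent
        self.v_star = self.payout_probs.max()                      # V∗ = max_a Q(a)
        self.total_regret = 0.0                                    # L_t, accumulated opportunity loss

    def pull(self, a):
        """Agent picks A_t = a; environment draws a reward R_t from that arm's distribution."""
        reward = float(np.random.rand() < self.payout_probs[a])
        self.total_regret += self.v_star - self.payout_probs[a]    # one-step regret is the gap Δ_a
        return reward

# Usage: always pulling a bad arm accrues regret linearly.
bandit = BernoulliBandit([0.1, 0.5, 0.8])   # arm 2 is optimal, so V∗ = 0.8
for _ in range(10):
    bandit.pull(0)                          # gap Δ_0 = 0.7 per pull
print(bandit.total_regret)                  # -> 7.0 (exact, since regret uses true means, not samples)
```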
Counting Regret
- Count = N_t(a) = (expected) number of times action a has been selected up to time t
- Gap = Δ_a = difference between the optimal value and the value of action a: Δ_a = V∗ − Q(a)
- And, we have regret
L_t = E[ Σ_{τ=1}^{t} (V∗ − Q(a_τ)) ]
- Then, we can write regret as a function of counts and gaps, i.e. as a summation over actions
L_t = Σ_{a∈A} E[N_t(a)] (V∗ − Q(a)) = Σ_{a∈A} E[N_t(a)] Δ_a
- So, to have small regret, we need small counts for the actions with large gaps. But we don’t know V∗, and therefore don’t know the gaps. (The decomposition in code is sketched below.)
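The count-gap decomposition can be written directly in code. A small sketch, assuming we are handed the per-action counts N_t(a) and the true values Q(a) (which the agent does not actually know):

```python
import numpy as np

def total_regret(counts, q_true):
    """L_t = Σ_a E[N_t(a)] Δ_a, with gaps Δ_a = V∗ − Q(a)."""
    counts = np.asarray(counts, dtype=float)   # N_t(a) for each action
    q_true = np.asarray(q_true, dtype=float)   # true values Q(a) (unknown in practice)
    gaps = q_true.max() - q_true               # Δ_a = V∗ − Q(a)
    return float(np.dot(counts, gaps))         # small counts on large gaps => small regret

# Example: 10 pulls of the worst arm (gap 0.7), 90 of the optimal arm (gap 0).
print(total_regret([10, 0, 90], [0.1, 0.5, 0.8]))  # -> 7.0
```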
Greedy Approach
- Idea: Estimate Q̂_t(a) by Monte Carlo, i.e. by averaging the rewards observed for each action.
- Then, just select the highest value action
a_t∗ = argmax_{a∈A} Q̂_t(a)
- Problem: Can lock on to a suboptimal action forever.
- Outcome: Linear total regret. (A sketch of this agent follows below.)
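A sketch of the pure greedy agent, estimating Q̂_t(a) by a running average of observed rewards and always picking the current best arm. It assumes an environment like the `BernoulliBandit` above (anything exposing a `pull(a)` method works); the function and its signature are illustrative.

```python
import numpy as np

def run_greedy(bandit, n_actions, n_steps):
    """Pure greedy agent: always pick the arm with the highest current estimate."""
    q_hat = np.zeros(n_actions)    # Monte Carlo estimates Q̂_t(a)
    counts = np.zeros(n_actions)   # N_t(a)
    for _ in range(n_steps):
        a = int(np.argmax(q_hat))                # a_t∗ = argmax_a Q̂_t(a) (ties break toward arm 0)
        r = bandit.pull(a)
        counts[a] += 1
        q_hat[a] += (r - q_hat[a]) / counts[a]   # running average of observed rewards
    return q_hat, counts
```

If an inferior arm happens to pay out early, its estimate stays above the barely-tried optimal arm and the agent keeps pulling it, which is exactly the lock-on behaviour that makes total regret grow linearly in t.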
Optimistic Initialization
- Idea: Initialize all Q(a) to the maximum possible value (say r_max). Everything is good until proven otherwise.
- Update values through incremental Monte Carlo, with N_t(a) also initialized to a high value (say N_initial)
Q̂_t(a_t) = Q̂_{t−1}(a_t) + (1/N_t(a_t)) (r_t − Q̂_{t−1}(a_t))
- Then act greedily
A_t = argmax_{a∈A} Q̂_t(a)
- Encourages exploration of unknown values until they are really proven to be bad.
- But same problem: we can still be unlucky (though less often than plain greedy) and lock onto a suboptimal action. So again, linear total regret. (A sketch follows below.)
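A sketch of optimistic initialization; it differs from the greedy agent only in how Q̂ and N are initialized. Here r_max = 1 matches Bernoulli rewards, and the initial count `n_initial` is an illustrative choice.

```python
import numpy as np

def run_optimistic_greedy(bandit, n_actions, n_steps, r_max=1.0, n_initial=5):
    """Greedy agent with optimistic initial values (r_max and n_initial are illustrative)."""
    q_hat = np.full(n_actions, r_max)              # every arm looks maximally good until proven otherwise
    counts = np.full(n_actions, float(n_initial))  # pretend we already have n_initial optimistic samples
    for _ in range(n_steps):
        a = int(np.argmax(q_hat))                  # still act greedily w.r.t. Q̂_t
        r = bandit.pull(a)
        counts[a] += 1
        q_hat[a] += (r - q_hat[a]) / counts[a]     # incremental Monte Carlo update
    return q_hat, counts
```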
ϵ-greedy
- With probability ϵ select a random action; with probability 1 − ϵ select the greedy action A_t = argmax_{a∈A} Q̂_t(a)
- Constant ϵ-greedy still has linear total regret, since we keep selecting suboptimal actions at a constant rate forever
- Decaying ϵ-greedy achieves logarithmic total regret! (The schedule that guarantees this requires advance knowledge of the gaps; a sketch with a simple 1/t decay follows below.)
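A sketch of ϵ-greedy with a decaying schedule, again assuming a `BernoulliBandit`-style environment. The 1/t decay used as the default is an illustrative choice, not the gap-dependent schedule that the logarithmic-regret result relies on.

```python
import numpy as np

def run_eps_greedy(bandit, n_actions, n_steps, eps_schedule=lambda t: 1.0 / (t + 1)):
    """ϵ-greedy agent; the 1/t decay schedule here is illustrative only."""
    q_hat = np.zeros(n_actions)
    counts = np.zeros(n_actions)
    for t in range(n_steps):
        if np.random.rand() < eps_schedule(t):
            a = np.random.randint(n_actions)     # explore: random action with probability ϵ_t
        else:
            a = int(np.argmax(q_hat))            # exploit: greedy action with probability 1 − ϵ_t
        r = bandit.pull(a)
        counts[a] += 1
        q_hat[a] += (r - q_hat[a]) / counts[a]   # running average, as before
    return q_hat, counts
```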
Lower Bound