• On policy vs Off policy: Learning on the job vs learning while using someone else’s policy
  • In most real problems,
    • MDP model is unknown - but experience can be sampled.
    • MDP model is known - but is too big to use, except by sampling, so it doesn’t matter anyway.
  • On policy learning:
    • Learn on the job
    • Learn about policy $\pi$ from experience sampled from $\pi$ - sample actions from the policy while at the same time evaluating it.
  • Off policy learning:
    • Looking over someone’s shoulder
    • Learn about (evaluate) the target policy $\pi$ from experience sampled from another behaviour policy $\mu$

On Policy Learning

  • Revision: Policy iteration is
    1. Evaluate the current policy's value function
    2. Improve the policy by acting greedily with respect to this value function
    3. Repeat until convergence to the optimal policy
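
A minimal sketch of this loop for a tiny, randomly generated MDP, just to make the evaluate/improve/repeat structure concrete. The `P` and `R` arrays are hypothetical placeholders standing in for a known model (this is the planning setting being revised, not the model-free setting this section is about):

```python
import numpy as np

# Policy iteration sketch for a tiny *known* MDP.
# P[a][s][s'] = transition probability, R[a][s] = expected reward; both are
# hypothetical placeholders for a real model.
n_states, n_actions, gamma = 4, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # random dynamics
R = rng.normal(size=(n_actions, n_states))                        # random rewards

policy = np.zeros(n_states, dtype=int)
while True:
    # 1. Policy evaluation: solve V = R_pi + gamma * P_pi @ V exactly.
    P_pi = P[policy, np.arange(n_states)]
    R_pi = R[policy, np.arange(n_states)]
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
    # 2. Policy improvement: act greedily w.r.t. the evaluated value function.
    Q = R + gamma * P @ V
    new_policy = Q.argmax(axis=0)
    # 3. Repeat until the policy stops changing.
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy

print("greedy policy per state:", policy)
```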

Monte Carlo Control

Monte-Carlo Evaluation

  • Can we plug in Monte-Carlo to evaluate value function and then iterate and improve the policy?
  • Two problems with this:
    • Greedy policy improvement over the state-value function $V(s)$ (such as via the Bellman expectation backup) needs the state transition probabilities - the dynamics model of the MDP - because the greedy action is $\arg\max_a \big[\mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'} V(s')\big]$. We don't have that model in the model-free setting.
      • The alternative is to use the action-value function: evaluate $Q(s,a)$ with Monte Carlo for state-action pairs, and then choose the greedy policy $\pi'(s) = \arg\max_a Q(s,a)$, which allows us to do control in a model-free way.
    • But, greedy policy means we may not explore the entire state space - so we may get stuck and not see the states and actions that contribute to the correct estimate of the value function.
      • So instead of being greedy, we choose to be $\epsilon$-greedy. We choose the greedy option most of the time, but with a small probability $\epsilon$, we choose an action uniformly at random from the set of all actions (a minimal sketch follows this list).
      • This, asymptotically, guarantees we will explore all possible actions.
  • Just like before in value iteration - we can do just a partial policy evaluation instead of running it all the way to convergence. What does that look like for Monte Carlo?
    • We improve the policy after each episode - each episode is considered one round of policy evaluation.
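
Here is a minimal sketch of $\epsilon$-greedy action selection over a tabular action-value function; the dictionary-based `Q` and the helper name are illustrative choices, not anything prescribed by the notes:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    """Pick the greedy action w.r.t. Q most of the time, and a uniformly
    random action with probability epsilon.  Q is assumed to be a dict
    keyed by (state, action) pairs."""
    if random.random() < epsilon:
        return random.choice(actions)                          # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit
```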

GLIE (Greedy in the Limit with Infinite Exploration)

  • GLIE entails 2 conditions:
    • You make sure that asymptotically, all state-action pairs are explored infinitely many times.
    • Eventually, the policy converges on a greedy (not just $\epsilon$-greedy) policy
  • One way to achieve this is to decay $\epsilon_k$ to $0$ with some schedule, say $\epsilon_k = \frac{1}{k}$, $k$ being the number of episodes sampled.

GLIE Monte Carlo Control

  • Sample the $k$-th episode using the current policy $\pi$: $\{S_1, A_1, R_2, \ldots, S_T\} \sim \pi$
  • Incrementally update the running-mean estimate of the action-value function for each state-action pair $(S_t, A_t)$ in the sampled episode: $N(S_t, A_t) \leftarrow N(S_t, A_t) + 1$, then $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \frac{1}{N(S_t, A_t)}\big(G_t - Q(S_t, A_t)\big)$
  • Let the new policy be $\epsilon$-greedy on this estimated action-value function, with $\epsilon = \frac{1}{k}$
  • Your policy will eventually converge to the optimal policy, and $Q(s, a) \to q_*(s, a)$
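
A sketch of the full GLIE Monte-Carlo control loop, reusing the `epsilon_greedy` helper from above. The gym-like `env.reset()` / `env.step()` interface (returning `next_state, reward, done`) is an assumption made for illustration:

```python
from collections import defaultdict

def glie_mc_control(env, actions, num_episodes, gamma=1.0):
    """GLIE Monte-Carlo control sketch (every-visit, incremental means)."""
    Q = defaultdict(float)   # Q[(s, a)] -> running mean of observed returns
    N = defaultdict(int)     # N[(s, a)] -> visit count
    for k in range(1, num_episodes + 1):
        epsilon = 1.0 / k                       # GLIE schedule: eps_k = 1/k
        # Sample the k-th episode with the current eps-greedy policy.
        episode, state, done = [], env.reset(), False
        while not done:
            action = epsilon_greedy(Q, state, actions, epsilon)
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state
        # One backward pass gives every return G_t; update the running means.
        G = 0.0
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            N[(state, action)] += 1
            Q[(state, action)] += (G - Q[(state, action)]) / N[(state, action)]
    return Q
```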

TD Control

TD v/s MC

  • Advantages of TD over MC:
    • TD has lower variance
    • TD can learn online, at every step
    • TD can learn from incomplete sequences
  • So, the natural idea then is:
    • Apply TD to $Q(S, A)$
    • Use $\epsilon$-greedy policy improvement
    • Update every time step instead of every episode.

SARSA: On-Policy Control

  • Initialize $Q(s,a)$ arbitrarily. For each episode, initialize $S$ and choose $A$ from $S$ using the $\epsilon$-greedy policy derived from $Q$. Then, for each step of the episode (a code sketch follows this list):
    1. We start with a state-action pair $(S, A)$ (no sampling yet)
    2. We want to update $Q(S, A)$ under our $\epsilon$-greedy policy.
    3. We take action $A$ and observe/sample from the environment the reward $R$ we get and the state $S'$ we end up in.
    4. Sample from our own policy (hence on-policy) the next action $A'$
    5. Make this update: $Q(S, A) \leftarrow Q(S, A) + \alpha\left(R + \gamma\, Q(S', A') - Q(S, A)\right)$
    6. Set $S \leftarrow S'$ and $A \leftarrow A'$, then repeat from step 1 for the new pair in the next step.
  • SARSA converges to the optimal action-value function, $Q(s,a) \to q_*(s,a)$, if we use:
    • a GLIE sequence of policies, and
    • a Robbins-Monro sequence of step-sizes $\alpha_t$: $\sum_{t=1}^{\infty} \alpha_t = \infty$ (not too small, so the estimates can move arbitrarily far) and $\sum_{t=1}^{\infty} \alpha_t^2 < \infty$ (not too large, so they eventually settle).
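
A tabular SARSA sketch under the same assumed environment interface and `epsilon_greedy` helper as above; the fixed `alpha` and `epsilon` are simplifications (GLIE and Robbins-Monro schedules would decay them):

```python
from collections import defaultdict

def sarsa(env, actions, num_episodes, alpha=0.1, gamma=1.0, epsilon=0.1):
    """On-policy TD control: the action used in the TD target is sampled
    from the same eps-greedy policy that is being followed."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        state, done = env.reset(), False
        action = epsilon_greedy(Q, state, actions, epsilon)
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(Q, next_state, actions, epsilon)
            td_target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
            Q[(state, action)] += alpha * (td_target - Q[(state, action)])
            state, action = next_state, next_action
    return Q
```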

$n$-step SARSA

Same as $n$-step TD but with action-value functions $Q(s,a)$.

  • $n = 1$ (SARSA): $q^{(1)}_t = R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1})$
  • $n = 2$: $q^{(2)}_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 Q(S_{t+2}, A_{t+2})$
  • …and so on, up to $n = \infty$ (MC): $q^{(\infty)}_t = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{T-t-1} R_T$
  • $n$-step return: $q^{(n)}_t = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{n-1} R_{t+n} + \gamma^n Q(S_{t+n}, A_{t+n})$
  • $n$-step SARSA learning: $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\left(q^{(n)}_t - Q(S_t, A_t)\right)$
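
A small illustrative helper for the $n$-step target; the names and the assumption that at least $n$ more steps of the episode are available are choices of this sketch:

```python
def n_step_sarsa_target(rewards, Q, bootstrap_sa, n, gamma=1.0):
    """n-step return: discounted sum of the next n rewards
    [R_{t+1}, ..., R_{t+n}] plus a bootstrap from Q(S_{t+n}, A_{t+n})."""
    g = sum(gamma**i * r for i, r in enumerate(rewards[:n]))
    return g + gamma**n * Q[bootstrap_sa]
```

The $n$-step SARSA update then moves $Q(S_t, A_t)$ a fraction $\alpha$ of the way towards this target.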

SARSA($\lambda$)

  • Same as TD($\lambda$) - use a weighted sum of all $n$-step returns as the target: $q^{\lambda}_t = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} q^{(n)}_t$.
  • Again, to keep it online, we can't do forward-view SARSA($\lambda$), because that needs us to go to the end of the episode, which may not exist.
    • Again, we build eligibility traces $E_t(s,a)$, one per state-action pair, to do a backward view: every $Q(s,a)$ is updated in proportion to the one-step TD error $\delta_t$ and its trace, $Q(s,a) \leftarrow Q(s,a) + \alpha\,\delta_t\,E_t(s,a)$, where $E_t(s,a) = \gamma\lambda\, E_{t-1}(s,a) + \mathbf{1}(S_t = s, A_t = a)$.
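
A backward-view SARSA($\lambda$) sketch with accumulating eligibility traces, again under the assumed `env` interface and the `epsilon_greedy` helper from earlier:

```python
from collections import defaultdict

def sarsa_lambda(env, actions, num_episodes, lam=0.9,
                 alpha=0.1, gamma=1.0, epsilon=0.1):
    """Backward-view SARSA(lambda) with accumulating traces."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        E = defaultdict(float)            # eligibility traces, reset each episode
        state, done = env.reset(), False
        action = epsilon_greedy(Q, state, actions, epsilon)
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(Q, next_state, actions, epsilon)
            # One-step TD error delta_t.
            delta = reward + (0.0 if done else gamma * Q[(next_state, next_action)]) \
                    - Q[(state, action)]
            E[(state, action)] += 1.0     # bump the trace of the visited pair
            # Every traced (s, a) is nudged in proportion to its trace,
            # then all traces decay by gamma * lambda.
            for sa in E:
                Q[sa] += alpha * delta * E[sa]
                E[sa] *= gamma * lam
            state, action = next_state, next_action
    return Q
```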

Off-policy Learning

  • Evaluate the target policy $\pi(a|s)$ while following a different behaviour policy $\mu(a|s)$, i.e. the experience $\{S_1, A_1, R_2, \ldots, S_T\}$ is sampled from $\mu$
  • Why?
    • Learn about the environment through experiences of another agent - including possibly a human.
    • Re-use experiences from another policy.
    • Learn about optimal policy while following exploratory policy (to cover the state-action space better)
    • Learn about multiple policies while following one policy

Q-Learning: Off Policy Control

  • We allow both target and behaviour policies to improve.

  • The target policy $\pi$ is greedy w.r.t. $Q(s,a)$, and the behaviour policy $\mu$ is $\epsilon$-greedy w.r.t. $Q(s,a)$

  • This means,

    • When choosing what action to take, use the $\epsilon$-greedy behaviour policy $\mu$. This ensures exploration.
    • When making an update to the $Q$-value, use the greedy target policy for the TD target (take the max over all actions in the next state). This ensures optimality.
  • Initialize $Q(s,a)$ arbitrarily. For each episode, initialize $S$. Then, for each step of the episode (a code sketch follows this list):

    1. Sample an action $A$ from $S$ using the $\epsilon$-greedy behaviour policy $\mu$ derived from $Q$
    2. We take action $A$ and observe/sample from the environment the reward $R$ we get and the state $S'$ we end up in.
    3. Make this update, using $A' = \arg\max_{a'} Q(S', a')$ from our (greedy) target policy: $Q(S, A) \leftarrow Q(S, A) + \alpha\left(R + \gamma \max_{a'} Q(S', a') - Q(S, A)\right)$
    4. Set $S \leftarrow S'$ and repeat from step 1 for the new state in the next step.
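
A tabular Q-learning sketch under the same assumptions as the earlier sketches (dictionary-based `Q`, the `epsilon_greedy` helper, and a gym-like environment interface):

```python
from collections import defaultdict

def q_learning(env, actions, num_episodes, alpha=0.1, gamma=1.0, epsilon=0.1):
    """Off-policy TD control: behave eps-greedily (exploration), but
    bootstrap the TD target from the greedy action (optimality)."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            # Behaviour policy mu: eps-greedy w.r.t. Q.
            action = epsilon_greedy(Q, state, actions, epsilon)
            next_state, reward, done = env.step(action)
            # Target policy pi: greedy w.r.t. Q (max over next actions).
            best_next = max(Q[(next_state, a)] for a in actions)
            td_target = reward + (0.0 if done else gamma * best_next)
            Q[(state, action)] += alpha * (td_target - Q[(state, action)])
            state = next_state
    return Q
```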