- On-policy vs. off-policy: learning on the job vs. learning while following someone else’s policy
- In most real problems,
- MDP model is unknown - but experience can be sampled.
- MDP model is known - but is too big to use, except by sampling, so it doesn’t matter anyway.
- On-policy learning:
- Learn on the job
- Learn about policy $\pi$ from experience sampled from $\pi$ - sample actions from it while at the same time evaluating it.
- Off-policy learning:
- Looking over someone’s shoulder
- Learn about (evaluate) policy $\pi$ from experience sampled from another behaviour policy $\mu$.
On Policy Learning
- Revision: policy iteration is:
- Evaluate the policy’s value function.
- Improve the policy by acting greedily with respect to this value function (see the equation below).
- Repeat until convergence to the optimal policy.
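For reference, one standard way to write the greedy improvement step over the state-value function (not verbatim from these notes) is:

$$\pi'(s) = \operatorname*{argmax}_{a \in \mathcal{A}} \left( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \, v_\pi(s') \right)$$

Note that this requires the reward model $\mathcal{R}$ and the transition probabilities $\mathcal{P}$ - exactly what we do not have in the model-free setting discussed next.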
Monte Carlo Control
Monte-Carlo Evaluation
- Can we plug in Monte-Carlo to evaluate the value function and then iterate to improve the policy?
- Two problems with this:
- State-value function evaluation (such as via the Bellman expectation backup) needs state transition probabilities - the dynamics model of the MDP. We don’t have that in the model-free setting.
- The alternative is to use the action-value function: evaluate state-action pair values $Q(s, a)$ with Monte-Carlo, and then choose a greedy policy, which allows us to do control in a model-free way.
- But a greedy policy means we may not explore the entire state space - so we may get stuck and never see the states and actions that contribute to the correct estimate of the value function.
- So instead of being greedy, we choose to be $\epsilon$-greedy: we choose the greedy action most of the time, but with a small probability $\epsilon$, we choose a completely random action from the set of all actions (see the sketch after this list).
- This, asymptotically, guarantees we will explore all possible actions.
- Just like before in value iteration - we can do just a few steps of policy evaluation instead of running it all the way to convergence. What does that look like for Monte-Carlo?
- We improve the policy after each episode - each episode is treated as one round of policy evaluation.
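A minimal sketch of $\epsilon$-greedy action selection over a tabular action-value estimate; the names (`epsilon_greedy_action`, `q_values`, `rng`) are illustrative, not from the notes:

```python
import numpy as np

def epsilon_greedy_action(q_values, state, epsilon, rng):
    """Pick an action for `state` from a tabular Q estimate.

    With probability epsilon choose uniformly at random over all actions
    (exploration); otherwise choose the currently greedy action (exploitation).
    """
    num_actions = q_values.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(num_actions))   # explore: any action, uniformly
    return int(np.argmax(q_values[state]))      # exploit: greedy action

# Example: 5 states, 3 actions, epsilon = 0.1
rng = np.random.default_rng(seed=0)
q_values = np.zeros((5, 3))
action = epsilon_greedy_action(q_values, state=2, epsilon=0.1, rng=rng)
```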
GLIE (Greedy in the Limit with Infinite Exploration)
- GLIE entails 2 conditions:
- You make sure that, asymptotically, all state-action pairs are explored infinitely many times.
- Eventually, the policy converges to a greedy (not just $\epsilon$-greedy) policy.
- One way to achieve this is to decay $\epsilon$ to $0$ with some schedule, say $\epsilon_k = \frac{1}{k}$, $k$ being the number of episodes sampled.
GLIE Monte Carlo Control
- Sample the $k$-th episode using the current policy: $\{S_1, A_1, R_2, \dots, S_T\} \sim \pi$.
- Incrementally update the state-action value-function estimate (a running mean) for each state-action pair in the sampled episode: $N(S_t, A_t) \leftarrow N(S_t, A_t) + 1$ and $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \frac{1}{N(S_t, A_t)} \left( G_t - Q(S_t, A_t) \right)$.
- Let the new policy be $\epsilon$-greedy on this estimated action-value function, with $\epsilon = \frac{1}{k}$.
- Your policy will eventually converge to the optimal one, with $Q(s, a) \to q_*(s, a)$ (a code sketch follows below).
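A hedged sketch of this loop, assuming a hypothetical episodic environment `env` with `reset()`, `step(action)` returning `(next_state, reward, done)`, and a list `env.actions`; none of these names come from the notes:

```python
import numpy as np
from collections import defaultdict

def glie_mc_control(env, num_episodes, gamma=1.0, seed=0):
    """GLIE Monte-Carlo control with an every-visit incremental mean update."""
    rng = np.random.default_rng(seed)
    Q = defaultdict(float)   # Q[(state, action)] -> action-value estimate
    N = defaultdict(int)     # N[(state, action)] -> visit count

    def epsilon_greedy(state, epsilon):
        if rng.random() < epsilon:
            return env.actions[rng.integers(len(env.actions))]
        return max(env.actions, key=lambda a: Q[(state, a)])

    for k in range(1, num_episodes + 1):
        epsilon = 1.0 / k                          # GLIE schedule: epsilon_k = 1/k
        # 1. Sample one episode with the current epsilon-greedy policy
        episode, state, done = [], env.reset(), False
        while not done:
            action = epsilon_greedy(state, epsilon)
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state
        # 2. Incrementally update Q towards the sampled return G_t
        G = 0.0
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            N[(state, action)] += 1
            Q[(state, action)] += (G - Q[(state, action)]) / N[(state, action)]
    return Q
```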
TD Control
TD v/s MC
- Advantages:
- TD has lower variance
- Online
- Incomplete sequences
- So, the natural idea then is:
- Apply TD to $Q(S, A)$
- Use $\epsilon$-greedy policy improvement
- Update every time step instead of every episode (the resulting updates are written out below).
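Concretely, the only change from the Monte-Carlo update is the target (standard formulas, written out here for reference rather than taken verbatim from the notes):

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big( G_t - Q(S_t, A_t) \big) \quad \text{(MC)}$$

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big( R_{t+1} + \gamma \, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \big) \quad \text{(SARSA)}$$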
SARSA: On-Policy Control
- Initialize $Q(s, a)$ arbitrarily. For each episode, initialize $S$ and choose $A$ from $S$ (using the $\epsilon$-greedy policy). Then, for each step of the episode:
- We start with a state-action pair $(S, A)$ (no sampling from the environment yet).
- We want to update $Q(S, A)$ under our $\epsilon$-greedy policy.
- We take action $A$ and observe/sample from the environment the reward $R$ we get and the state $S'$ we end up in.
- Sample from our own policy (hence on-policy) the next action $A'$.
- Make this update: $Q(S, A) \leftarrow Q(S, A) + \alpha \left( R + \gamma \, Q(S', A') - Q(S, A) \right)$.
- Repeat from the first step with the new pair $(S', A')$ in the next step.
- SARSA converges to the optimal policy if:
- GLIE sequence of policies
- Robbins-Monro sequence of step-sizes $\alpha_t$: $\sum_{t=1}^{\infty} \alpha_t = \infty$ and $\sum_{t=1}^{\infty} \alpha_t^2 < \infty$ - the steps stay large enough to keep learning, but eventually become small enough to converge.
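A minimal sketch of one tabular SARSA episode following the steps above, again assuming a hypothetical Gym-style `env` with `reset()`, `step(action)` and `env.actions` (all names are illustrative):

```python
import numpy as np
from collections import defaultdict

def sarsa_episode(env, Q=None, alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    """Run one episode of on-policy SARSA, updating the tabular estimate Q in place."""
    rng = np.random.default_rng(seed)
    Q = Q if Q is not None else defaultdict(float)   # Q[(state, action)] -> value

    def epsilon_greedy(state):
        if rng.random() < epsilon:
            return env.actions[rng.integers(len(env.actions))]
        return max(env.actions, key=lambda a: Q[(state, a)])

    state = env.reset()
    action = epsilon_greedy(state)                    # start from a pair (S, A)
    done = False
    while not done:
        next_state, reward, done = env.step(action)   # observe R and S'
        next_action = epsilon_greedy(next_state)      # sample A' from the same policy (on-policy)
        td_target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
        Q[(state, action)] += alpha * (td_target - Q[(state, action)])
        state, action = next_state, next_action       # shift to (S', A') for the next step
    return Q
```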
$n$-step SARSA
Same as $n$-step TD but with state-action value functions.
- $n = 1$ gives one-step SARSA, $n = \infty$ recovers Monte-Carlo, …and so on for the returns in between.
- $n$-step Q-return: $q_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n Q(S_{t+n}, A_{t+n})$
- $n$-step SARSA update: $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( q_t^{(n)} - Q(S_t, A_t) \right)$
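A small illustrative helper for computing the $n$-step Q-return from a recorded trajectory; the function name and argument layout are assumptions, not from the notes:

```python
def n_step_q_return(rewards, q_values, t, n, gamma):
    """Compute q_t^(n) for a recorded trajectory.

    rewards[i]  : reward R_{i+1} observed after time step i
    q_values[i] : current estimate Q(S_i, A_i)
    Falls back to the plain Monte-Carlo return if t + n runs past episode end.
    """
    T = len(rewards)
    steps = min(n, T - t)
    ret = sum(gamma ** k * rewards[t + k] for k in range(steps))
    if t + n < len(q_values):          # bootstrap with Q(S_{t+n}, A_{t+n}) if available
        ret += gamma ** n * q_values[t + n]
    return ret
```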
SARSA($\lambda$)
- Same as TD($\lambda$) - a weighted sum of all $n$-step returns: $q_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} q_t^{(n)}$.
- Again, to keep it online, we can’t do forward-view SARSA($\lambda$), because that requires going to the end of the episode, which may not exist.
- Again, we build eligibility traces (one per state-action pair) to do a backward view (see the sketch below).
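A sketch of the backward-view update with accumulating eligibility traces, meant to be called once per time step inside a SARSA episode loop; tabular, dict-based, and purely illustrative:

```python
def sarsa_lambda_step(Q, E, s, a, r, s_next, a_next, done,
                      alpha=0.1, gamma=0.99, lam=0.9):
    """One backward-view SARSA(lambda) update.

    Q, E are dicts mapping (state, action) -> value estimate / eligibility trace.
    E should be reset to an empty dict at the start of every episode.
    """
    # TD error for the transition just observed
    q_next = 0.0 if done else Q.get((s_next, a_next), 0.0)
    delta = r + gamma * q_next - Q.get((s, a), 0.0)

    # Accumulating trace: bump the pair that was just visited
    E[(s, a)] = E.get((s, a), 0.0) + 1.0

    # Spread the TD error over all visited pairs, in proportion to their traces
    for key in list(E):
        Q[key] = Q.get(key, 0.0) + alpha * delta * E[key]
        E[key] *= gamma * lam          # decay every trace
    return Q, E
```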
Off-policy Learning
- Evaluate target policy $\pi(a \mid s)$, computing $v_\pi(s)$ or $q_\pi(s, a)$, while following a behaviour policy $\mu(a \mid s)$.
- Why?
- Learn about the environment through experiences of another agent - including possibly a human.
- Re-use experiences from another policy.
- Learn about optimal policy while following exploratory policy (to cover the state-action space better)
- Learn about multiple policies while following one policy
Q-Learning: Off-Policy Control
- We allow both the target and behaviour policies to improve.
- The target policy $\pi$ is greedy w.r.t. $Q(s, a)$, and the behaviour policy $\mu$ is $\epsilon$-greedy w.r.t. $Q(s, a)$.
- This means:
- When choosing what action to take, use the $\epsilon$-greedy behaviour policy $\mu$. This ensures exploration.
- When making an update to the $Q$-value, use the greedy target policy $\pi$ for the TD target (choose the max over all actions $a'$ in $S'$). This ensures optimality.
- Initialize $Q(s, a)$ arbitrarily. For each episode, initialize $S$. Then, for each step of the episode:
- Sample an action $A$ for $S$ using the $\epsilon$-greedy behaviour policy $\mu$.
- We take action $A$ and observe/sample from the environment the reward $R$ we get and the state $S'$ we end up in.
- Make this update, using the successor action given by the greedy target policy $\pi$ (i.e. the max): $Q(S, A) \leftarrow Q(S, A) + \alpha \left( R + \gamma \max_{a'} Q(S', a') - Q(S, A) \right)$.
- Repeat from the first step with the new state $S'$ in the next step.
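A minimal tabular sketch mirroring the steps above, again assuming a hypothetical Gym-style `env` with `reset()`, `step(action)` and an `env.actions` list (all names are assumptions, not from the notes):

```python
import numpy as np
from collections import defaultdict

def q_learning_episode(env, Q=None, alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    """Run one episode of off-policy Q-learning, updating the tabular estimate Q."""
    rng = np.random.default_rng(seed)
    Q = Q if Q is not None else defaultdict(float)   # Q[(state, action)] -> value

    def behaviour_policy(state):
        # mu: epsilon-greedy w.r.t. Q -> keeps exploring
        if rng.random() < epsilon:
            return env.actions[rng.integers(len(env.actions))]
        return max(env.actions, key=lambda a: Q[(state, a)])

    state, done = env.reset(), False
    while not done:
        action = behaviour_policy(state)              # act with the behaviour policy mu
        next_state, reward, done = env.step(action)   # observe R and S'
        # pi (greedy target policy) only appears inside the TD target as a max over a'
        best_next = 0.0 if done else max(Q[(next_state, a)] for a in env.actions)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state
    return Q
```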