• On policy vs Off policy: Learning on the job vs learning while using someone else’s policy
  • In most real problems,
    • MDP model is unknown - but experience can be sampled.
    • MDP model is known - but is too big to use, except by sampling, so it doesn’t matter anyway.
  • On policy learning:
    • Learn on the job
    • Learn about policy $\pi$ from experience sampled from $\pi$ - sample actions from the policy while at the same time evaluating it.
  • Off policy learning:
    • Looking over someone’s shoulder
    • Learn about (evaluate) the target policy $\pi$ from experience sampled from another behaviour policy $\mu$

On Policy Learning

  • Revision: Policy iteration is
    1. Evaluate the current policy's value function
    2. Improve the policy by acting greedily with respect to this value function
    3. Repeat until convergence to the optimal policy
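
A minimal sketch of this loop for a tiny, randomly generated MDP, just to make the evaluate/improve/repeat structure concrete. The `P` and `R` arrays are hypothetical placeholders standing in for a known model (this is the planning setting being revised, not the model-free setting this section is about):

```python
import numpy as np

# Policy iteration sketch for a tiny *known* MDP.
# P[a][s][s'] = transition probability, R[a][s] = expected reward; both are
# hypothetical placeholders for a real model.
n_states, n_actions, gamma = 4, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # random dynamics
R = rng.normal(size=(n_actions, n_states))                        # random rewards

policy = np.zeros(n_states, dtype=int)
while True:
    # 1. Policy evaluation: solve V = R_pi + gamma * P_pi @ V exactly.
    P_pi = P[policy, np.arange(n_states)]
    R_pi = R[policy, np.arange(n_states)]
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
    # 2. Policy improvement: act greedily w.r.t. the evaluated value function.
    Q = R + gamma * P @ V
    new_policy = Q.argmax(axis=0)
    # 3. Repeat until the policy stops changing.
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy

print("greedy policy per state:", policy)
```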

Monte Carlo Control

Monte-Carlo Evaluation

  • Can we plug in Monte-Carlo to evaluate value function and then iterate and improve the policy?
  • Two problems with this:
    • Greedy policy improvement over the state-value function $V(s)$ (such as via the Bellman expectation backup) needs the state transition probabilities - the dynamics model of the MDP - because the greedy action is $\arg\max_a \big[\mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'} V(s')\big]$. We don't have that model in the model-free setting.
      • The alternative is to use the action-value function: evaluate $Q(s,a)$ with Monte Carlo for state-action pairs, and then choose the greedy policy $\pi'(s) = \arg\max_a Q(s,a)$, which allows us to do control in a model-free way.
    • But, greedy policy means we may not explore the entire state space - so we may get stuck and not see the states and actions that contribute to the correct estimate of the value function.
      • So instead of being greedy, we choose to be $\epsilon$-greedy. We choose the greedy option most of the time, but with a small probability $\epsilon$, we choose an action uniformly at random from the set of all actions (a minimal sketch follows this list).
      • This, asymptotically, guarantees we will explore all possible actions.
  • Just like before in value iteration - we can do just a partial policy evaluation instead of running it all the way to convergence. What does that look like for Monte Carlo?
    • We improve the policy after each episode - each episode is considered one round of policy evaluation.
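
Here is a minimal sketch of $\epsilon$-greedy action selection over a tabular action-value function; the dictionary-based `Q` and the helper name are illustrative choices, not anything prescribed by the notes:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    """Pick the greedy action w.r.t. Q most of the time, and a uniformly
    random action with probability epsilon.  Q is assumed to be a dict
    keyed by (state, action) pairs."""
    if random.random() < epsilon:
        return random.choice(actions)                          # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit
```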

GLIE (Greedy in the Limit with Infinite Exploration)

  • GLIE entails 2 conditions:
    • You make sure that asymptotically, all state-action pairs are explored infinitely many times.
    • Eventually, the policy converges on a greedy (not just $\epsilon$-greedy) policy
  • One way to achieve this is to decay $\epsilon_k$ to $0$ with some schedule, say $\epsilon_k = \frac{1}{k}$, $k$ being the number of episodes sampled.

GLIE Monte Carlo Control

  • Sample the $k$-th episode using the current policy $\pi$: $\{S_1, A_1, R_2, \ldots, S_T\} \sim \pi$
  • Incrementally update the running-mean estimate of the action-value function for each state-action pair $(S_t, A_t)$ in the sampled episode: $N(S_t, A_t) \leftarrow N(S_t, A_t) + 1$, then $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \frac{1}{N(S_t, A_t)}\big(G_t - Q(S_t, A_t)\big)$
  • Let the new policy be $\epsilon$-greedy on this estimated action-value function, with $\epsilon = \frac{1}{k}$
  • Your policy will eventually converge to the optimal policy, and $Q(s, a) \to q_*(s, a)$
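
A sketch of the full GLIE Monte-Carlo control loop, reusing the `epsilon_greedy` helper from above. The gym-like `env.reset()` / `env.step()` interface (returning `next_state, reward, done`) is an assumption made for illustration:

```python
from collections import defaultdict

def glie_mc_control(env, actions, num_episodes, gamma=1.0):
    """GLIE Monte-Carlo control sketch (every-visit, incremental means)."""
    Q = defaultdict(float)   # Q[(s, a)] -> running mean of observed returns
    N = defaultdict(int)     # N[(s, a)] -> visit count
    for k in range(1, num_episodes + 1):
        epsilon = 1.0 / k                       # GLIE schedule: eps_k = 1/k
        # Sample the k-th episode with the current eps-greedy policy.
        episode, state, done = [], env.reset(), False
        while not done:
            action = epsilon_greedy(Q, state, actions, epsilon)
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state
        # One backward pass gives every return G_t; update the running means.
        G = 0.0
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            N[(state, action)] += 1
            Q[(state, action)] += (G - Q[(state, action)]) / N[(state, action)]
    return Q
```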

TD Control

TD v/s MC

  • Advantages of TD over MC:
    • TD has lower variance
    • TD can learn online, at every step
    • TD can learn from incomplete sequences
  • So, the natural idea then is:
    • Apply TD to $Q(S, A)$
    • Use $\epsilon$-greedy policy improvement
    • Update every time step instead of every episode.

SARSA: On-Policy Control

  • Initialize $Q(s,a)$ arbitrarily. For each episode, initialize $S$ and choose $A$ from $S$ using the $\epsilon$-greedy policy derived from $Q$. Then, for each step of the episode (a code sketch follows this list):
    1. We start with a state-action pair $(S, A)$ (no sampling yet)
    2. We want to update $Q(S, A)$ under our $\epsilon$-greedy policy.
    3. We take action $A$ and observe/sample from the environment the reward $R$ we get and the state $S'$ we end up in.
    4. Sample from our own policy (hence on-policy) the next action $A'$
    5. Make this update: $Q(S, A) \leftarrow Q(S, A) + \alpha\left(R + \gamma\, Q(S', A') - Q(S, A)\right)$
    6. Set $S \leftarrow S'$ and $A \leftarrow A'$, then repeat from step 1 for the new pair in the next step.
  • SARSA converges to the optimal action-value function, $Q(s,a) \to q_*(s,a)$, if we use:
    • a GLIE sequence of policies, and
    • a Robbins-Monro sequence of step-sizes $\alpha_t$: $\sum_{t=1}^{\infty} \alpha_t = \infty$ (not too small, so the estimates can move arbitrarily far) and $\sum_{t=1}^{\infty} \alpha_t^2 < \infty$ (not too large, so they eventually settle).
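
A tabular SARSA sketch under the same assumed environment interface and `epsilon_greedy` helper as above; the fixed `alpha` and `epsilon` are simplifications (GLIE and Robbins-Monro schedules would decay them):

```python
from collections import defaultdict

def sarsa(env, actions, num_episodes, alpha=0.1, gamma=1.0, epsilon=0.1):
    """On-policy TD control: the action used in the TD target is sampled
    from the same eps-greedy policy that is being followed."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        state, done = env.reset(), False
        action = epsilon_greedy(Q, state, actions, epsilon)
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(Q, next_state, actions, epsilon)
            td_target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
            Q[(state, action)] += alpha * (td_target - Q[(state, action)])
            state, action = next_state, next_action
    return Q
```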

$n$-step SARSA

Same as $n$-step TD but with action-value functions $Q(s,a)$.

  • $n = 1$ (SARSA): $q^{(1)}_t = R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1})$
  • $n = 2$: $q^{(2)}_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 Q(S_{t+2}, A_{t+2})$
  • …and so on, up to $n = \infty$ (MC): $q^{(\infty)}_t = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{T-t-1} R_T$
  • $n$-step return: $q^{(n)}_t = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{n-1} R_{t+n} + \gamma^n Q(S_{t+n}, A_{t+n})$
  • $n$-step SARSA learning: $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\left(q^{(n)}_t - Q(S_t, A_t)\right)$
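
A small illustrative helper for the $n$-step target; the names and the assumption that at least $n$ more steps of the episode are available are choices of this sketch:

```python
def n_step_sarsa_target(rewards, Q, bootstrap_sa, n, gamma=1.0):
    """n-step return: discounted sum of the next n rewards
    [R_{t+1}, ..., R_{t+n}] plus a bootstrap from Q(S_{t+n}, A_{t+n})."""
    g = sum(gamma**i * r for i, r in enumerate(rewards[:n]))
    return g + gamma**n * Q[bootstrap_sa]
```

The $n$-step SARSA update then moves $Q(S_t, A_t)$ a fraction $\alpha$ of the way towards this target.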

SARSA($\lambda$)

  • Same as TD($\lambda$) - use a weighted sum of all $n$-step returns as the target: $q^{\lambda}_t = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} q^{(n)}_t$.
  • Again, to keep it online, we can't do forward-view SARSA($\lambda$), because that needs us to go to the end of the episode, which may not exist.
    • Again, we build eligibility traces $E_t(s,a)$, one per state-action pair, to do a backward view: every $Q(s,a)$ is updated in proportion to the one-step TD error $\delta_t$ and its trace, $Q(s,a) \leftarrow Q(s,a) + \alpha\,\delta_t\,E_t(s,a)$, where $E_t(s,a) = \gamma\lambda\, E_{t-1}(s,a) + \mathbf{1}(S_t = s, A_t = a)$.
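
A backward-view SARSA($\lambda$) sketch with accumulating eligibility traces, again under the assumed `env` interface and the `epsilon_greedy` helper from earlier:

```python
from collections import defaultdict

def sarsa_lambda(env, actions, num_episodes, lam=0.9,
                 alpha=0.1, gamma=1.0, epsilon=0.1):
    """Backward-view SARSA(lambda) with accumulating traces."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        E = defaultdict(float)            # eligibility traces, reset each episode
        state, done = env.reset(), False
        action = epsilon_greedy(Q, state, actions, epsilon)
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(Q, next_state, actions, epsilon)
            # One-step TD error delta_t.
            delta = reward + (0.0 if done else gamma * Q[(next_state, next_action)]) \
                    - Q[(state, action)]
            E[(state, action)] += 1.0     # bump the trace of the visited pair
            # Every traced (s, a) is nudged in proportion to its trace,
            # then all traces decay by gamma * lambda.
            for sa in E:
                Q[sa] += alpha * delta * E[sa]
                E[sa] *= gamma * lam
            state, action = next_state, next_action
    return Q
```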

Off-policy Learning

  • Evaluate the target policy $\pi(a|s)$ while following a different behaviour policy $\mu(a|s)$, i.e. the experience $\{S_1, A_1, R_2, \ldots, S_T\}$ is sampled from $\mu$
  • Why?
    • Learn about the environment through experiences of another agent - including possibly a human.
    • Re-use experiences from another policy.
    • Learn about optimal policy while following exploratory policy (to cover the state-action space better)
    • Learn about multiple policies while following one policy

Q-Learning: Off Policy Control

  • We allow both target and behaviour policies to improve.

  • The target policy $\pi$ is greedy w.r.t. $Q(s,a)$, and the behaviour policy $\mu$ is $\epsilon$-greedy w.r.t. $Q(s,a)$

  • This means,

    • When choosing what action to take, use the $\epsilon$-greedy behaviour policy $\mu$. This ensures exploration.
    • When making an update to the $Q$-value, use the greedy target policy for the TD target (take the max over all actions in the next state). This ensures optimality.
  • Initialize $Q(s,a)$ arbitrarily. For each episode, initialize $S$. Then, for each step of the episode (a code sketch follows this list):

    1. Sample an action $A$ from $S$ using the $\epsilon$-greedy behaviour policy $\mu$ derived from $Q$
    2. We take action $A$ and observe/sample from the environment the reward $R$ we get and the state $S'$ we end up in.
    3. Make this update, using $A' = \arg\max_{a'} Q(S', a')$ from our (greedy) target policy: $Q(S, A) \leftarrow Q(S, A) + \alpha\left(R + \gamma \max_{a'} Q(S', a') - Q(S, A)\right)$
    4. Set $S \leftarrow S'$ and repeat from step 1 for the new state in the next step.
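
A tabular Q-learning sketch under the same assumptions as the earlier sketches (dictionary-based `Q`, the `epsilon_greedy` helper, and a gym-like environment interface):

```python
from collections import defaultdict

def q_learning(env, actions, num_episodes, alpha=0.1, gamma=1.0, epsilon=0.1):
    """Off-policy TD control: behave eps-greedily (exploration), but
    bootstrap the TD target from the greedy action (optimality)."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            # Behaviour policy mu: eps-greedy w.r.t. Q.
            action = epsilon_greedy(Q, state, actions, epsilon)
            next_state, reward, done = env.step(action)
            # Target policy pi: greedy w.r.t. Q (max over next actions).
            best_next = max(Q[(next_state, a)] for a in actions)
            td_target = reward + (0.0 if done else gamma * best_next)
            Q[(state, action)] += alpha * (td_target - Q[(state, action)])
            state = next_state
    return Q
```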