• In the last lecture:
    • We approximated the state-value or action-value function with some parametric function approximator, and then used an $\epsilon$-greedy policy over the approximate values for control.
  • In this lecture:
    • We directly parametrize the policy and approximate it with some function approximator, i.e. $\pi_\theta(s, a) = \mathbb{P}[a \mid s, \theta]$.
    • Once again, focussing on model-free RL.
  • Why policy based RL though:
    • Better convergence properties.
    • Effective in high-dimensional or continuous action spaces - think of a 6-DOF robot arm deciding how to move in 3D space.
    • Can learn stochastic policies - which is sometimes necessary - think of rock-paper-scissors: the policy can't be deterministic, else we will be exploited and lose.
  • But, at the same time, policy based RL:
    • Typically converges to a local rather than a global optimum (especially with naive methods).
    • Evaluating a policy is typically inefficient and high-variance.
  • Keep in mind that value-based methods learn a near-deterministic policy - either greedy or $\epsilon$-greedy.
    • A stochastic policy can reach states that such a near-deterministic, value-based policy would reach only rarely, or only after a very long time.
    • In the case of a POMDP, where we do not have access to Markov states, a stochastic policy is sometimes the best we can do.

Policy Objective Functions

  • What’s the goal?
    • For episodic tasks: maximize the value of the start state.
    • For continuing tasks: maximize the average value, i.e. the value of each state weighted by how often the policy visits it.
    • Or, maximize the average reward per time step - we want a policy that gets us the most reward per step (these three objectives are written out below).
  • Policy based RL is an optimization problem - we are trying to maximize some objective $J(\theta)$.
    • If a gradient is available, gradient-based methods typically offer better efficiency than gradient-free methods.
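  • For reference, the standard forms of these three objectives (using $d^{\pi_\theta}(s)$ for the stationary distribution of the Markov chain induced by $\pi_\theta$) are:

$$
J_1(\theta) = V^{\pi_\theta}(s_1), \qquad
J_{avV}(\theta) = \sum_s d^{\pi_\theta}(s)\, V^{\pi_\theta}(s), \qquad
J_{avR}(\theta) = \sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(s, a)\, \mathcal{R}_s^a
$$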

Policy Gradient Ascent

  • Since our objective is a total reward rather than a loss, we want to maximize it - hence gradient ascent.
  • The maths stays the same as for gradient descent except for the sign: here we move in the direction of the gradient (update below).
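  • In symbols, with step-size $\alpha$ and $\nabla_\theta J(\theta)$ the vector of partial derivatives:

$$
\Delta\theta = \alpha\, \nabla_\theta J(\theta), \qquad
\nabla_\theta J(\theta) = \left( \frac{\partial J(\theta)}{\partial \theta_1}, \ldots, \frac{\partial J(\theta)}{\partial \theta_n} \right)^{\top}
$$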

Computing the Gradient by Finite Differences

  • In each dimension, perturb the parameter by a small amount $\epsilon$; the resulting difference in the objective gives an estimate of the gradient in that dimension.
  • Very computationally expensive - though simple.
  • Useful even for non-differentiable policies.
  • Key idea - move a knob a little, see how much and in what direction it changes output, that gives you an estimate of the gradient.
  • For each dimension $k$, estimate the partial derivative by perturbing $\theta$ by a small $\epsilon$ in dimension $k$ ($u_k$ being a one-hot unit vector in dimension $k$): $\frac{\partial J(\theta)}{\partial \theta_k} \approx \frac{J(\theta + \epsilon u_k) - J(\theta)}{\epsilon}$
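  • A minimal sketch of this estimator in Python, assuming `J(theta)` is some routine that evaluates the policy with parameters `theta` (e.g. by averaging returns over a few rollouts); the names here are illustrative:

```python
import numpy as np

def finite_difference_gradient(J, theta, eps=1e-4):
    """Estimate the gradient of a (possibly non-differentiable) objective J
    at parameter vector theta, one dimension at a time."""
    grad = np.zeros_like(theta)
    j_base = J(theta)                  # unperturbed objective value
    for k in range(len(theta)):
        u_k = np.zeros_like(theta)     # one-hot unit vector for dimension k
        u_k[k] = 1.0
        grad[k] = (J(theta + eps * u_k) - j_base) / eps
    return grad
```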

Score Function

  • We want to compute the policy gradient analytically.
  • Assume the policy is differentiable whenever it is non-zero.
  • We assume we know the gradient $\nabla_\theta \pi_\theta(s, a)$ (i.e. we can compute it).
  • Multiplying and dividing by the policy, and noting that $\nabla_\theta \log \pi_\theta(s, a) = \frac{\nabla_\theta \pi_\theta(s, a)}{\pi_\theta(s, a)}$, we get $\nabla_\theta \pi_\theta(s, a) = \pi_\theta(s, a)\, \nabla_\theta \log \pi_\theta(s, a)$.
  • We call $\nabla_\theta \log \pi_\theta(s, a)$ the score function.
    • Note how $\log \pi_\theta(s, a)$ is the $\log$ of a density - the log likelihood - exactly what we maximize in Maximum Likelihood Estimation.
    • So the score is the gradient of the log likelihood.
  • See slides for some examples.
  • But the key intuition is: the score function typically takes the form $\phi(s, a) - \mathbb{E}_{\pi_\theta}[\phi(s, \cdot)]$ (e.g. for a softmax policy with features $\phi$). This tells us “how much more than average am I doing something” - i.e. what to do more of and what to do less of.
  • But what do we do with this score? Taking the gradient of an expectation of some quantity (say a reward) converts it into an expectation of the score times that quantity - still an expectation under the policy, which we can sample easily and therefore estimate easily (see the identity below).
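  • Concretely, for any function $f(s, a)$ of states and actions and a fixed state distribution (the multi-step case is handled by the policy gradient theorem later):

$$
\nabla_\theta\, \mathbb{E}_{\pi_\theta}\!\left[ f(s, a) \right] = \mathbb{E}_{\pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(s, a)\, f(s, a) \right]
$$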

One-Step MDP Policy Gradient

  • Start in some state $s \sim d(s)$.
  • Terminate after one step with reward $r = \mathcal{R}_{s, a}$.
  • What’s the policy gradient?
    • The objective function is the expected one-step reward, $J(\theta) = \mathbb{E}_{\pi_\theta}[r]$ (worked out below).
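  • Writing the expectation out and applying the likelihood-ratio trick:

$$
J(\theta) = \mathbb{E}_{\pi_\theta}[r] = \sum_s d(s) \sum_a \pi_\theta(s, a)\, \mathcal{R}_{s, a}
$$

$$
\nabla_\theta J(\theta) = \sum_s d(s) \sum_a \pi_\theta(s, a)\, \nabla_\theta \log \pi_\theta(s, a)\, \mathcal{R}_{s, a} = \mathbb{E}_{\pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(s, a)\, r \right]
$$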

Policy Gradient Theorem

  • Replacing the one-step reward $r$ with the action-value function $Q^{\pi_\theta}(s, a)$ gives us the gradient of the policy objective function for multi-step MDPs.
  • $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(s, a)\; Q^{\pi_\theta}(s, a) \right]$
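  • Sampling this expectation, with the return $G_t$ as an unbiased sample of $Q^{\pi_\theta}(s_t, a_t)$, gives Monte Carlo policy gradient (REINFORCE). A minimal sketch with a linear softmax policy; the `env` interface (`reset()`, `step(a)` returning `(next_state, reward, done)`) and the `features(s)` function returning a (num_actions, num_features) matrix are assumptions for illustration:

```python
import numpy as np

def softmax_probs(theta, phi):
    """Action probabilities of a linear softmax policy; phi is the
    (num_actions, num_features) feature matrix for one state."""
    prefs = phi @ theta
    prefs -= prefs.max()                      # numerical stability
    p = np.exp(prefs)
    return p / p.sum()

def reinforce_update(env, features, theta, alpha=0.01, gamma=1.0):
    """Run one episode, then step theta along G_t * grad log pi(a_t | s_t)."""
    trajectory, s, done = [], env.reset(), False
    while not done:
        probs = softmax_probs(theta, features(s))
        a = np.random.choice(len(probs), p=probs)
        s_next, r, done = env.step(a)
        trajectory.append((s, a, r))
        s = s_next
    G = 0.0
    for s, a, r in reversed(trajectory):      # accumulate returns backwards
        G = r + gamma * G
        phi = features(s)
        score = phi[a] - softmax_probs(theta, phi) @ phi   # grad log pi (softmax)
        theta = theta + alpha * G * score
    return theta
```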

Actor Critic Policy Gradient

  • Key idea: in Monte Carlo Policy Gradient, the sampled returns used as estimates of the action-value function can have a lot of variance from episode to episode, so the gradient updates are noisy. Let's replace them with a parameterized approximation of the action-value function that we maintain separately.
  • So now we have two sets of parameters:
    • Critic: updates the action-value function parameters $w$, i.e. maintains an approximation $Q_w(s, a) \approx Q^{\pi_\theta}(s, a)$.
    • Actor: updates the policy parameters $\theta$, in the direction suggested by the critic via that action-value approximation (updates below).
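  • Schematically, with actor step-size $\alpha$, each step uses:

$$
Q_w(s, a) \approx Q^{\pi_\theta}(s, a), \qquad \Delta\theta = \alpha\, \nabla_\theta \log \pi_\theta(s, a)\, Q_w(s, a)
$$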

Critic: Estimating the Action-Value Function

  • This is the familiar problem of policy evaluation.
  • Can use Monte Carlo, TD(0), or TD($\lambda$).
  • See this as another instance of Generalized Policy Improvement (GPI)
    • Start with an arbitrary policy and value function (a combined actor-critic sketch follows below).
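  • A minimal action-value actor-critic (QAC) sketch with a linear softmax actor and a linear TD(0) critic; `env` and `features` follow the same assumed interface as in the REINFORCE sketch above, and all names are illustrative:

```python
import numpy as np

def qac(env, features, num_features, episodes=500,
        alpha=0.01, beta=0.1, gamma=0.99):
    """Action-value actor-critic: linear softmax actor pi_theta and
    linear critic Q_w(s, a) = w . phi(s, a), updated by TD(0)."""
    theta = np.zeros(num_features)            # actor parameters
    w = np.zeros(num_features)                # critic parameters

    def probs(s):
        prefs = features(s) @ theta
        prefs -= prefs.max()                  # numerical stability
        p = np.exp(prefs)
        return p / p.sum()

    for _ in range(episodes):
        s, done = env.reset(), False
        p = probs(s)
        a = np.random.choice(len(p), p=p)
        while not done:
            s_next, r, done = env.step(a)
            phi_sa = features(s)[a]
            q_sa = w @ phi_sa                 # critic estimate Q_w(s, a)
            if done:
                td_target = r
            else:
                p_next = probs(s_next)
                a_next = np.random.choice(len(p_next), p=p_next)
                td_target = r + gamma * w @ features(s_next)[a_next]
            # Actor: move theta along score(s, a) * Q_w(s, a)
            score = phi_sa - probs(s) @ features(s)
            theta += alpha * score * q_sa
            # Critic: TD(0) step towards the one-step target
            w += beta * (td_target - q_sa) * phi_sa
            if not done:
                s, a = s_next, a_next
    return theta, w
```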