In the last lecture, we approximated the state-value or action-value function with some parametric function, and then used ϵ-greedy action selection over those values to generate a policy for control.
In this lecture:
We directly parametrize the policy with a function approximator, i.e.:
πθ(s,a)=P[a∣s,θ]
Once again, focussing on model-free RL.
Why policy based RL though:
Better convergence properties - following the gradient makes small, smooth changes to the policy rather than the drastic jumps a greedy improvement step can make.
Effective in high-dimensional or continuous action spaces - think of a 6-DOF robot deciding how to move in 3D space.
Can learn stochastic policies - which is necessary at times - think of rock-paper-scissors: a deterministic policy would be exploited and we would lose.
But, at the same time, policy based RL:
Typically (in its naive form) converges to a local rather than a global optimum.
Evaluating a policy is typically inefficient and has high variance.
Keep in mind that value based methods learn a near-deterministic policy - either greedy or ϵ-greedy.
A stochastic policy can reach states, with high probability, that a value based policy would never reach (or would take a very long time to reach).
In the case of a POMDP, where we do not have access to Markov states, a stochastic policy can itself be the optimal thing to do.
Policy Objective Functions
What’s the goal?
For episodic tasks: maximize the value of the start state.
For continuing tasks: maximize the average value - the value of each state weighted by the probability of being in that state under the policy, summed over all states.
Or, we could use the average reward per time step - we want a policy that gets us the most reward per step.
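Written out (with $d^{\pi_\theta}(s)$ denoting the stationary distribution of the Markov chain under $\pi_\theta$), these three objectives are usually stated as:

$$J_1(\theta) = V^{\pi_\theta}(s_1), \qquad J_{avV}(\theta) = \sum_{s} d^{\pi_\theta}(s)\, V^{\pi_\theta}(s), \qquad J_{avR}(\theta) = \sum_{s} d^{\pi_\theta}(s) \sum_{a} \pi_\theta(s,a)\, \mathcal{R}_{s,a}$$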
Policy based RL is an optimization problem - we are trying to maximize some objective J(θ).
If a gradient is available, gradient based methods are usually more efficient than gradient-free optimization methods.
Policy Gradient Ascent
Since our objective J(θ) is a total reward rather than a loss, we want to maximize it - hence gradient ascent.
The maths is the same as for gradient descent except for the sign - here we move in the direction of the gradient.
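Concretely, each update steps the parameters a small amount α in the direction of the policy gradient:

$$\Delta\theta = \alpha\, \nabla_\theta J(\theta), \qquad \nabla_\theta J(\theta) = \left( \frac{\partial J(\theta)}{\partial \theta_1}, \ldots, \frac{\partial J(\theta)}{\partial \theta_n} \right)^{\top}$$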
Computing the Gradient by Finite Differences
For each dimension, perturb the parameters by a small amount ϵ in that dimension; the resulting change in the objective gives an estimate of the gradient along that dimension.
Very computationally expensive - though simple.
Useful when the policy is not differentiable.
Key idea - move a knob a little, see how much and in what direction it changes output, that gives you an estimate of the gradient.
For each dimension k∈[1,n], estimate the kth partial derivative by perturbing θ by a small ϵ in the kth dimension (uk being a unit vector with 1 in the kth component and 0 elsewhere):
$$\frac{\partial J(\theta)}{\partial \theta_k} \approx \frac{J(\theta + \epsilon u_k) - J(\theta)}{\epsilon}$$
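A minimal sketch of this estimator in Python; `evaluate_policy` is a hypothetical black box that returns an estimate of J(θ), e.g. the mean return of a few episodes run with parameters θ:

```python
import numpy as np

def finite_difference_gradient(evaluate_policy, theta, eps=1e-2):
    """Estimate the policy gradient by perturbing each parameter in turn.

    evaluate_policy: callable returning an estimate of J(theta), e.g. the
        mean return of a few episodes run with policy parameters theta.
    theta: 1-D array of policy parameters.
    eps: size of the perturbation in each dimension.
    """
    n = theta.shape[0]
    grad = np.zeros(n)
    j_theta = evaluate_policy(theta)            # baseline objective value
    for k in range(n):
        u_k = np.zeros(n)
        u_k[k] = 1.0                            # unit vector for dimension k
        j_perturbed = evaluate_policy(theta + eps * u_k)
        grad[k] = (j_perturbed - j_theta) / eps
    return grad
```

Note that each gradient estimate needs n + 1 policy evaluations, which is why this is simple but expensive.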
Score Function
We want to compute the policy gradient analytically.
Assume the policy πθ is differentiable whenever it is non-zero.
The quantity we want is ∇θπθ(s,a).
Multiplying and dividing by the policy, and using the identity ∇θπθ(s,a)/πθ(s,a) = ∇θlogπθ(s,a):
∇θπθ(s,a)=πθ(s,a)∇θlogπθ(s,a)
We call ∇θlogπθ(s,a) the score function.
Note that logπθ(s,a) is the log of a probability density - the log likelihood - exactly what we maximize in Maximum Likelihood Estimation.
So the score is the gradient of the log likelihood.
See the slides for some examples; the two standard ones are written out just below.
The key intuition: in general, the score takes the form ⟨action taken⟩ − ⟨mean action under the policy⟩. It measures “how much more than average am I doing this?”, and so tells us what to do more of and what to do less of.
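For illustration (these are the standard softmax and Gaussian examples, with φ a feature vector):

$$\pi_\theta(s,a) \propto e^{\phi(s,a)^\top \theta} \;\Rightarrow\; \nabla_\theta \log \pi_\theta(s,a) = \phi(s,a) - \mathbb{E}_{\pi_\theta}\left[\phi(s,\cdot)\right]$$

$$a \sim \mathcal{N}\left(\mu(s), \sigma^2\right),\; \mu(s) = \phi(s)^\top \theta \;\Rightarrow\; \nabla_\theta \log \pi_\theta(s,a) = \frac{(a - \mu(s))\,\phi(s)}{\sigma^2}$$

In both cases the score is exactly “what I did” minus “what I do on average”.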
But what do we do with this score? It lets us turn the gradient of an expectation (say, of reward) into an expectation of the score times that quantity - exactly what a gradient method needs - and expectations under the policy are easy to estimate by sampling.
One-Step MDP Policy Gradient
Start in some state s∼d(s).
Terminate after one step with reward $r = \mathcal{R}_{s,a}$.
What’s the policy gradient?
The objective function is

$$J(\theta) = \mathbb{E}_{\pi_\theta}[r] = \sum_{s \in \mathcal{S}} d(s) \sum_{a \in \mathcal{A}} \pi_\theta(s,a)\, \mathcal{R}_{s,a}$$
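Applying the score-function identity from above turns the gradient of this objective into an expectation we can sample under the policy:

$$\nabla_\theta J(\theta) = \sum_{s \in \mathcal{S}} d(s) \sum_{a \in \mathcal{A}} \pi_\theta(s,a)\, \nabla_\theta \log \pi_\theta(s,a)\, \mathcal{R}_{s,a} = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a)\, r\right]$$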
Policy Gradient Theorem
Replacing r - the one-step reward - with the long-term action-value function Qπθ(s,a) gives the gradient of the policy objective function for multi-step MDPs (and this holds for all three objectives above):
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a)\, Q^{\pi_\theta}(s,a)\right]$$
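Sampling this expectation with the return $v_t$ as an unbiased estimate of $Q^{\pi_\theta}(s_t, a_t)$ gives the Monte Carlo policy gradient (REINFORCE) update referred to below:

$$\Delta\theta_t = \alpha\, \nabla_\theta \log \pi_\theta(s_t, a_t)\, v_t$$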
Actor Critic Policy Gradient
Key idea: in Monte Carlo Policy Gradient, the samples vt of the action-value function Qπθ(st,at) can vary a lot from episode to episode, so the gradient updates are noisy. Instead, let's replace vt with a separately maintained parametric estimate Qw(s,a) ≈ Qπθ(s,a).
So now we have two sets of parameters:
Critic: updates the action-value function parameters w, i.e. learns Qw(s,a).
Actor: updates the policy parameters θ in the direction suggested by the critic's action-value estimate.
Critic: Estimating the Action-Value Function
Familiar problem of policy evaluation
Can use Monte Carlo, TD(0), TD(λ)
See this as another instance of Generalized Policy Iteration (GPI): the critic evaluates the current policy, and the actor improves it.
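Putting the two together, here is a minimal sketch of a QAC-style actor-critic loop in Python, assuming a discrete-action softmax actor, a linear critic, and hypothetical `features(s)` and `env.reset()`/`env.step(a)` interfaces (these names are illustrative, not from the slides):

```python
import numpy as np

def softmax_probs(theta, phi_s):
    """phi_s: (num_actions, d) action-feature matrix for one state; returns action probabilities."""
    prefs = phi_s @ theta
    prefs = prefs - prefs.max()          # subtract max for numerical stability
    expd = np.exp(prefs)
    return expd / expd.sum()

def qac_episode(env, features, theta, w, alpha_theta=0.01, alpha_w=0.1, gamma=0.99):
    """One episode of Q actor-critic (softmax actor, linear critic).

    Assumed (hypothetical) interfaces:
      features(s) -> (num_actions, d) array of features phi(s, a) for every action a
      env.reset() -> initial state s
      env.step(a) -> (next_state, reward, done)
    """
    s = env.reset()
    phi_s = features(s)
    probs = softmax_probs(theta, phi_s)
    a = np.random.choice(len(probs), p=probs)
    done = False
    while not done:
        s_next, r, done = env.step(a)
        q_sa = phi_s[a] @ w                          # critic's estimate Q_w(s, a)
        # Score of a softmax policy: phi(s, a) minus the expected feature under pi.
        score = phi_s[a] - probs @ phi_s
        # Actor update: step theta in the direction the critic suggests.
        theta = theta + alpha_theta * score * q_sa
        if done:
            td_target = r
        else:
            phi_next = features(s_next)
            probs_next = softmax_probs(theta, phi_next)
            a_next = np.random.choice(len(probs_next), p=probs_next)
            td_target = r + gamma * (phi_next[a_next] @ w)   # TD(0) target
        # Critic update: TD(0) on the linear Q approximation.
        w = w + alpha_w * (td_target - q_sa) * phi_s[a]
        if not done:
            s, phi_s, probs, a = s_next, phi_next, probs_next, a_next
    return theta, w
```

The critic here uses TD(0); swapping in Monte Carlo or TD(λ) targets would only change the `td_target` line.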