In the last lecture, we approximated the state-value or action-value function with some parametric function, and then used ϵ-greedy action selection over those values to generate a policy for control.
In this lecture:
We directly parametrize the policy with a function approximator, i.e.:
πθ(s,a)=P[a∣s,θ]
Once again, focussing on model-free RL.
Why policy based RL though:
Better convergence properties - following the gradient makes small, smooth changes to the policy rather than the drastic jumps a greedy improvement step can make.
Effective in high-dimensional or continuous action spaces - think of a 6-DOF robot deciding how to move in 3D space.
Can learn stochastic policies - which is necessary at times - think of rock-paper-scissors: a deterministic policy would be exploited and we would lose.
But, at the same time, policy based RL:
Typically (in its naive form) converges to a local rather than a global optimum.
Evaluating a policy is typically inefficient and has high variance.
Keep in mind that value based methods learn a near-deterministic policy - either greedy or ϵ-greedy.
A stochastic policy can reach states, with high probability, that a value based policy would never reach (or would take a very long time to reach).
In the case of a POMDP, where we do not have access to Markov states, a stochastic policy can itself be the optimal thing to do.
Policy Objective Functions
What’s the goal?
For episodic tasks: maximize the value of the start state.
For continuing tasks: maximize the average value - the value of each state weighted by the probability of being in that state under the policy, summed over all states.
Or, we could use the average reward per time step - we want a policy that gets us the most reward per step.
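Written out (with $d^{\pi_\theta}(s)$ denoting the stationary distribution of the Markov chain under $\pi_\theta$), these three objectives are usually stated as:

$$J_1(\theta) = V^{\pi_\theta}(s_1), \qquad J_{avV}(\theta) = \sum_{s} d^{\pi_\theta}(s)\, V^{\pi_\theta}(s), \qquad J_{avR}(\theta) = \sum_{s} d^{\pi_\theta}(s) \sum_{a} \pi_\theta(s,a)\, \mathcal{R}_{s,a}$$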
Policy based RL is an optimization problem - we are trying to maximize some objective J(θ).
If a gradient is available, gradient based methods are usually more efficient than gradient-free optimization methods.
Policy Gradient Ascent
Since our objective J(θ) is a total reward rather than a loss, we want to maximize it - hence gradient ascent.
The maths is the same as for gradient descent except for the sign - here we move in the direction of the gradient.
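Concretely, each update steps the parameters a small amount α in the direction of the policy gradient:

$$\Delta\theta = \alpha\, \nabla_\theta J(\theta), \qquad \nabla_\theta J(\theta) = \left( \frac{\partial J(\theta)}{\partial \theta_1}, \ldots, \frac{\partial J(\theta)}{\partial \theta_n} \right)^{\top}$$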
Computing the Gradient by Finite Differences
For each dimension, perturb the parameters by a small amount ϵ in that dimension; the resulting change in the objective gives an estimate of the gradient along that dimension.
Very computationally expensive - though simple.
Useful when the policy is not differentiable.
Key idea - move a knob a little, see how much and in what direction it changes output, that gives you an estimate of the gradient.
For each dimension k∈[1,n], estimate the kth partial derivative by perturbing θ by a small ϵ in the kth dimension (uk being a unit vector with 1 in the kth component and 0 elsewhere):
$$\frac{\partial J(\theta)}{\partial \theta_k} \approx \frac{J(\theta + \epsilon u_k) - J(\theta)}{\epsilon}$$
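A minimal sketch of this estimator in Python; `evaluate_policy` is a hypothetical black box that returns an estimate of J(θ), e.g. the mean return of a few episodes run with parameters θ:

```python
import numpy as np

def finite_difference_gradient(evaluate_policy, theta, eps=1e-2):
    """Estimate the policy gradient by perturbing each parameter in turn.

    evaluate_policy: callable returning an estimate of J(theta), e.g. the
        mean return of a few episodes run with policy parameters theta.
    theta: 1-D array of policy parameters.
    eps: size of the perturbation in each dimension.
    """
    n = theta.shape[0]
    grad = np.zeros(n)
    j_theta = evaluate_policy(theta)            # baseline objective value
    for k in range(n):
        u_k = np.zeros(n)
        u_k[k] = 1.0                            # unit vector for dimension k
        j_perturbed = evaluate_policy(theta + eps * u_k)
        grad[k] = (j_perturbed - j_theta) / eps
    return grad
```

Note that each gradient estimate needs n + 1 policy evaluations, which is why this is simple but expensive.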
Score Function
We want to compute the policy gradient analytically.
Assume the policy πθ is differentiable whenever it is non-zero.
The quantity we want is ∇θπθ(s,a).
Multiplying and dividing by the policy, and using the identity ∇θπθ(s,a)/πθ(s,a) = ∇θlogπθ(s,a):
∇θπθ(s,a)=πθ(s,a)∇θlogπθ(s,a)
We call ∇θlogπθ(s,a) the score function.
Note that logπθ(s,a) is the log of a probability density - the log likelihood - exactly what we maximize in Maximum Likelihood Estimation.
So the score is the gradient of the log likelihood.
See the slides for some examples; the two standard ones are written out just below.
The key intuition: in general, the score takes the form ⟨action taken⟩ − ⟨mean action under the policy⟩. It measures “how much more than average am I doing this?”, and so tells us what to do more of and what to do less of.
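For illustration (these are the standard softmax and Gaussian examples, with φ a feature vector):

$$\pi_\theta(s,a) \propto e^{\phi(s,a)^\top \theta} \;\Rightarrow\; \nabla_\theta \log \pi_\theta(s,a) = \phi(s,a) - \mathbb{E}_{\pi_\theta}\left[\phi(s,\cdot)\right]$$

$$a \sim \mathcal{N}\left(\mu(s), \sigma^2\right),\; \mu(s) = \phi(s)^\top \theta \;\Rightarrow\; \nabla_\theta \log \pi_\theta(s,a) = \frac{(a - \mu(s))\,\phi(s)}{\sigma^2}$$

In both cases the score is exactly “what I did” minus “what I do on average”.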
But what do we do with this score? It lets us turn the gradient of an expectation (say, of reward) into an expectation of the score times that quantity - exactly what a gradient method needs - and expectations under the policy are easy to estimate by sampling.
One-Step MDP Policy Gradient
Start in some state s∼d(s).
Terminate after one step with reward $r = \mathcal{R}_{s,a}$.
What’s the policy gradient?
The objective function is

$$J(\theta) = \mathbb{E}_{\pi_\theta}[r] = \sum_{s \in \mathcal{S}} d(s) \sum_{a \in \mathcal{A}} \pi_\theta(s,a)\, \mathcal{R}_{s,a}$$
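Applying the score-function identity from above turns the gradient of this objective into an expectation we can sample under the policy:

$$\nabla_\theta J(\theta) = \sum_{s \in \mathcal{S}} d(s) \sum_{a \in \mathcal{A}} \pi_\theta(s,a)\, \nabla_\theta \log \pi_\theta(s,a)\, \mathcal{R}_{s,a} = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a)\, r\right]$$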
Policy Gradient Theorem
Replacing r - the one-step reward - with the long-term action-value function Qπθ(s,a) gives the gradient of the policy objective function for multi-step MDPs (and this holds for all three objectives above):
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a)\, Q^{\pi_\theta}(s,a)\right]$$
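Sampling this expectation with the return $v_t$ as an unbiased estimate of $Q^{\pi_\theta}(s_t, a_t)$ gives the Monte Carlo policy gradient (REINFORCE) update referred to below:

$$\Delta\theta_t = \alpha\, \nabla_\theta \log \pi_\theta(s_t, a_t)\, v_t$$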
Actor Critic Policy Gradient
Key idea: in Monte Carlo Policy Gradient, the samples vt of the action-value function Qπθ(st,at) can vary a lot from episode to episode, so the gradient updates are noisy. Instead, let's replace vt with a separately maintained parametric estimate Qw(s,a) ≈ Qπθ(s,a).
So now we have two sets of parameters:
Critic: updates the action-value function parameters w, i.e. learns Qw(s,a).
Actor: updates the policy parameters θ in the direction suggested by the critic's action-value estimate.
Critic: Estimating the Action-Value Function
Familiar problem of policy evaluation
Can use Monte Carlo, TD(0), TD(λ)
See this as another instance of Generalized Policy Iteration (GPI): the critic evaluates the current policy, and the actor improves it.
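Putting the two together, here is a minimal sketch of a QAC-style actor-critic loop in Python, assuming a discrete-action softmax actor, a linear critic, and hypothetical `features(s)` and `env.reset()`/`env.step(a)` interfaces (these names are illustrative, not from the slides):

```python
import numpy as np

def softmax_probs(theta, phi_s):
    """phi_s: (num_actions, d) action-feature matrix for one state; returns action probabilities."""
    prefs = phi_s @ theta
    prefs = prefs - prefs.max()          # subtract max for numerical stability
    expd = np.exp(prefs)
    return expd / expd.sum()

def qac_episode(env, features, theta, w, alpha_theta=0.01, alpha_w=0.1, gamma=0.99):
    """One episode of Q actor-critic (softmax actor, linear critic).

    Assumed (hypothetical) interfaces:
      features(s) -> (num_actions, d) array of features phi(s, a) for every action a
      env.reset() -> initial state s
      env.step(a) -> (next_state, reward, done)
    """
    s = env.reset()
    phi_s = features(s)
    probs = softmax_probs(theta, phi_s)
    a = np.random.choice(len(probs), p=probs)
    done = False
    while not done:
        s_next, r, done = env.step(a)
        q_sa = phi_s[a] @ w                          # critic's estimate Q_w(s, a)
        # Score of a softmax policy: phi(s, a) minus the expected feature under pi.
        score = phi_s[a] - probs @ phi_s
        # Actor update: step theta in the direction the critic suggests.
        theta = theta + alpha_theta * score * q_sa
        if done:
            td_target = r
        else:
            phi_next = features(s_next)
            probs_next = softmax_probs(theta, phi_next)
            a_next = np.random.choice(len(probs_next), p=probs_next)
            td_target = r + gamma * (phi_next[a_next] @ w)   # TD(0) target
        # Critic update: TD(0) on the linear Q approximation.
        w = w + alpha_w * (td_target - q_sa) * phi_s[a]
        if not done:
            s, phi_s, probs, a = s_next, phi_next, probs_next, a_next
    return theta, w
```

The critic here uses TD(0); swapping in Monte Carlo or TD(λ) targets would only change the `td_target` line.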