• Dynamic programming:
    • breaks a problem down into subproblems
    • solves each subproblem and combines the solutions
  • The problem needs to have:
    • optimal substructure - the optimal solution can be built from optimal solutions to its subproblems
    • overlapping subproblems - subproblems recur, so by solving one subproblem we solve part of the next bigger subproblem, and solutions can be cached and reused
  • MDPs have both these properties
    • Bellman equations give a recursive decomposition - the value function is defined in terms of the values of successor states (written out below)
    • Composition is just adding the immediate reward to the discounted value of the next state, which is the already-solved subproblem
  • DP assumes full knowledge of the MDP - the transition dynamics and the rewards
    • So we are just planning, not actually solving the full RL problem
  • Two parts of this:
    • Prediction: Given an MDP and a policy $\pi$, solve for its state-value function $v_\pi$
    • Control: Given an MDP, solve for the optimal policy $\pi_*$
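
For reference, this is the recursive decomposition the Bellman expectation equation gives, in the standard MDP notation ($\pi(a \mid s)$ for the policy, $\mathcal{P}^a_{ss'}$ for transition probabilities, $\mathcal{R}^a_s$ for the expected immediate reward, $\gamma$ for the discount):

$$ v_\pi(s) = \sum_{a} \pi(a \mid s) \Big( \mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'} \, v_\pi(s') \Big) $$

The immediate reward is the "current" piece; the discounted value of the successor state is the already-solved subproblem.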

Policy Evaluation

  • Problem: Find the value function $v_\pi$ for a given policy $\pi$
  • Given: an MDP and a policy $\pi$
  • Solution: start with any value function $v_1$ (all zeros works fine), then iteratively apply the Bellman expectation backup (because we back up the values evaluated at the leaves of the one-step lookahead to compute the higher nodes) until convergence to $v_\pi$ (sketched below)
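
A minimal sketch of this loop, assuming a tabular MDP handed over as nested lists `P[s][a] = [(prob, next_state, reward), ...]` - that interface is an assumption for illustration, not from the lecture:

```python
import numpy as np

def policy_evaluation(P, policy, gamma=0.99, theta=1e-8):
    """Iterative policy evaluation: apply the Bellman expectation backup
    until no state value changes by more than `theta`.

    P[s][a] : list of (prob, next_state, reward) tuples  (assumed tabular MDP format)
    policy  : array of shape (n_states, n_actions) with policy[s, a] = pi(a|s)
    """
    n_states, n_actions = policy.shape
    V = np.zeros(n_states)              # any initial values work; zeros are convenient
    while True:
        V_new = np.zeros(n_states)      # synchronous backup: read old copy, write new copy
        for s in range(n_states):
            for a in range(n_actions):
                for prob, s_next, reward in P[s][a]:
                    V_new[s] += policy[s, a] * prob * (reward + gamma * V[s_next])
        if np.max(np.abs(V_new - V)) < theta:
            return V_new
        V = V_new
```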

Policy Iteration

  • Problem: find the optimal policy
  • Given: an MDP and a starting policy
  • Solution:
    1. Evaluate the policy by computing the value function like in the previous section
    2. Improve the policy by acting greedily with respect to this value function, i.e., the new policy picks whichever action leads to the highest value: $\pi'(s) = \arg\max_{a} q_\pi(s, a)$. This is called one step of policy improvement.
    3. Repeat policy evaluation and policy improvement alternately until the policy (or value function) stops changing, i.e., $\pi' = \pi$ (see the sketch after this list)
  • No matter which policy you start with, you end up with an optimal deterministic policy.
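
Continuing the same hypothetical tabular setup, and reusing the `policy_evaluation` sketch above, policy iteration is just the evaluate/improve loop (the `greedy_policy` helper name is made up for illustration):

```python
import numpy as np

def greedy_policy(P, V, n_states, n_actions, gamma=0.99):
    """Act greedily w.r.t. V: probability 1 on the action with the best one-step lookahead."""
    policy = np.zeros((n_states, n_actions))
    for s in range(n_states):
        q = [sum(prob * (reward + gamma * V[s_next]) for prob, s_next, reward in P[s][a])
             for a in range(n_actions)]
        policy[s, int(np.argmax(q))] = 1.0
    return policy

def policy_iteration(P, n_states, n_actions, gamma=0.99):
    # start from any policy, e.g. uniform random
    policy = np.ones((n_states, n_actions)) / n_actions
    while True:
        V = policy_evaluation(P, policy, gamma)                        # 1. evaluate pi
        new_policy = greedy_policy(P, V, n_states, n_actions, gamma)   # 2. improve greedily
        if np.array_equal(new_policy, policy):                         # 3. stop when pi' == pi
            return policy, V
        policy = new_policy
```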

Modified Policy Iteration

  • We don't really need to evaluate the value function all the way to convergence. We could stop once we are within $\epsilon$ of the last value function.
  • Or, we could just do $k$ sweeps of iterative policy evaluation and then update the policy according to the new value function (sketched below).
  • When $k = 1$, i.e., we update the policy after every single evaluation sweep, it's called Value Iteration.
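
A sketch of the truncated inner loop, under the same assumed tabular interface as above (the `truncated_policy_evaluation` name is hypothetical); swapping this in for full policy evaluation gives modified policy iteration:

```python
def truncated_policy_evaluation(P, policy, V, k=3, gamma=0.99):
    """Run only k sweeps of the Bellman expectation backup instead of iterating
    to convergence. With k = 1, alternating this with greedy improvement
    degenerates into value iteration."""
    n_states, n_actions = policy.shape
    for _ in range(k):
        V_new = V.copy()                # synchronous sweep over all states
        for s in range(n_states):
            V_new[s] = sum(policy[s, a] * prob * (reward + gamma * V[s_next])
                           for a in range(n_actions)
                           for prob, s_next, reward in P[s][a])
        V = V_new
    return V
```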

Value Iteration

  • Problem: find the optimal policy
  • Given: an MDP and a starting value function $v_1$ (arbitrary; no explicit policy is maintained along the way)
  • Solution: Iterative application of the Bellman optimality backup (see the sketch below)
    • Each backup is equivalent to doing one sweep of policy evaluation and one step of greedy policy improvement, folded into a single update
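
A minimal sketch, again assuming the hypothetical `P[s][a] = [(prob, next_state, reward), ...]` format used above; only values are stored, and a greedy deterministic policy is extracted at the very end:

```python
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.99, theta=1e-8):
    def lookahead(V, s):
        # one-step lookahead: expected return of each action under the current V
        return [sum(prob * (reward + gamma * V[s_next])
                    for prob, s_next, reward in P[s][a])
                for a in range(n_actions)]

    V = np.zeros(n_states)
    while True:
        # Bellman optimality backup: new value is the max over actions of the lookahead
        V_new = np.array([max(lookahead(V, s)) for s in range(n_states)])
        if np.max(np.abs(V_new - V)) < theta:
            V = V_new
            break
        V = V_new

    # acting greedily w.r.t. the converged values gives an optimal deterministic policy
    policy = np.zeros((n_states, n_actions))
    for s in range(n_states):
        policy[s, int(np.argmax(lookahead(V, s)))] = 1.0
    return policy, V
```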

Value iteration is not the same as policy iteration.

The intermediate value functions may not correspond to the value function of any real policy. Why?
In policy iteration, we run policy evaluation until the value function converges to $v_\pi$, then update the policy, then evaluate again until convergence; every successive value function is at least as good as the previous one, and every intermediate result really is the value of some policy. In value iteration we never wait for convergence - we greedily back up intermediate values as we go - so an intermediate value function need not be the value function of any actual policy. It is just an intermediate result on the way to $v_*$.

Summary

Problem    | Bellman Equation                                          | Algorithm
---------- | --------------------------------------------------------- | ----------------------------
Prediction | Bellman Expectation Equation                              | Iterative Policy Evaluation
Control    | Bellman Expectation Equation + Greedy Policy Improvement  | Policy Iteration
Control    | Bellman Optimality Equation                               | Value Iteration

Some Considerations

  • So far, we discussed synchronous backups - we would keep 2 copies of the value function for all states and use the old copy to update the new copy.
  • Async DP is possible too - we keep one copy, and update values with whatever value of another state is available - old or new. This reduces computation.
  • Can also use a priority queue to back up the states with the largest Bellman error first - the difference between a state's new backed-up value and its old value (a rough sketch follows this list)
  • When the state space gets really, really large, we can't do full-width backups - considering the entire breadth of the lookahead tree is too expensive.
    • Instead, we sample a single trajectory of states and actions and apply the backup only along it.
    • This leads to model-free methods - we no longer need the full model of rewards and transitions, we just sample experience and work with that.
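
A rough sketch of the priority-queue idea under the same hypothetical `P[s][a]` format (function names are made up). It keeps a single in-place copy of the values and a max-heap keyed on Bellman error; for simplicity only the state just backed up gets a refreshed priority, whereas a full prioritised-sweeping implementation would also re-prioritise its predecessors after each backup:

```python
import heapq
import numpy as np

def optimality_backup(P, V, s, n_actions, gamma):
    """Bellman optimality backup value for state s under the current V."""
    return max(sum(prob * (reward + gamma * V[s_next])
                   for prob, s_next, reward in P[s][a])
               for a in range(n_actions))

def prioritized_value_iteration(P, n_states, n_actions, gamma=0.99, n_backups=100_000):
    V = np.zeros(n_states)          # single copy: in-place / asynchronous updates
    # max-heap (negated keys) of states, keyed by their current Bellman error
    heap = [(-abs(optimality_backup(P, V, s, n_actions, gamma) - V[s]), s)
            for s in range(n_states)]
    heapq.heapify(heap)
    for _ in range(n_backups):
        if not heap:
            break
        _, s = heapq.heappop(heap)                              # largest (possibly stale) error
        V[s] = optimality_backup(P, V, s, n_actions, gamma)     # back it up in place
        err = abs(optimality_backup(P, V, s, n_actions, gamma) - V[s])
        heapq.heappush(heap, (-err, s))                         # re-queue with a fresh priority
    return V
```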