• Dynamic programming:
    • breaks a problem down into subproblems
    • solves each subproblem and combines the solutions
  • The problem needs to have:
    • optimal substructure - the optimal solution can be built from optimal solutions to its subproblems
    • overlapping subproblems - subproblems recur, so by solving one subproblem we solve part of the next bigger subproblem, and solutions can be cached and reused
  • MDPs have both these properties
    • Bellman equations give a recursive decomposition - the value function is defined in terms of the values of successor states (written out below)
    • Composition is just adding the immediate reward to the discounted value of the next state, which is the already-solved subproblem
  • DP assumes full knowledge of the MDP - the transition dynamics and the rewards
    • So we are just planning, not actually solving the full RL problem
  • Two parts of this:
    • Prediction: Given an MDP and a policy $\pi$, solve for its state-value function $v_\pi$
    • Control: Given an MDP, solve for the optimal policy $\pi_*$
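
For reference, this is the recursive decomposition the Bellman expectation equation gives, in the standard MDP notation ($\pi(a \mid s)$ for the policy, $\mathcal{P}^a_{ss'}$ for transition probabilities, $\mathcal{R}^a_s$ for the expected immediate reward, $\gamma$ for the discount):

$$ v_\pi(s) = \sum_{a} \pi(a \mid s) \Big( \mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'} \, v_\pi(s') \Big) $$

The immediate reward is the "current" piece; the discounted value of the successor state is the already-solved subproblem.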

Policy Evaluation

  • Problem: Find the value function $v_\pi$ for a given policy $\pi$
  • Given: an MDP and a policy $\pi$
  • Solution: start with any value function $v_1$ (all zeros works fine), then iteratively apply the Bellman expectation backup (because we back up the values evaluated at the leaves of the one-step lookahead to compute the higher nodes) until convergence to $v_\pi$ (sketched below)
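
A minimal sketch of this loop, assuming a tabular MDP handed over as nested lists `P[s][a] = [(prob, next_state, reward), ...]` - that interface is an assumption for illustration, not from the lecture:

```python
import numpy as np

def policy_evaluation(P, policy, gamma=0.99, theta=1e-8):
    """Iterative policy evaluation: apply the Bellman expectation backup
    until no state value changes by more than `theta`.

    P[s][a] : list of (prob, next_state, reward) tuples  (assumed tabular MDP format)
    policy  : array of shape (n_states, n_actions) with policy[s, a] = pi(a|s)
    """
    n_states, n_actions = policy.shape
    V = np.zeros(n_states)              # any initial values work; zeros are convenient
    while True:
        V_new = np.zeros(n_states)      # synchronous backup: read old copy, write new copy
        for s in range(n_states):
            for a in range(n_actions):
                for prob, s_next, reward in P[s][a]:
                    V_new[s] += policy[s, a] * prob * (reward + gamma * V[s_next])
        if np.max(np.abs(V_new - V)) < theta:
            return V_new
        V = V_new
```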

Policy Iteration

  • Problem: find the optimal policy
  • Given: an MDP and a starting policy
  • Solution:
    1. Evaluate the policy by computing the value function like in the previous section
    2. Improve the policy by acting greedily with respect to this value function, i.e., the new policy picks whichever action leads to the highest value: $\pi'(s) = \arg\max_{a} q_\pi(s, a)$. This is called one step of policy improvement.
    3. Repeat policy evaluation and policy improvement alternately until the policy (or value function) stops changing, i.e., $\pi' = \pi$ (see the sketch after this list)
  • No matter which policy you start with, you end up with an optimal deterministic policy.
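
Continuing the same hypothetical tabular setup, and reusing the `policy_evaluation` sketch above, policy iteration is just the evaluate/improve loop (the `greedy_policy` helper name is made up for illustration):

```python
import numpy as np

def greedy_policy(P, V, n_states, n_actions, gamma=0.99):
    """Act greedily w.r.t. V: probability 1 on the action with the best one-step lookahead."""
    policy = np.zeros((n_states, n_actions))
    for s in range(n_states):
        q = [sum(prob * (reward + gamma * V[s_next]) for prob, s_next, reward in P[s][a])
             for a in range(n_actions)]
        policy[s, int(np.argmax(q))] = 1.0
    return policy

def policy_iteration(P, n_states, n_actions, gamma=0.99):
    # start from any policy, e.g. uniform random
    policy = np.ones((n_states, n_actions)) / n_actions
    while True:
        V = policy_evaluation(P, policy, gamma)                        # 1. evaluate pi
        new_policy = greedy_policy(P, V, n_states, n_actions, gamma)   # 2. improve greedily
        if np.array_equal(new_policy, policy):                         # 3. stop when pi' == pi
            return policy, V
        policy = new_policy
```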

Modified Policy Iteration

  • We don't really need to evaluate the value function all the way to convergence. We could stop once we are within $\epsilon$ of the last value function.
  • Or, we could just do $k$ sweeps of iterative policy evaluation and then update the policy according to the new value function (sketched below).
  • When $k = 1$, i.e., we update the policy after every single evaluation sweep, it's called Value Iteration.
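
A sketch of the truncated inner loop, under the same assumed tabular interface as above (the `truncated_policy_evaluation` name is hypothetical); swapping this in for full policy evaluation gives modified policy iteration:

```python
def truncated_policy_evaluation(P, policy, V, k=3, gamma=0.99):
    """Run only k sweeps of the Bellman expectation backup instead of iterating
    to convergence. With k = 1, alternating this with greedy improvement
    degenerates into value iteration."""
    n_states, n_actions = policy.shape
    for _ in range(k):
        V_new = V.copy()                # synchronous sweep over all states
        for s in range(n_states):
            V_new[s] = sum(policy[s, a] * prob * (reward + gamma * V[s_next])
                           for a in range(n_actions)
                           for prob, s_next, reward in P[s][a])
        V = V_new
    return V
```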

Value Iteration

  • Problem: find the optimal policy
  • Given: an MDP and a starting value function $v_1$ (arbitrary; no explicit policy is maintained along the way)
  • Solution: Iterative application of the Bellman optimality backup (see the sketch below)
    • Each backup is equivalent to doing one sweep of policy evaluation and one step of greedy policy improvement, folded into a single update
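
A minimal sketch, again assuming the hypothetical `P[s][a] = [(prob, next_state, reward), ...]` format used above; only values are stored, and a greedy deterministic policy is extracted at the very end:

```python
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.99, theta=1e-8):
    def lookahead(V, s):
        # one-step lookahead: expected return of each action under the current V
        return [sum(prob * (reward + gamma * V[s_next])
                    for prob, s_next, reward in P[s][a])
                for a in range(n_actions)]

    V = np.zeros(n_states)
    while True:
        # Bellman optimality backup: new value is the max over actions of the lookahead
        V_new = np.array([max(lookahead(V, s)) for s in range(n_states)])
        if np.max(np.abs(V_new - V)) < theta:
            V = V_new
            break
        V = V_new

    # acting greedily w.r.t. the converged values gives an optimal deterministic policy
    policy = np.zeros((n_states, n_actions))
    for s in range(n_states):
        policy[s, int(np.argmax(lookahead(V, s)))] = 1.0
    return policy, V
```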

Value iteration is not the same as policy iteration.

The intermediate value functions may not correspond to the value function of any real policy. Why?
In policy iteration, we run policy evaluation until the value function converges to $v_\pi$, then update the policy, then evaluate again until convergence; every successive value function is at least as good as the previous one, and every intermediate result really is the value of some policy. In value iteration we never wait for convergence - we greedily back up intermediate values as we go - so an intermediate value function need not be the value function of any actual policy. It is just an intermediate result on the way to $v_*$.

Summary

Problem    | Bellman Equation                                          | Algorithm
---------- | --------------------------------------------------------- | ----------------------------
Prediction | Bellman Expectation Equation                              | Iterative Policy Evaluation
Control    | Bellman Expectation Equation + Greedy Policy Improvement  | Policy Iteration
Control    | Bellman Optimality Equation                               | Value Iteration

Some Considerations

  • So far, we discussed synchronous backups - we would keep 2 copies of the value function for all states and use the old copy to update the new copy.
  • Async DP is possible too - we keep one copy, and update values with whatever value of another state is available - old or new. This reduces computation.
  • Can also use a priority queue to back up the states with the largest Bellman error first - the difference between a state's new backed-up value and its old value (a rough sketch follows this list)
  • When the state space gets really, really large, we can't do full-width backups - considering the entire breadth of the lookahead tree is too expensive.
    • Instead, we sample a single trajectory of states and actions and apply the backup only along it.
    • This leads to model-free methods - we no longer need the full model of rewards and transitions, we just sample experience and work with that.
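
A rough sketch of the priority-queue idea under the same hypothetical `P[s][a]` format (function names are made up). It keeps a single in-place copy of the values and a max-heap keyed on Bellman error; for simplicity only the state just backed up gets a refreshed priority, whereas a full prioritised-sweeping implementation would also re-prioritise its predecessors after each backup:

```python
import heapq
import numpy as np

def optimality_backup(P, V, s, n_actions, gamma):
    """Bellman optimality backup value for state s under the current V."""
    return max(sum(prob * (reward + gamma * V[s_next])
                   for prob, s_next, reward in P[s][a])
               for a in range(n_actions))

def prioritized_value_iteration(P, n_states, n_actions, gamma=0.99, n_backups=100_000):
    V = np.zeros(n_states)          # single copy: in-place / asynchronous updates
    # max-heap (negated keys) of states, keyed by their current Bellman error
    heap = [(-abs(optimality_backup(P, V, s, n_actions, gamma) - V[s]), s)
            for s in range(n_states)]
    heapq.heapify(heap)
    for _ in range(n_backups):
        if not heap:
            break
        _, s = heapq.heappop(heap)                              # largest (possibly stale) error
        V[s] = optimality_backup(P, V, s, n_actions, gamma)     # back it up in place
        err = abs(optimality_backup(P, V, s, n_actions, gamma) - V[s])
        heapq.heappush(heap, (-err, s))                         # re-queue with a fresh priority
    return V
```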