Instead of learning a value function or policy - can we learn the model of the environment directly from experience?
And then use that model to plan - look ahead and invoke our model to construct a policy/value function for a task.
A model tells us about:
The state transition probabilities $\mathcal{P}^a_{ss'}$
The reward function $\mathcal{R}^a_s$
Model free RL:
Learn value function (and/or policy) from experience
No explicit model
Model based RL:
Learn a model from experience
Plan value-function (and/or policy) from model
Model Based RL
Get some experience → learn a model → use the model to plan → get a value function/policy → use this policy to get more experience.
Advantage:
Can efficiently leverage supervised learning to learn the model of the MDP - sometimes the value function is very hard to learn but the model is relatively straightforward.
Think about chess - roughly $10^{45}$ possible states, so building a value function directly is hard. But the model is basically just the rules of the game - a relatively simple MDP. Given the model, we can then do things like tree search to figure out the value function.
Disadvantage:
Two sources of errors - the error in building a model, and then the error in approximating a value function.
What is a model?
A representation of an MDP $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R} \rangle$, parameterized by $\eta$
A model $\mathcal{M} = \langle \mathcal{P}_\eta, \mathcal{R}_\eta \rangle$ represents state transitions and rewards:
$S_{t+1} \sim \mathcal{P}_\eta(S_{t+1} \mid S_t, A_t)$
$R_{t+1} = \mathcal{R}_\eta(R_{t+1} \mid S_t, A_t)$
Typically we assume conditional independence between state transitions and rewards
$\mathbb{P}[S_{t+1}, R_{t+1} \mid S_t, A_t] = \mathbb{P}[S_{t+1} \mid S_t, A_t]\,\mathbb{P}[R_{t+1} \mid S_t, A_t]$
Goal: Learn the model $\mathcal{M}_\eta$ from experience $\{S_1, A_1, R_2, S_2, \ldots\}$
This is supervised learning
Learning r from s,a is a regression problem - can use MSE loss
Learning s,a→s′ is a density estimation problem - can use KL divergence
Find parameters $\eta$ that minimize the empirical loss
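As a rough sketch of this supervised learning step (everything here is illustrative: the one-hot features, the toy data, the linear reward model and the softmax transition model), we can minimize MSE for the reward and cross-entropy/KL for the next-state distribution with plain gradient steps:

```python
import numpy as np

# Illustrative only: one-hot (state, action) features and a toy dataset.
n_states, n_actions = 4, 2

def phi(s, a):
    x = np.zeros(n_states * n_actions)
    x[s * n_actions + a] = 1.0
    return x

# Toy experience tuples (s, a, r, s') standing in for real experience.
data = [(0, 1, 1.0, 2), (0, 1, 0.0, 2), (2, 0, 5.0, 3), (3, 1, 0.0, 0)]

# Parameters eta: a linear reward model and a softmax transition model.
w_r = np.zeros(n_states * n_actions)               # reward weights
W_p = np.zeros((n_states * n_actions, n_states))   # transition logits

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

lr = 0.1
for _ in range(500):
    for s, a, r, s_next in data:
        x = phi(s, a)
        # Reward: gradient step on the squared error (r_hat - r)^2.
        r_hat = w_r @ x
        w_r -= lr * (r_hat - r) * x
        # Transition: gradient step on cross-entropy (KL to the observed s').
        p = softmax(W_p.T @ x)
        W_p -= lr * np.outer(x, p - np.eye(n_states)[s_next])

print(w_r @ phi(0, 1))             # ~0.5, the mean observed reward for (0, 1)
print(softmax(W_p.T @ phi(0, 1)))  # most probability mass on state 2
```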
Possible examples of models:
Table lookup
Linear expectation model
Linear Gaussian model
Deep Belief Network model
Table Lookup Model
Keep counts of visits $N(s,a)$ to each state-action pair - maintain the empirical mean of rewards and the empirical transition probabilities
Alternatively,
record all experience tuples $(S_t, A_t, R_{t+1}, S_{t+1})$ as a mapping from $(S_t, A_t) \to (R_{t+1}, S_{t+1})$,
and during inference, when you are at state-action pair $(S_t, A_t)$, simply sample uniformly at random from the tuples stored for that pair.
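A minimal sketch of the count-based variant (class and method names are illustrative, not from the lecture):

```python
import random
from collections import defaultdict

class TableLookupModel:
    """Count-based model: empirical mean reward and empirical transition
    probabilities for every (state, action) pair seen so far."""

    def __init__(self):
        self.counts = defaultdict(int)
        self.reward_sum = defaultdict(float)
        self.next_counts = defaultdict(lambda: defaultdict(int))

    def update(self, s, a, r, s_next):
        self.counts[(s, a)] += 1
        self.reward_sum[(s, a)] += r
        self.next_counts[(s, a)][s_next] += 1

    def sample(self, s, a):
        """Return the mean reward and a next state drawn from the empirical
        transition distribution for (s, a)."""
        n = self.counts[(s, a)]
        r = self.reward_sum[(s, a)] / n
        states = list(self.next_counts[(s, a)].keys())
        weights = list(self.next_counts[(s, a)].values())
        s_next = random.choices(states, weights=weights, k=1)[0]
        return r, s_next
```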
Planning with This Model
Planning → solving the MDP → finding the optimal value-function/policy/trajectory of actions
We’ve already seen how to do this given a model:
Value iteration
Policy iteration
Tree search
Let’s see a new one - Sample based planning
Sample Based Planning
Use the model only to generate samples - then apply model-free RL techniques like MC control, SARSA, or Q-learning to learn from these samples.
Sample-based planning methods are often more efficient than full-width planning, since sampling concentrates computation on transitions that are actually likely to occur.
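A minimal sketch of sample-based planning with Q-learning, assuming a model object like the table-lookup sketch above with `sample(s, a)` and `counts`:

```python
import random
from collections import defaultdict

def plan_with_samples(model, states, actions, n_updates=10_000,
                      alpha=0.1, gamma=0.9):
    """Plan by running Q-learning purely on transitions sampled from the
    learned model; no real environment interaction happens here."""
    Q = defaultdict(float)
    seen = [(s, a) for s in states for a in actions if model.counts[(s, a)] > 0]
    for _ in range(n_updates):
        s, a = random.choice(seen)        # a previously observed (s, a) pair
        r, s_next = model.sample(s, a)    # simulated experience from the model
        td_target = r + gamma * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (td_target - Q[(s, a)])
    return Q
```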
Planning with an Inaccurate Model
Upper bound of model-based RL performance → Optimal policy on the learned MDP
In other words, we’ll only be as good as the model we learnt
What if our model is very inaccurate?
Use model-free RL :)
Reason explicitly about model uncertainty
Integrated Architectures
Two sources of experience:
Real experience: Actually interacting with the environment
Simulated experience: Invoking our learned model and sampling experiences from that
What if we could combine Model-free RL and Model-based RL?
We get the Dyna architecture:
Learn model from real experience
Learn and plan value-function/policy from both real and simulated experience
Dyna-Q Algorithm
Initialize $Q(s,a)$ and $Model(s,a)$ for all $s \in \mathcal{S}, a \in \mathcal{A}$
Repeat indefinitely:
$S \leftarrow$ current (non-terminal) state
$A \leftarrow \epsilon$-greedy$(S, Q)$
Execute $A$; get $R$ and next state $S'$
Do the usual Q-learning update: $Q(S,A) \leftarrow Q(S,A) + \alpha\,[R + \gamma \max_a Q(S',a) - Q(S,A)]$
Update the model (by supervised learning): $Model(S,A) \leftarrow R, S'$
Repeat n times (sampling from the model):
Sample a random previously seen state $S$
Sample a random action $A$ previously taken in $S$
Sample a new reward and next state from the model: $Model(S,A) \to R, S'$
Q-learning step with the simulated reward and next state: $Q(S,A) \leftarrow Q(S,A) + \alpha\,[R + \gamma \max_a Q(S',a) - Q(S,A)]$
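A compact sketch of this loop (the `env.reset()`/`env.step()` interface and the hyperparameters are assumptions, not from the lecture):

```python
import random
from collections import defaultdict

def dyna_q(env, actions, n_steps=10_000, n_planning=5,
           alpha=0.1, gamma=0.95, epsilon=0.1):
    """Dyna-Q: interleave Q-learning on real experience with n_planning
    Q-learning updates on transitions replayed from the learned model."""
    Q = defaultdict(float)
    model = {}                       # (s, a) -> (r, s'): last observed outcome
    s = env.reset()                  # assumed environment interface
    for _ in range(n_steps):
        # epsilon-greedy action selection
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda b: Q[(s, b)])
        s_next, r, done = env.step(a)  # assumed environment interface
        # direct RL: Q-learning on the real transition
        target = r + (0.0 if done else gamma * max(Q[(s_next, b)] for b in actions))
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        # model learning: remember what (s, a) did last time
        model[(s, a)] = (r, s_next)
        # planning: n_planning simulated Q-learning updates from the model
        for _ in range(n_planning):
            (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
            ptarget = pr + gamma * max(Q[(ps_next, b)] for b in actions)
            Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
        s = env.reset() if done else s_next
    return Q
```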
Forward Search
Key idea: Solving the whole MDP is a waste of time - focus on the current state for now and solve the sub-MDP starting from current state.
Build a search tree with current state st as root.
Use a model of the MDP to look ahead.
Simulation-Based Search
Key idea: Forward search + Sampling
Rooted at the current state, build the search tree using simulated experience from the learned model
Then apply your favorite model-free RL algorithm to simulated episodes to get a search algorithm
Monte Carlo control → Monte Carlo search
SARSA → TD search
Simple Monte Carlo Search
Given a model $\mathcal{M}$ and a simulation policy $\pi$
For each action a∈A
Simulate $k$ episodes from the current state $s_t$
$Q(s_t, a) \leftarrow$ mean return of the $k$ episodes
Choose the current (real) action $a_t = \arg\max_{a \in \mathcal{A}} Q(s_t, a)$
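A minimal sketch, assuming a model with a `sample(s, a)` method and a fixed rollout policy; episodes are truncated at a fixed depth for simplicity:

```python
def mc_search(model, actions, s_t, rollout_policy, k=50, depth=20, gamma=1.0):
    """Simple Monte-Carlo search: for each action, simulate k (truncated)
    episodes from the current state under a fixed rollout policy, average
    the returns, and act greedily with respect to those estimates."""

    def rollout(s, a):
        ret, discount = 0.0, 1.0
        for _ in range(depth):
            r, s = model.sample(s, a)
            ret += discount * r
            discount *= gamma
            a = rollout_policy(s)
        return ret

    Q = {a: sum(rollout(s_t, a) for _ in range(k)) / k for a in actions}
    return max(Q, key=Q.get)          # the real action to take now
```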
Monte Carlo Tree Search (MCTS)
Key idea: Run Monte Carlo simulations from the current state and preferentially explore the more promising nodes. Save q-values estimated from this selective sampling and use them to guide the search - we expand the nodes that look most promising. So as not to completely ignore the unpromising parts, add an element of exploration.
Algorithmic idea: Why evaluate Q(s,a) only for the root state $s_t$ - let's do it also for all the (simulated) child states/actions. Build a search tree containing all states and actions visited in simulation, and store a q-value at each node.
In other words: Apply Monte Carlo control to the (simulated) sub-MDP from now.
Converges on the optimal search tree.
Estimate Q by taking the mean of the returns observed from each state-action pair, then act by maximizing over the q-values.
But we don’t have q values for the entire space - only the search tree we have explored so far. So we break the simulation into two phases:
When you're inside the tree (we have q-values for the state): choose the action $a = \arg\max_{a \in \mathcal{A}} Q(s, a)$, improving the tree policy
When you're outside the tree (states we haven't seen before): the policy is fixed - pick actions at random (a default rollout policy)
In every simulation:
Evaluate state-action values Q(S,A) by Monte-Carlo evaluation
Improve the tree policy - by ϵ-greedy
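A minimal sketch of MCTS in this spirit - ϵ-greedy tree policy inside the tree, random default policy outside, Monte-Carlo backups - with all names and the fixed simulation depth being illustrative choices:

```python
import random
from collections import defaultdict

def mcts(model, actions, s_root, n_sims=1000, depth=30, gamma=1.0, epsilon=0.1):
    """MCTS sketch: epsilon-greedy tree policy for states already in the tree,
    uniformly random default policy outside it, Monte-Carlo backups."""
    Q, N = defaultdict(float), defaultdict(int)
    tree = set()                                   # states added to the search tree
    for _ in range(n_sims):
        s, path, rewards = s_root, [], []
        for _ in range(depth):
            if s in tree and random.random() > epsilon:
                a = max(actions, key=lambda b: Q[(s, b)])   # tree policy
            else:
                a = random.choice(actions)                  # default policy
            r, s_next = model.sample(s, a)
            path.append((s, a))
            rewards.append(r)
            s = s_next
        tree.update(s for s, _ in path)            # grow the tree with visited states
        # Monte-Carlo backup: update every (s, a) on the path toward its return
        G = 0.0
        for (s, a), r in zip(reversed(path), reversed(rewards)):
            G = r + gamma * G
            N[(s, a)] += 1
            Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]
    return max(actions, key=lambda a: Q[(s_root, a)])
```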
TD Search - Bootstrapping Our Simulations
Key idea: Apply SARSA to the (simulated) sub-MDP from now.
Why? Because bootstrapping is a great idea when there's a chance you may have reached the state through another path in the tree - you already know something about the return from that state.
All the same benefits over MC search that TD control has over MC control:
Less variance (but more bias)
More efficient
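A minimal sketch of TD search, which swaps the Monte-Carlo backup for a SARSA update after every simulated step (again, names and the fixed depth are illustrative):

```python
import random
from collections import defaultdict

def td_search(model, actions, s_root, n_sims=1000, depth=30,
              alpha=0.1, gamma=1.0, epsilon=0.1):
    """TD search sketch: simulate episodes from the current state and update
    Q with SARSA after every simulated step, bootstrapping from Q(s', a')."""
    Q = defaultdict(float)

    def policy(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda b: Q[(s, b)])

    for _ in range(n_sims):
        s = s_root
        a = policy(s)
        for _ in range(depth):
            r, s_next = model.sample(s, a)
            a_next = policy(s_next)
            # SARSA: bootstrap from the next Q value instead of the full return
            Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
            s, a = s_next, a_next
    return max(actions, key=lambda a: Q[(s_root, a)])
```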
Dyna-2
Key idea: Store two sets of feature weights:
Long-term memory - updated from real experience using TD learning - represents general knowledge about the state that applies to any episode.
Short-term memory - updated from simulated experience using TD search - represents specific local knowledge about the current situation/episode.
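A heavily simplified sketch of the Dyna-2 bookkeeping, assuming a linear value function over features φ(s,a): acting uses the sum of the two memories, real-experience TD updates touch only the long-term weights, and TD-search updates touch only the short-term weights (class and method names are illustrative):

```python
import numpy as np

class Dyna2Memories:
    """Sketch of the Dyna-2 bookkeeping: a long-term and a short-term weight
    vector over the same features; behaviour uses their sum."""

    def __init__(self, n_features, alpha=0.01, beta=0.01):
        self.theta_long = np.zeros(n_features)    # long-term memory
        self.theta_short = np.zeros(n_features)   # short-term memory
        self.alpha, self.beta = alpha, beta       # illustrative step sizes

    def q(self, phi_sa):
        # Act using both memories: Q(s, a) = phi(s, a)^T (theta_long + theta_short)
        return phi_sa @ (self.theta_long + self.theta_short)

    def update_real(self, phi_sa, td_error):
        # TD learning on real experience updates the long-term memory
        self.theta_long += self.alpha * td_error * phi_sa

    def update_simulated(self, phi_sa, td_error):
        # TD search on simulated experience updates the short-term memory
        self.theta_short += self.beta * td_error * phi_sa

    def new_situation(self):
        # Short-term memory is local knowledge; clear it for a new situation
        self.theta_short[:] = 0.0
```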