Instead of learning a value function or policy - can we learn the model of the environment directly from experience?
And then use that model to plan - look ahead and invoke our model to construct a policy/value function for a task.
A model tells us about:
The state transition probabilities $\mathcal{P}^a_{ss'}$
The reward function $\mathcal{R}^a_s$
Model free RL:
Learn value function (and/or policy) from experience
No explicit model
Model based RL:
Learn a model from experience
Plan value-function (and/or policy) from model
Model Based RL
Get some experience → learn a model → use the model to plan → get a value function/policy → use this policy to get more experience.
Advantage:
Can efficiently leverage supervised learning to learn the model of the MDP - sometimes the value function is very hard to learn but the model is relatively straightforward.
Think about chess - roughly $10^{45}$ possible states, so building a value function directly is hard. But the model is basically just the rules of the game - a relatively simple MDP. Given the model, we can then do things like tree search to figure out the value function.
Disadvantage:
Two sources of errors - the error in building a model, and then the error in approximating a value function.
What is a model?
A representation of an MDP $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R} \rangle$, parameterized by $\eta$
A model $\mathcal{M} = \langle \mathcal{P}_\eta, \mathcal{R}_\eta \rangle$ represents state transitions and rewards:
$S_{t+1} \sim \mathcal{P}_\eta(S_{t+1} \mid S_t, A_t)$
$R_{t+1} = \mathcal{R}_\eta(R_{t+1} \mid S_t, A_t)$
Typically we assume conditional independence between state transitions and rewards
$\mathbb{P}[S_{t+1}, R_{t+1} \mid S_t, A_t] = \mathbb{P}[S_{t+1} \mid S_t, A_t]\,\mathbb{P}[R_{t+1} \mid S_t, A_t]$
Goal: Learn the model $\mathcal{M}_\eta$ from experience $\{S_1, A_1, R_2, S_2, \ldots\}$
This is supervised learning
Learning r from s,a is a regression problem - can use MSE loss
Learning s,a→s′ is a density estimation problem - can use KL divergence
Find parameters $\eta$ that minimize the empirical loss
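As a rough sketch of this supervised learning step (everything here is illustrative: the one-hot features, the toy data, the linear reward model and the softmax transition model), we can minimize MSE for the reward and cross-entropy/KL for the next-state distribution with plain gradient steps:

```python
import numpy as np

# Illustrative only: one-hot (state, action) features and a toy dataset.
n_states, n_actions = 4, 2

def phi(s, a):
    x = np.zeros(n_states * n_actions)
    x[s * n_actions + a] = 1.0
    return x

# Toy experience tuples (s, a, r, s') standing in for real experience.
data = [(0, 1, 1.0, 2), (0, 1, 0.0, 2), (2, 0, 5.0, 3), (3, 1, 0.0, 0)]

# Parameters eta: a linear reward model and a softmax transition model.
w_r = np.zeros(n_states * n_actions)               # reward weights
W_p = np.zeros((n_states * n_actions, n_states))   # transition logits

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

lr = 0.1
for _ in range(500):
    for s, a, r, s_next in data:
        x = phi(s, a)
        # Reward: gradient step on the squared error (r_hat - r)^2.
        r_hat = w_r @ x
        w_r -= lr * (r_hat - r) * x
        # Transition: gradient step on cross-entropy (KL to the observed s').
        p = softmax(W_p.T @ x)
        W_p -= lr * np.outer(x, p - np.eye(n_states)[s_next])

print(w_r @ phi(0, 1))             # ~0.5, the mean observed reward for (0, 1)
print(softmax(W_p.T @ phi(0, 1)))  # most probability mass on state 2
```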
Possible examples of models:
Table lookup
Linear expectation model
Linear Gaussian model
Deep Belief Network model
Table Lookup Model
Keep counts of visits $N(s,a)$ to each state-action pair - maintain the empirical mean of rewards and the empirical transition probabilities
Alternatively,
record all experience tuples $(S_t, A_t, R_{t+1}, S_{t+1})$ as a mapping from $(S_t, A_t) \to (R_{t+1}, S_{t+1})$,
and during inference, when you are at state-action pair $(S_t, A_t)$, simply sample uniformly at random from the tuples stored for that pair.
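A minimal sketch of the count-based variant (class and method names are illustrative, not from the lecture):

```python
import random
from collections import defaultdict

class TableLookupModel:
    """Count-based model: empirical mean reward and empirical transition
    probabilities for every (state, action) pair seen so far."""

    def __init__(self):
        self.counts = defaultdict(int)
        self.reward_sum = defaultdict(float)
        self.next_counts = defaultdict(lambda: defaultdict(int))

    def update(self, s, a, r, s_next):
        self.counts[(s, a)] += 1
        self.reward_sum[(s, a)] += r
        self.next_counts[(s, a)][s_next] += 1

    def sample(self, s, a):
        """Return the mean reward and a next state drawn from the empirical
        transition distribution for (s, a)."""
        n = self.counts[(s, a)]
        r = self.reward_sum[(s, a)] / n
        states = list(self.next_counts[(s, a)].keys())
        weights = list(self.next_counts[(s, a)].values())
        s_next = random.choices(states, weights=weights, k=1)[0]
        return r, s_next
```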
Planning with This Model
Planning → solving the MDP → finding the optimal value-function/policy/trajectory of actions
We’ve already seen how to do this given a model:
Value iteration
Policy iteration
Tree search
Let’s see a new one - Sample based planning
Sample Based Planning
Use the model only to generate samples - then apply model-free RL techniques like MC control, SARSA, or Q-learning to learn from these samples.
Sample-based planning methods are often more efficient than full-width planning, since sampling concentrates computation on transitions that are actually likely to occur.
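A minimal sketch of sample-based planning with Q-learning, assuming a model object like the table-lookup sketch above with `sample(s, a)` and `counts`:

```python
import random
from collections import defaultdict

def plan_with_samples(model, states, actions, n_updates=10_000,
                      alpha=0.1, gamma=0.9):
    """Plan by running Q-learning purely on transitions sampled from the
    learned model; no real environment interaction happens here."""
    Q = defaultdict(float)
    seen = [(s, a) for s in states for a in actions if model.counts[(s, a)] > 0]
    for _ in range(n_updates):
        s, a = random.choice(seen)        # a previously observed (s, a) pair
        r, s_next = model.sample(s, a)    # simulated experience from the model
        td_target = r + gamma * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (td_target - Q[(s, a)])
    return Q
```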
Planning with an Inaccurate Model
Upper bound of model-based RL performance → Optimal policy on the learned MDP
In other words, we’ll only be as good as the model we learnt
What if our model is very inaccurate?
Use model-free RL :)
Reason explicitly about model uncertainty
Integrated Architectures
Two sources of experience:
Real experience: Actually interacting with the environment
Simulated experience: Invoking our learned model and sampling experiences from that
What if we could combine Model-free RL and Model-based RL?
We get the Dyna architecture:
Learn model from real experience
Learn and plan value-function/policy from both real and simulated experience
Dyna-Q Algorithm
Initialize $Q(s,a)$ and $Model(s,a)$ for all $s \in \mathcal{S}, a \in \mathcal{A}$
Repeat indefinitely:
$S \leftarrow$ current (non-terminal) state
$A \leftarrow \epsilon$-greedy$(S, Q)$
Execute $A$; get $R$ and next state $S'$
Do the usual Q-learning update: $Q(S,A) \leftarrow Q(S,A) + \alpha\,[R + \gamma \max_a Q(S',a) - Q(S,A)]$
Update the model (by supervised learning): $Model(S,A) \leftarrow R, S'$
Repeat n times (sampling from the model):
Sample a random previously seen state $S$
Sample a random action $A$ previously taken in $S$
Sample a new reward and next state from the model: $Model(S,A) \to R, S'$
Q-learning step with the simulated reward and next state: $Q(S,A) \leftarrow Q(S,A) + \alpha\,[R + \gamma \max_a Q(S',a) - Q(S,A)]$
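A compact sketch of this loop (the `env.reset()`/`env.step()` interface and the hyperparameters are assumptions, not from the lecture):

```python
import random
from collections import defaultdict

def dyna_q(env, actions, n_steps=10_000, n_planning=5,
           alpha=0.1, gamma=0.95, epsilon=0.1):
    """Dyna-Q: interleave Q-learning on real experience with n_planning
    Q-learning updates on transitions replayed from the learned model."""
    Q = defaultdict(float)
    model = {}                       # (s, a) -> (r, s'): last observed outcome
    s = env.reset()                  # assumed environment interface
    for _ in range(n_steps):
        # epsilon-greedy action selection
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda b: Q[(s, b)])
        s_next, r, done = env.step(a)  # assumed environment interface
        # direct RL: Q-learning on the real transition
        target = r + (0.0 if done else gamma * max(Q[(s_next, b)] for b in actions))
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        # model learning: remember what (s, a) did last time
        model[(s, a)] = (r, s_next)
        # planning: n_planning simulated Q-learning updates from the model
        for _ in range(n_planning):
            (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
            ptarget = pr + gamma * max(Q[(ps_next, b)] for b in actions)
            Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
        s = env.reset() if done else s_next
    return Q
```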
Forward Search
Key idea: Solving the whole MDP is a waste of time - focus on the current state for now and solve the sub-MDP starting from current state.
Build a search tree with current state st as root.
Use a model of the MDP to look ahead.
Simulation-Based Search
Key idea: Forward search + Sampling
Rooted at the current state, build the search tree using simulated experience from the learned model
Then apply your favorite model-free RL algorithm to simulated episodes to get a search algorithm
Monte Carlo control → Monte Carlo search
SARSA → TD search
Simple Monte Carlo Search
Given a model $\mathcal{M}$ and a simulation policy $\pi$
For each action a∈A
Simulate $k$ episodes from the current state $s_t$
$Q(s_t, a) \leftarrow$ mean return of the $k$ episodes
Choose the current (real) action $a_t = \arg\max_{a \in \mathcal{A}} Q(s_t, a)$
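A minimal sketch, assuming a model with a `sample(s, a)` method and a fixed rollout policy; episodes are truncated at a fixed depth for simplicity:

```python
def mc_search(model, actions, s_t, rollout_policy, k=50, depth=20, gamma=1.0):
    """Simple Monte-Carlo search: for each action, simulate k (truncated)
    episodes from the current state under a fixed rollout policy, average
    the returns, and act greedily with respect to those estimates."""

    def rollout(s, a):
        ret, discount = 0.0, 1.0
        for _ in range(depth):
            r, s = model.sample(s, a)
            ret += discount * r
            discount *= gamma
            a = rollout_policy(s)
        return ret

    Q = {a: sum(rollout(s_t, a) for _ in range(k)) / k for a in actions}
    return max(Q, key=Q.get)          # the real action to take now
```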
Monte Carlo Tree Search (MCTS)
Key idea: Run Monte Carlo simulations from the current state and preferentially explore the more promising nodes. Save q-values estimated from this selective sampling and use them to guide the search - we expand the nodes that look most promising. So as not to completely ignore the unpromising parts, add an element of exploration.
Algorithmic idea: Why evaluate Q(s,a) only for the root state $s_t$ - let's do it also for all the (simulated) child states/actions. Build a search tree containing all states and actions visited in simulation, and store a q-value at each node.
In other words: Apply Monte Carlo control to the (simulated) sub-MDP from now.
Converges on the optimal search tree.
Estimate Q by taking the mean of the returns observed from each state-action pair, then act by maximizing over the q-values.
But we don’t have q values for the entire space - only the search tree we have explored so far. So we break the simulation into two phases:
When you're inside the tree (we have q-values for the state): choose the action $a = \arg\max_{a \in \mathcal{A}} Q(s, a)$, improving the tree policy
When you're outside the tree (states we haven't seen before): the policy is fixed - pick actions at random (a default rollout policy)
In every simulation:
Evaluate state-action values Q(S,A) by Monte-Carlo evaluation
Improve the tree policy - by ϵ-greedy
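A minimal sketch of MCTS in this spirit - ϵ-greedy tree policy inside the tree, random default policy outside, Monte-Carlo backups - with all names and the fixed simulation depth being illustrative choices:

```python
import random
from collections import defaultdict

def mcts(model, actions, s_root, n_sims=1000, depth=30, gamma=1.0, epsilon=0.1):
    """MCTS sketch: epsilon-greedy tree policy for states already in the tree,
    uniformly random default policy outside it, Monte-Carlo backups."""
    Q, N = defaultdict(float), defaultdict(int)
    tree = set()                                   # states added to the search tree
    for _ in range(n_sims):
        s, path, rewards = s_root, [], []
        for _ in range(depth):
            if s in tree and random.random() > epsilon:
                a = max(actions, key=lambda b: Q[(s, b)])   # tree policy
            else:
                a = random.choice(actions)                  # default policy
            r, s_next = model.sample(s, a)
            path.append((s, a))
            rewards.append(r)
            s = s_next
        tree.update(s for s, _ in path)            # grow the tree with visited states
        # Monte-Carlo backup: update every (s, a) on the path toward its return
        G = 0.0
        for (s, a), r in zip(reversed(path), reversed(rewards)):
            G = r + gamma * G
            N[(s, a)] += 1
            Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]
    return max(actions, key=lambda a: Q[(s_root, a)])
```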
TD Search - Bootstrapping Our Simulations
Key idea: Apply SARSA to the (simulated) sub-MDP from now.
Why? Because bootstrapping is a great idea when there's a chance you may have reached the state through another path in the tree - you already know something about the return from that state.
All the same benefits over MC search that TD control has over MC control:
Less variance (but more bias)
More efficient
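A minimal sketch of TD search, which swaps the Monte-Carlo backup for a SARSA update after every simulated step (again, names and the fixed depth are illustrative):

```python
import random
from collections import defaultdict

def td_search(model, actions, s_root, n_sims=1000, depth=30,
              alpha=0.1, gamma=1.0, epsilon=0.1):
    """TD search sketch: simulate episodes from the current state and update
    Q with SARSA after every simulated step, bootstrapping from Q(s', a')."""
    Q = defaultdict(float)

    def policy(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda b: Q[(s, b)])

    for _ in range(n_sims):
        s = s_root
        a = policy(s)
        for _ in range(depth):
            r, s_next = model.sample(s, a)
            a_next = policy(s_next)
            # SARSA: bootstrap from the next Q value instead of the full return
            Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
            s, a = s_next, a_next
    return max(actions, key=lambda a: Q[(s_root, a)])
```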
Dyna-2
Key idea: Store two sets of feature weights:
Long-term memory - updated from real experience using TD learning - represents general knowledge about the state that applies to any episode.
Short-term memory - updated from simulated experience using TD search - represents specific local knowledge about the current situation/episode.
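A heavily simplified sketch of the Dyna-2 bookkeeping, assuming a linear value function over features φ(s,a): acting uses the sum of the two memories, real-experience TD updates touch only the long-term weights, and TD-search updates touch only the short-term weights (class and method names are illustrative):

```python
import numpy as np

class Dyna2Memories:
    """Sketch of the Dyna-2 bookkeeping: a long-term and a short-term weight
    vector over the same features; behaviour uses their sum."""

    def __init__(self, n_features, alpha=0.01, beta=0.01):
        self.theta_long = np.zeros(n_features)    # long-term memory
        self.theta_short = np.zeros(n_features)   # short-term memory
        self.alpha, self.beta = alpha, beta       # illustrative step sizes

    def q(self, phi_sa):
        # Act using both memories: Q(s, a) = phi(s, a)^T (theta_long + theta_short)
        return phi_sa @ (self.theta_long + self.theta_short)

    def update_real(self, phi_sa, td_error):
        # TD learning on real experience updates the long-term memory
        self.theta_long += self.alpha * td_error * phi_sa

    def update_simulated(self, phi_sa, td_error):
        # TD search on simulated experience updates the short-term memory
        self.theta_short += self.beta * td_error * phi_sa

    def new_situation(self):
        # Short-term memory is local knowledge; clear it for a new situation
        self.theta_short[:] = 0.0
```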