• Instead of learning a value function or policy - can we learn the model of the environment directly from experience?
    • And then use that model to plan - look ahead and invoke our model to construct a policy/value function for a task.
  • A model tells us about:
    • The state transition probabilities
    • The reward function
  • Model free RL:
    • Learn value function (and/or policy) from experience
    • No explicit model
  • Model based RL:
    • Learn a model from experience
    • Plan value-function (and/or policy) from model

Model Based RL

  • Get some experience → learn a model → use the model to plan → get a value function/policy → use this policy to get more experience.
  • Advantage:
    • Can efficiently learn the model by supervised learning - sometimes the value function is very hard to learn, but the model (the MDP itself) is relatively straightforward.
      • Think about chess - there are far too many possible states to build a value function for directly, which is hard. But a model is basically just the rules of the game - a relatively simple MDP. Then we can do things like tree search to figure out the value function.
  • Disadvantage:
    • Two sources of error - the error in learning the model, and then the error in approximating the value function from it.
  • What is a model?
    • A representation of an MDP, parameterized by η
    • A model M_η = ⟨P_η, R_η⟩ represents state transitions P_η ≈ P and rewards R_η ≈ R:
      • S_{t+1} ~ P_η(S_{t+1} | S_t, A_t)
      • R_{t+1} = R_η(R_{t+1} | S_t, A_t)
    • Typically we assume conditional independence between state transitions and rewards:
      • P[S_{t+1}, R_{t+1} | S_t, A_t] = P[S_{t+1} | S_t, A_t] P[R_{t+1} | S_t, A_t]
  • Goal: Learn the model from experience
    • This is supervised learning
    • Learning s, a → r is a regression problem - can use MSE loss
    • Learning s, a → s' is a density estimation problem - can use KL divergence
    • Find parameters η that minimize the empirical loss (see the sketch after this list)
  • Possible examples of models:
    • Table lookup
    • Linear expectation model
    • Linear Gaussian model
    • Deep Belief Network model
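A minimal sketch of the supervised-learning view above, using a linear expectation model (one of the examples in the list). The function name fit_linear_model, the feature map phi(s, a), and the shape of the data are assumptions for illustration, not anything the notes prescribe.

```python
import numpy as np

# Minimal sketch: model learning as supervised learning with a linear
# expectation model. Assumptions: `data` is a list of (s, a, r, s_next)
# transitions, `phi(s, a)` is a hand-chosen feature vector, and s_next is
# itself a numeric vector.
def fit_linear_model(data, phi):
    X = np.stack([phi(s, a) for s, a, _, _ in data])         # (N, d) features
    r = np.array([rew for _, _, rew, _ in data])              # (N,) rewards
    S_next = np.stack([s2 for _, _, _, s2 in data])           # (N, state_dim)

    # Reward model: regression minimising MSE -> r_hat(s, a) = phi(s, a) @ w_r
    w_r, *_ = np.linalg.lstsq(X, r, rcond=None)

    # Transition model: an expectation model of the next state,
    # E[s' | s, a] = phi(s, a) @ W_p (a full density model would need more,
    # e.g. a Gaussian around this mean).
    W_p, *_ = np.linalg.lstsq(X, S_next, rcond=None)

    return w_r, W_p
```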

Table Lookup Model

  • Keep counts N(s, a) of visits to each state-action pair - estimate the model by the empirical mean reward and the empirical transition probabilities
  • Alternatively,
    • record every experience tuple as a mapping from ⟨S_t, A_t⟩ to ⟨R_{t+1}, S_{t+1}⟩,
    • and during inference, when you are in state s and take action a, simply sample uniformly at random from the outcomes recorded for ⟨s, a⟩.
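A small sketch of the second (experience-replay style) variant of the table lookup model; the class and method names are illustrative.

```python
import random
from collections import defaultdict

# Table-lookup model: store every observed (r, s') outcome per (s, a) and
# sample uniformly at inference time. Sampling uniformly from the recorded
# outcomes reproduces the empirical transition probabilities and mean reward
# in expectation.
class TableLookupModel:
    def __init__(self):
        self.experience = defaultdict(list)   # (s, a) -> list of (r, s')

    def update(self, s, a, r, s_next):
        self.experience[(s, a)].append((r, s_next))

    def sample(self, s, a):
        return random.choice(self.experience[(s, a)])
```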

Planning with This Model

  • Planning = solving the MDP = finding the optimal value-function/policy/trajectory of actions
  • We’ve already seen how to do this given a model:
    • Value iteration
    • Policy iteration
    • Tree search
  • Let’s see a new one - Sample based planning

Sample Based Planning

  • Use the model only to generate samples - then use model-free RL techniques like MC, SARSA, or Q-learning to learn from those samples.
  • Sample-based planning methods are often more efficient than full-width planning, because sampling automatically focuses effort on the more probable transitions instead of enumerating every possibility (see the sketch below).
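A rough sketch of sample-based planning under these assumptions: a learned model exposing a (hypothetical) sample(s, a) method that can be queried at every state-action pair, finite lists of states and actions, and Q-learning run purely on simulated transitions.

```python
import random

# Sample-based planning: treat the learned model as a simulator and run
# ordinary Q-learning on the transitions it generates.
def plan_with_samples(model, states, actions, n_updates=10_000,
                      alpha=0.1, gamma=0.99):
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(n_updates):
        s = random.choice(states)
        a = random.choice(actions)
        r, s_next = model.sample(s, a)            # simulated experience only
        best_next = max(Q.get((s_next, b), 0.0) for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q
```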

Planning with an Inaccurate Model

  • Upper bound of model-based RL performance = the optimal policy on the learned MDP
    • In other words, we’ll only be as good as the model we learnt
  • What if our model is very inaccurate?
    • Use model-free RL :)
    • Reason explicitly about model uncertainty

Integrated Architectures

  • Two sources of experience:
    • Real experience: Actually interacting with the environment
    • Simulated experience: Invoking our learned model and sampling experiences from that
  • What if we could combine Model-free RL and Model-based RL?
  • We get the Dyna architecture:
    • Learn model from real experience
    • Learn and plan value-function/policy from both real and simulated experience

Dyna-Q Algorithm

  • Initialize Q(s, a) and Model(s, a) for all s ∈ S and a ∈ A(s)
  • Repeat indefinitely:
    1. S ← current (non-terminal) state
    2. A ← ε-greedy(S, Q)
    3. Execute A; observe reward R and next state S'
    4. Do the usual Q-learning update: Q(S, A) ← Q(S, A) + α[R + γ max_a Q(S', a) − Q(S, A)]
    5. Update the model (by supervised learning): Model(S, A) ← R, S' (assuming a deterministic environment)
    6. Repeat n times (sampling from the model):
      1. Sample a random previously seen state S
      2. Sample a random action A taken previously in S
      3. Sample the reward and next state from the model: R, S' ← Model(S, A)
      4. Q-learning step with the simulated reward and state: Q(S, A) ← Q(S, A) + α[R + γ max_a Q(S', a) − Q(S, A)]
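A sketch of the tabular Dyna-Q loop above. The env interface (reset() returning a state, step(a) returning (next_state, reward, done)) is an assumption for illustration; the model is the simple deterministic table from step 5.

```python
import random
from collections import defaultdict

# Tabular Dyna-Q: interleave real Q-learning updates with n planning updates
# drawn from a learned deterministic table model.
def dyna_q(env, actions, n_steps=10_000, n_planning=5,
           alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = defaultdict(float)                       # Q[(s, a)]
    model = {}                                   # (s, a) -> (r, s')
    s = env.reset()

    def eps_greedy(state):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(n_steps):
        a = eps_greedy(s)                                          # step 2
        s_next, r, done = env.step(a)                              # step 3
        target = r if done else r + gamma * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])                  # step 4 (real)
        model[(s, a)] = (r, s_next)                                # step 5
        for _ in range(n_planning):                                # step 6 (simulated)
            ps, pa = random.choice(list(model))                    # seen (s, a)
            pr, ps_next = model[(ps, pa)]
            ptarget = pr + gamma * max(Q[(ps_next, b)] for b in actions)
            Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
        s = env.reset() if done else s_next
    return Q
```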

Forward Search and Simulation-Based Search

  • Key idea: Solving the whole MDP is a waste of time - focus on the current state for now and solve the sub-MDP starting from the current state.
  • Build a search tree with current state as root.
  • Use a model of the MDP to look ahead.
  • Key idea: Forward search + Sampling
  • Rooted at the current state, build the search tree using simulated experience from the learned model
  • Then apply your favorite model-free RL algorithm to simulated episodes to get a search algorithm
    • Monte Carlo control → Monte Carlo search
    • SARSA → TD search

Simple Monte-Carlo Search

  • Given a model M_η and a simulation policy π
  • For each action a ∈ A:
    • Simulate K episodes from the current (real) state s_t using M_η and π
    • Q(s_t, a) ← mean return of the K episodes (Monte-Carlo evaluation)
  • Choose the current (real) action a_t = argmax_a Q(s_t, a)
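A sketch of simple Monte-Carlo search as described above, assuming a model with a sample(s, a) method, a fixed simulation_policy function, and simulated episodes truncated at a fixed horizon.

```python
import random

# Simple Monte-Carlo search: score each action by the mean return of K
# simulated rollouts from the current state, then act greedily.
def mc_search(model, simulation_policy, actions, s_t,
              K=50, gamma=1.0, horizon=100):
    def rollout(s, a):
        total, discount = 0.0, 1.0
        for _ in range(horizon):
            r, s = model.sample(s, a)
            total += discount * r
            discount *= gamma
            a = simulation_policy(s)     # fixed policy after the first action
        return total

    Q = {a: sum(rollout(s_t, a) for _ in range(K)) / K for a in actions}
    return max(Q, key=Q.get)             # a_t = argmax_a Q(s_t, a)
```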

Monte Carlo Tree Search (MCTS)

  • Key idea: Do Monte Carlo simulations from the current state, then explore all possibilities for the more promising nodes. Save the Q-values based on this sort of selective sampling. Use the Q-values from this sampling to guide our search - we expand the nodes which are most promising. So as to not completely ignore the un-promising parts, add an element of exploration.
  • Algorithmic idea: Why just evaluate Q for the root state - let’s do it also for all the (simulated) child states/actions. Build a search tree containing all visited (in simulation) states and actions and store Q-values at each node.
    • In other words: Apply Monte Carlo control to the (simulated) sub-MDP from now.
  • Converges on the optimal search tree.
  • Estimate Q(s, a) by the mean of the returns observed from that state-action pair, then maximize over these values to pick actions.
  • But we don’t have values for the entire space - only the search tree we have explored so far. So we break the simulation into two phases:
    • When you’re in the tree (have Q-values for the state): tree policy - choose actions to maximize Q(S, A), and improve this policy as the Q estimates improve
    • When you’re outside the tree (haven’t seen these states before): default (rollout) policy is fixed - pick actions randomly
  • In every simulation:
    • Evaluate states by Monte-Carlo evaluation
    • Improve the tree policy, e.g. by ε-greedy(Q)
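A compact sketch of the two-phase simulation described above (tree policy inside the tree, random rollout outside), again assuming a model.sample(s, a) simulator, hashable states, and a fixed simulation horizon instead of explicit terminal states.

```python
import random
from collections import defaultdict

# MCTS sketch: inside the tree act epsilon-greedily w.r.t. stored Q-values
# (tree policy), outside the tree act randomly (default/rollout policy),
# expand one new node per simulation, and back up mean returns.
def mcts(model, actions, root, n_simulations=1000,
         gamma=1.0, horizon=50, epsilon=0.1):
    N = defaultdict(int)      # visit counts N(s, a)
    Q = defaultdict(float)    # mean simulated return Q(s, a)
    tree = {root}             # states whose Q-values are stored in the tree

    def tree_policy(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(n_simulations):
        s, path, rewards, expanded = root, [], [], False
        for _ in range(horizon):
            in_tree = s in tree
            if not in_tree and not expanded:
                tree.add(s)                  # expand one new node per simulation
                in_tree, expanded = True, True
            a = tree_policy(s) if in_tree else random.choice(actions)
            if in_tree:
                path.append((s, a, len(rewards)))   # remember depth for the return
            r, s = model.sample(s, a)
            rewards.append(r)
        # Backup: update every in-tree (s, a) on the path towards its sampled return.
        for s_i, a_i, depth in path:
            G = sum(gamma ** (k - depth) * rewards[k]
                    for k in range(depth, len(rewards)))
            N[(s_i, a_i)] += 1
            Q[(s_i, a_i)] += (G - Q[(s_i, a_i)]) / N[(s_i, a_i)]

    return max(actions, key=lambda a: Q[(root, a)])   # current (real) action
```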

TD Search - Bootstrapping Our Simulations

  • Key idea: Apply SARSA to the (simulated) sub-MDP from now.
  • Why? Cuz bootstrapping is a great idea when there’s a chance you may have visited the state through another path in the tree - so you already know something about the return from that state.
  • All the same benefits over MC search as that of TD control over MC control:
    • Less variance (but more bias)
    • More efficient
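A sketch of TD search: SARSA run on simulated episodes rooted at the current state, with the same assumed model.sample(s, a) interface and fixed horizon as above.

```python
import random
from collections import defaultdict

# TD search: bootstrap from Q at every simulated step instead of waiting for
# the full Monte-Carlo return.
def td_search(model, actions, root, n_episodes=500,
              alpha=0.1, gamma=1.0, horizon=50, epsilon=0.1):
    Q = defaultdict(float)

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(n_episodes):
        s, a = root, eps_greedy(root)
        for _ in range(horizon):
            r, s_next = model.sample(s, a)
            a_next = eps_greedy(s_next)
            # SARSA update: bootstrap from the next state-action value.
            Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
            s, a = s_next, a_next

    return max(actions, key=lambda a: Q[(root, a)])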

Dyna-2

  • Key idea: Store two sets of feature weights:
    • Long-term memory - updated from real experience using TD learning - represents general knowledge about the state that applies to any episode.
    • Short-term memory - updated from simulated experience using TD search - represents specific local knowledge about the current situation/episode.
  • Our value function is the sum of these two functions.
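A tiny sketch of the Dyna-2 value function: linear in assumed features phi(s, a), with long-term weights updated from real experience and short-term weights updated from simulated experience; the value used for acting is their sum.

```python
import numpy as np

# Dyna-2 value function sketch. Assumptions: phi(s, a) returns a NumPy feature
# vector; theta (long-term memory) and theta_bar (short-term memory) are weight
# vectors of the same length, updated elsewhere by TD learning and TD search.
def dyna2_value(phi, theta, theta_bar, s, a):
    x = phi(s, a)
    return x @ (theta + theta_bar)    # Q(s, a) = Q_long(s, a) + Q_short(s, a)
```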