- What if we don’t know exactly how the environment behaves? We know it’s an MDP, but we’re not given the parameters governing the MDP - the state transition probabilities and rewards. This is the model-free case.
- Two major classes for model-free prediction:
- Monte Carlo: Sample complete trajectories and estimate expected returns by their empirical mean.
- Temporal-Difference Learning: Take a one-step lookahead and estimate the remainder of the return recursively from the current value estimates.
Monte Carlo Learning
- Learn directly from complete episodes of experience.
- Sample a bunch of episodes and calculate average sample returns per state to get the value functions.
- So $v_\pi(s)$ = empirical mean of the (discounted) return $G_t$ observed from state $s$ over the sampled episodes under policy $\pi$.
- Can only be applied to episodic tasks - episodes must terminate.
- Two ways to do it:
- First-visit Monte Carlo: Take the return only from the point of the first visit to a state until the end of the episode - even if the trajectory loops back to that state. Take the mean of these sampled returns over many episodes.

```python
def first_visit_mc(states, episodes):
    # pseudocode: each episode exposes nextState() to iterate over the states it
    # visits in order, and getTotalReturnFromHere() for the return from that point
    n = [0 for s in states]   # first-visit counts per state
    v = [0 for s in states]   # summed first-visit returns per state
    for e in episodes:
        visited = [0 for s in states]
        for s in e.nextState():
            if visited[s] == 0:                      # visiting for the first time this episode
                v[s] += e.getTotalReturnFromHere()   # accumulate the return, don't overwrite
                n[s] += 1
                visited[s] = 1                       # mark as visited
    return [vs / ns if ns > 0 else 0.0 for vs, ns in zip(v, n)]   # the value function
```
- Every-visit Monte Carlo: If there are loops in an episode’s trajectory, increment the counter and accumulate the return at every visit to that state. Then take the mean as usual.

```python
def every_visit_mc(states, episodes):
    # pseudocode: same hypothetical episode interface as above
    n = [0 for s in states]   # visit counts per state (every visit counted)
    v = [0 for s in states]   # summed returns per state
    for e in episodes:
        for s in e.nextState():
            v[s] += e.getTotalReturnFromHere()   # accumulate at every visit
            n[s] += 1
    return [vs / ns if ns > 0 else 0.0 for vs, ns in zip(v, n)]   # the value function
```
Incremental Monte Carlo
- Target = the next sample of the quantity whose mean we are trying to estimate. Here it’s $G_t$, the actual return observed for the state in this episode.
- Estimate of the value function = $V(s)$, the running mean of the returns sampled from state $s$ so far.
- Can also write it as $V_k(s)$, where $k$ denotes that we are updating our estimate for the $k$-th time.
- Or $V(S_t)$ to denote the estimate for the state occurring at timestep $t$ of the current episode.
- Error = Target - Current Mean = $G_t - V(S_t)$
- Means can be written in incremental form as $\mu_k = \mu_{k-1} + \frac{1}{k}(x_k - \mu_{k-1})$, which for Monte Carlo gives $N(S_t) \leftarrow N(S_t) + 1$ and $V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)}\big(G_t - V(S_t)\big)$.
- To give more importance to recent returns (if, for example, the problem is non-stationary), we can make the step size a constant $\alpha$ to get an exponential moving average: $V(S_t) \leftarrow V(S_t) + \alpha\big(G_t - V(S_t)\big)$.
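A minimal sketch of both incremental updates, assuming returns arrive as hypothetical `(state, G)` pairs collected at the end of each episode; passing a constant `alpha` switches from the exact running mean to the exponential moving average:

```python
from collections import defaultdict

V = defaultdict(float)   # value estimates V(s)
N = defaultdict(int)     # visit counts N(s)

def incremental_mc_update(episode_returns, alpha=None):
    """episode_returns: list of (state, G) pairs from one finished episode."""
    for s, G in episode_returns:
        N[s] += 1
        # 1/N(s) gives the exact running mean; a constant alpha gives an
        # exponential moving average that favours recent episodes.
        step = alpha if alpha is not None else 1.0 / N[s]
        V[s] += step * (G - V[s])
```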
Temporal-Difference Learning
- Temporal = time: We learn from the difference between values at successive timesteps.
- TD learns from incomplete episodes - it bootstraps.
- We do a one-step lookahead and then substitute our own current estimate for the remainder of the trajectory - we update a guess using a guess, so the resulting estimate is biased.
- Think of TD as DP with sampling. Because:
- Like Monte Carlo, it uses sample returns to estimate $v_\pi$
- Like DP, it uses bootstrapping, i.e., uses an existing estimate to calculate a new estimate
- Alternatively, think of TD as doing piecewise updates to the estimates instead of waiting for the entire picture to emerge. We update our estimates with whatever information we can as soon as we have it.
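A minimal sketch of tabular TD(0) prediction in this spirit, assuming a hypothetical `env` with `reset()`/`step(action)` returning `(next_state, reward, done)`, a fixed `policy(state)`, and a dict `V` of current estimates:

```python
def td0_prediction(env, policy, V, alpha=0.1, gamma=1.0, n_episodes=1000):
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            s_next, r, done = env.step(policy(s))
            # bootstrap: substitute our own estimate V(s_next) for the rest of the trajectory
            target = r if done else r + gamma * V[s_next]
            V[s] += alpha * (target - V[s])   # update immediately, mid-episode
            s = s_next
    return V
```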
TD vs. Monte Carlo
| | Monte Carlo | TD(0) - one-step TD |
|---|---|---|
| Target | $G_t$ - the actual return at the end of the episode | $R_{t+1} + \gamma V(S_{t+1})$ - current timestep reward + discounted estimate for the next state |
| Error | $G_t - V(S_t)$ | $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ |
| When do updates happen? | Needs to wait until episode completion. | Can learn after every single timestep. |
| When can we use it? | Works for episodic tasks only. | Works for episodic as well as continuing tasks. |
| Bias/variance | High variance (noise accumulates over many state transitions and rewards), zero bias | High bias (the target contains our own guess), low variance (only one noisy state transition and reward) |
| Convergence | Good convergence | TD(0) provably converges to $v_\pi$, but TD($\lambda$) can be unstable |
| Converges with function approximators? | Yes | Not always |
| Efficiency? | Less efficient | Usually converges faster in practice, though this hasn’t been proven formally. |
| Markov property | Does not exploit the Markov property - which also means it does not need the process to be Markov. | Exploits the Markov property and is therefore more efficient in Markov environments. |
Batch Updating
- Instead of applying updates at every step (and thus updating our estimates as we go), we can compute the increments - $\alpha\,(G_t - V(S_t))$ for Monte Carlo, or $\alpha\,(R_{t+1} + \gamma V(S_{t+1}) - V(S_t))$ for TD - at each step, but apply them only after a batch of data (a bunch of timesteps or episodes). For TD this means the error calculation does not use the very latest estimate; for Monte Carlo it means the value function estimate is not updated after every episode. This is called batch updating.
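A minimal sketch of batch updating for TD(0), assuming a hypothetical `batch` of `(s, r, s_next, done)` transitions and a dict `V` that already contains every state; increments are computed against the frozen estimates and applied together at the end:

```python
def batch_td0_update(V, batch, alpha=0.1, gamma=1.0):
    increments = {s: 0.0 for s in V}
    for s, r, s_next, done in batch:
        target = r if done else r + gamma * V[s_next]
        increments[s] += alpha * (target - V[s])   # uses the old estimates throughout
    for s, delta in increments.items():
        V[s] += delta                              # apply once per batch
    return V
```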
Certainty Equivalence Estimate
- Batch Monte Carlo minimizes the mean squared error on the training set.
- Batch TD(0) finds the estimates that would be exactly correct for the maximum-likelihood model of the underlying Markov process. In some sense it builds the Markov process most likely to have generated the data and solves it.
- This is called the certainty-equivalence estimate.
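A minimal sketch of this certainty-equivalence idea for a small tabular Markov reward process, assuming hypothetical batch data of `(s, r, s_next, done)` transitions with integer state indices: fit the maximum-likelihood transition matrix and mean rewards, then solve the Bellman equations of that fitted model exactly.

```python
import numpy as np

def certainty_equivalence_values(batch, n_states, gamma=1.0):
    counts = np.zeros((n_states, n_states))   # transition counts N(s, s')
    reward_sum = np.zeros(n_states)           # summed rewards observed leaving s
    visits = np.zeros(n_states)               # visit counts N(s)

    for s, r, s_next, done in batch:
        visits[s] += 1
        reward_sum[s] += r
        if not done:                          # terminal transitions contribute no successor
            counts[s, s_next] += 1

    # Maximum-likelihood model: empirical transition matrix and mean rewards.
    P = np.divide(counts, visits[:, None],
                  out=np.zeros_like(counts), where=visits[:, None] > 0)
    R = np.divide(reward_sum, visits,
                  out=np.zeros_like(reward_sum), where=visits > 0)

    # Solve v = R + gamma * P v for the fitted model
    # (solvable here because episodes terminate, so P is sub-stochastic).
    return np.linalg.solve(np.eye(n_states) - gamma * P, R)
```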
Bootstrap/Sampling Grid
- DP: Bootstrapping but no sampling - full-width, shallow backups
- MC: Sampling but no bootstrapping - deep, sampled backups
- TD: Bootstrapping and sampling - shallow, sampled backups
- Exhaustive search: No bootstrapping, no sampling - full-width, deep backups
TD($\lambda$)
- We saw TD with one step of reality (one step of lookahead) followed by a backup with the estimate $V(S_{t+1})$
- Why not 2 steps of reality? $n$ steps of reality?
- Turns out we can and that works.
- In the limit $n \to \infty$ (looking all the way to the end of the episode) it becomes Monte Carlo.
$n$-step Returns
- $G_t^{(1)} = R_{t+1} + \gamma V(S_{t+1})$, $\;G_t^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 V(S_{t+2})$, …and so on
- $n$-step return: $G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})$ (see the sketch below)
- $n$-step TD learning: $V(S_t) \leftarrow V(S_t) + \alpha\big(G_t^{(n)} - V(S_t)\big)$
- Which $n$ is the best?
- While an intermediate $n$ is usually better than either extreme, no single $n$ is always best - in fact, the optimal $n$ depends on the amount of data and on whether learning is online or offline.
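A minimal sketch of computing the $n$-step return from one recorded episode, assuming hypothetical lists where `rewards[t]` is $R_{t+1}$ and `states[t]` is $S_t$, with `V` the current value estimates:

```python
def n_step_return(rewards, states, V, t, n, gamma=1.0):
    T = len(rewards)                      # episode length
    G, discount = 0.0, 1.0
    for k in range(t, min(t + n, T)):     # up to n real rewards
        G += discount * rewards[k]
        discount *= gamma
    if t + n < T:                         # bootstrap if the episode hasn't ended yet
        G += discount * V[states[t + n]]
    return G
```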
- What if we averaged some $n$-step returns? Say averaging two of them, e.g. the 2-step and 4-step returns, $\frac{1}{2}G^{(2)} + \frac{1}{2}G^{(4)}$?
- Gives better estimates than either one individually.
- Can we combine all such $n$-step returns?
- We can - we can take a geometrically weighted average of all the $n$-step returns, the $\lambda$-return (sketched below): $G_t^\lambda = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$
- We use geometric weightings because of their memoryless property - which makes computing TD($\lambda$) possible without storing anything extra.
- The leading $(1-\lambda)$ factor is there to make the coefficients sum to $1$.
- If the episode terminates, the terminating step is given a weight equal to the remaining budget of weights we had for the rest of the infinite sequence - so that the sum is still $1$.
- If the terminating timestep is $T$, the weight on the final return $G_t$ is $\lambda^{T-t-1}$.
- Forward-view TD($\lambda$): $V(S_t) \leftarrow V(S_t) + \alpha\big(G_t^\lambda - V(S_t)\big)$
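A minimal sketch of the forward-view $\lambda$-return, reusing the hypothetical `n_step_return` helper sketched above; every $n$-step return gets weight $(1-\lambda)\lambda^{n-1}$ and the leftover weight $\lambda^{T-t-1}$ goes to the final (Monte Carlo) return:

```python
def lambda_return(rewards, states, V, t, lam, gamma=1.0):
    T = len(rewards)
    G_lambda = 0.0
    for n in range(1, T - t):             # n-step returns that still bootstrap
        G_lambda += (1 - lam) * lam ** (n - 1) * n_step_return(rewards, states, V, t, n, gamma)
    # the terminating step absorbs the remaining weight so the coefficients sum to 1
    G_lambda += lam ** (T - t - 1) * n_step_return(rewards, states, V, t, T - t, gamma)
    return G_lambda
```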
- Eligibility Traces
- How do we assign credit to events for the reward we just received? Is it all because of our most recent state, or is it due to the most frequently occurring state in the sequence? How do we decide how far back to assign credit for the reward we just received?
- Eligibility traces combine the frequency heuristic and the recency heuristic.
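A minimal sketch of the backward-view TD($\lambda$) update with the standard accumulating eligibility trace, $E_t(s) = \gamma\lambda E_{t-1}(s) + \mathbf{1}(S_t = s)$, assuming the same hypothetical `env`/`policy` interface as in the TD(0) sketch above:

```python
def backward_td_lambda_episode(env, policy, V, alpha=0.1, gamma=1.0, lam=0.9):
    E = {s: 0.0 for s in V}               # eligibility traces, reset at episode start
    s = env.reset()
    done = False
    while not done:
        s_next, r, done = env.step(policy(s))
        delta = (r if done else r + gamma * V[s_next]) - V[s]   # one-step TD error
        E[s] += 1.0                        # frequency + recency bump for the visited state
        for x in V:
            V[x] += alpha * delta * E[x]   # credit the TD error in proportion to each trace
            E[x] *= gamma * lam            # all traces decay every timestep
        s = s_next
    return V
```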