- What if we don’t know exactly how the environment behaves? We know it’s an MDP, but we’re not given the parameters governing the MDP - the state transition probabilities and rewards. This is the model-free case.
- Two major classes for model-free prediction:
- Monte Carlo: Sample complete trajectories and estimate expected returns by their empirical mean.
- Temporal-Difference Learning: Take a one-step lookahead and estimate the remainder of the return recursively from the current value estimates.
Monte Carlo Learning
- Learn directly from complete episodes of experience.
- Sample a bunch of episodes and calculate average sample returns per state to get the value functions.
- So $v_\pi(s)$ = empirical mean of the (discounted) return $G_t$ observed from state $s$ over the sampled episodes under policy $\pi$.
- Can only be applied to episodic tasks - episodes must terminate.
- Two ways to do it:
- First-visit Monte Carlo: Take the return only from the point of the first visit to a state until the end of the episode - even if the trajectory loops back to that state. Take the mean of these sampled returns over many episodes.

```python
def first_visit_mc(states, episodes):
    # pseudocode: each episode exposes nextState() to iterate over the states it
    # visits in order, and getTotalReturnFromHere() for the return from that point
    n = [0 for s in states]   # first-visit counts per state
    v = [0 for s in states]   # summed first-visit returns per state
    for e in episodes:
        visited = [0 for s in states]
        for s in e.nextState():
            if visited[s] == 0:                      # visiting for the first time this episode
                v[s] += e.getTotalReturnFromHere()   # accumulate the return, don't overwrite
                n[s] += 1
                visited[s] = 1                       # mark as visited
    return [vs / ns if ns > 0 else 0.0 for vs, ns in zip(v, n)]   # the value function
```
- Every-visit Monte Carlo: If there are loops in an episode’s trajectory, increment the counter and accumulate the return at every visit to that state. Then take the mean as usual.

```python
def every_visit_mc(states, episodes):
    # pseudocode: same hypothetical episode interface as above
    n = [0 for s in states]   # visit counts per state (every visit counted)
    v = [0 for s in states]   # summed returns per state
    for e in episodes:
        for s in e.nextState():
            v[s] += e.getTotalReturnFromHere()   # accumulate at every visit
            n[s] += 1
    return [vs / ns if ns > 0 else 0.0 for vs, ns in zip(v, n)]   # the value function
```
Incremental Monte Carlo
- Target = the next sample of the quantity whose mean we are trying to estimate. Here it’s $G_t$, the actual return observed for the state in this episode.
- Estimate of the value function = $V(s)$, the running mean of the returns sampled from state $s$ so far.
- Can also write it as $V_k(s)$, where $k$ denotes that we are updating our estimate for the $k$-th time.
- Or $V(S_t)$ to denote the estimate for the state occurring at timestep $t$ of the current episode.
- Error = Target - Current Mean = $G_t - V(S_t)$
- Means can be written in incremental form as $\mu_k = \mu_{k-1} + \frac{1}{k}(x_k - \mu_{k-1})$, which for Monte Carlo gives $N(S_t) \leftarrow N(S_t) + 1$ and $V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)}\big(G_t - V(S_t)\big)$.
- To give more importance to recent returns (if, for example, the problem is non-stationary), we can make the step size a constant $\alpha$ to get an exponential moving average: $V(S_t) \leftarrow V(S_t) + \alpha\big(G_t - V(S_t)\big)$.
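A minimal sketch of both incremental updates, assuming returns arrive as hypothetical `(state, G)` pairs collected at the end of each episode; passing a constant `alpha` switches from the exact running mean to the exponential moving average:

```python
from collections import defaultdict

V = defaultdict(float)   # value estimates V(s)
N = defaultdict(int)     # visit counts N(s)

def incremental_mc_update(episode_returns, alpha=None):
    """episode_returns: list of (state, G) pairs from one finished episode."""
    for s, G in episode_returns:
        N[s] += 1
        # 1/N(s) gives the exact running mean; a constant alpha gives an
        # exponential moving average that favours recent episodes.
        step = alpha if alpha is not None else 1.0 / N[s]
        V[s] += step * (G - V[s])
```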
Temporal-Difference Learning
- Temporal = time: We learn from the difference between values at successive timesteps.
- TD learns from incomplete episodes - it bootstraps.
- We do a one-step lookahead and then substitute our own current estimate for the remainder of the trajectory - we update a guess using a guess, so the resulting estimate is biased.
- Think of TD as DP with sampling. Because:
- Like Monte Carlo, it uses sample returns to estimate $v_\pi$
- Like DP, it uses bootstrapping, i.e., uses an existing estimate to calculate a new estimate
- Alternatively, think of TD as doing piecewise updates to the estimates instead of waiting for the entire picture to emerge. We update our estimates with whatever information we can as soon as we have it.
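A minimal sketch of tabular TD(0) prediction in this spirit, assuming a hypothetical `env` with `reset()`/`step(action)` returning `(next_state, reward, done)`, a fixed `policy(state)`, and a dict `V` of current estimates:

```python
def td0_prediction(env, policy, V, alpha=0.1, gamma=1.0, n_episodes=1000):
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            s_next, r, done = env.step(policy(s))
            # bootstrap: substitute our own estimate V(s_next) for the rest of the trajectory
            target = r if done else r + gamma * V[s_next]
            V[s] += alpha * (target - V[s])   # update immediately, mid-episode
            s = s_next
    return V
```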
TD vs. Monte Carlo
| | Monte Carlo | TD(0) - one-step TD |
|---|---|---|
| Target | $G_t$ - the actual return at the end of the episode | $R_{t+1} + \gamma V(S_{t+1})$ - current timestep reward + discounted estimate for the next state |
| Error | $G_t - V(S_t)$ | $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ |
| When do updates happen? | Needs to wait until episode completion. | Can learn after every single timestep. |
| When can we use it? | Works for episodic tasks only. | Works for episodic as well as continuing tasks. |
| Bias/variance | High variance (noise accumulates over many state transitions and rewards), zero bias | High bias (the target contains our own guess), low variance (only one noisy state transition and reward) |
| Convergence | Good convergence | TD(0) provably converges to $v_\pi$, but TD($\lambda$) can be unstable |
| Converges with function approximators? | Yes | Not always |
| Efficiency? | Less efficient | Usually converges faster in practice, though this hasn’t been proven formally. |
| Markov property | Does not exploit the Markov property - which also means it does not need the process to be Markov. | Exploits the Markov property and is therefore more efficient in Markov environments. |
Batch Updating
- Instead of applying updates at every step (and thus updating our estimates as we go), we can compute the increments - $\alpha\,(G_t - V(S_t))$ for Monte Carlo, or $\alpha\,(R_{t+1} + \gamma V(S_{t+1}) - V(S_t))$ for TD - at each step, but apply them only after a batch of data (a bunch of timesteps or episodes). For TD this means the error calculation does not use the very latest estimate; for Monte Carlo it means the value function estimate is not updated after every episode. This is called batch updating.
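A minimal sketch of batch updating for TD(0), assuming a hypothetical `batch` of `(s, r, s_next, done)` transitions and a dict `V` that already contains every state; increments are computed against the frozen estimates and applied together at the end:

```python
def batch_td0_update(V, batch, alpha=0.1, gamma=1.0):
    increments = {s: 0.0 for s in V}
    for s, r, s_next, done in batch:
        target = r if done else r + gamma * V[s_next]
        increments[s] += alpha * (target - V[s])   # uses the old estimates throughout
    for s, delta in increments.items():
        V[s] += delta                              # apply once per batch
    return V
```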
Certainty Equivalence Estimate
- Batch Monte Carlo minimizes the mean squared error on the training set.
- Batch TD(0) finds the estimates that would be exactly correct for the maximum-likelihood model of the underlying Markov process. In some sense it builds the Markov process most likely to have generated the data and solves it.
- This is called the certainty-equivalence estimate.
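A minimal sketch of this certainty-equivalence idea for a small tabular Markov reward process, assuming hypothetical batch data of `(s, r, s_next, done)` transitions with integer state indices: fit the maximum-likelihood transition matrix and mean rewards, then solve the Bellman equations of that fitted model exactly.

```python
import numpy as np

def certainty_equivalence_values(batch, n_states, gamma=1.0):
    counts = np.zeros((n_states, n_states))   # transition counts N(s, s')
    reward_sum = np.zeros(n_states)           # summed rewards observed leaving s
    visits = np.zeros(n_states)               # visit counts N(s)

    for s, r, s_next, done in batch:
        visits[s] += 1
        reward_sum[s] += r
        if not done:                          # terminal transitions contribute no successor
            counts[s, s_next] += 1

    # Maximum-likelihood model: empirical transition matrix and mean rewards.
    P = np.divide(counts, visits[:, None],
                  out=np.zeros_like(counts), where=visits[:, None] > 0)
    R = np.divide(reward_sum, visits,
                  out=np.zeros_like(reward_sum), where=visits > 0)

    # Solve v = R + gamma * P v for the fitted model
    # (solvable here because episodes terminate, so P is sub-stochastic).
    return np.linalg.solve(np.eye(n_states) - gamma * P, R)
```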
Bootstrap/Sampling Grid
- DP: Bootstrapping but no sampling - full-width, shallow backups
- MC: Sampling but no bootstrapping - deep, sampled backups
- TD: Bootstrapping and sampling - shallow, sampled backups
- Exhaustive search: No bootstrapping, no sampling - full-width, deep backups
TD($\lambda$)
- We saw TD with one step of reality (one step of lookahead) followed by a backup with the estimate $V(S_{t+1})$
- Why not 2 steps of reality? $n$ steps of reality?
- Turns out we can and that works.
- In the limit $n \to \infty$ (looking all the way to the end of the episode) it becomes Monte Carlo.
$n$-step Returns
- $G_t^{(1)} = R_{t+1} + \gamma V(S_{t+1})$, $\;G_t^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 V(S_{t+2})$, …and so on
- $n$-step return: $G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})$ (see the sketch below)
- $n$-step TD learning: $V(S_t) \leftarrow V(S_t) + \alpha\big(G_t^{(n)} - V(S_t)\big)$
- Which $n$ is the best?
- While an intermediate $n$ is usually better than either extreme, no single $n$ is always best - in fact, the optimal $n$ depends on the amount of data and on whether learning is online or offline.
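A minimal sketch of computing the $n$-step return from one recorded episode, assuming hypothetical lists where `rewards[t]` is $R_{t+1}$ and `states[t]` is $S_t$, with `V` the current value estimates:

```python
def n_step_return(rewards, states, V, t, n, gamma=1.0):
    T = len(rewards)                      # episode length
    G, discount = 0.0, 1.0
    for k in range(t, min(t + n, T)):     # up to n real rewards
        G += discount * rewards[k]
        discount *= gamma
    if t + n < T:                         # bootstrap if the episode hasn't ended yet
        G += discount * V[states[t + n]]
    return G
```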
- What if we averaged some $n$-step returns? Say averaging two of them, e.g. the 2-step and 4-step returns, $\frac{1}{2}G^{(2)} + \frac{1}{2}G^{(4)}$?
- Gives better estimates than either one individually.
- Can we combine all such $n$-step returns?
- We can - we can take a geometrically weighted average of all the $n$-step returns, the $\lambda$-return (sketched below): $G_t^\lambda = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$
- We use geometric weightings because of their memoryless property - which makes computing TD($\lambda$) possible without storing anything extra.
- The leading $(1-\lambda)$ factor is there to make the coefficients sum to $1$.
- If the episode terminates, the terminating step is given a weight equal to the remaining budget of weights we had for the rest of the infinite sequence - so that the sum is still $1$.
- If the terminating timestep is $T$, the weight on the final return $G_t$ is $\lambda^{T-t-1}$.
- Forward-view TD($\lambda$): $V(S_t) \leftarrow V(S_t) + \alpha\big(G_t^\lambda - V(S_t)\big)$
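A minimal sketch of the forward-view $\lambda$-return, reusing the hypothetical `n_step_return` helper sketched above; every $n$-step return gets weight $(1-\lambda)\lambda^{n-1}$ and the leftover weight $\lambda^{T-t-1}$ goes to the final (Monte Carlo) return:

```python
def lambda_return(rewards, states, V, t, lam, gamma=1.0):
    T = len(rewards)
    G_lambda = 0.0
    for n in range(1, T - t):             # n-step returns that still bootstrap
        G_lambda += (1 - lam) * lam ** (n - 1) * n_step_return(rewards, states, V, t, n, gamma)
    # the terminating step absorbs the remaining weight so the coefficients sum to 1
    G_lambda += lam ** (T - t - 1) * n_step_return(rewards, states, V, t, T - t, gamma)
    return G_lambda
```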
- Eligibility Traces
- How do we assign credit to events for the reward we just received? Is it all because of our most recent state, or is it due to the most frequently occurring state in the sequence? How do we decide how far back to assign credit for the reward we just received?
- Eligibility traces combine the frequency heuristic and the recency heuristic.
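A minimal sketch of the backward-view TD($\lambda$) update with the standard accumulating eligibility trace, $E_t(s) = \gamma\lambda E_{t-1}(s) + \mathbf{1}(S_t = s)$, assuming the same hypothetical `env`/`policy` interface as in the TD(0) sketch above:

```python
def backward_td_lambda_episode(env, policy, V, alpha=0.1, gamma=1.0, lam=0.9):
    E = {s: 0.0 for s in V}               # eligibility traces, reset at episode start
    s = env.reset()
    done = False
    while not done:
        s_next, r, done = env.step(policy(s))
        delta = (r if done else r + gamma * V[s_next]) - V[s]   # one-step TD error
        E[s] += 1.0                        # frequency + recency bump for the visited state
        for x in V:
            V[x] += alpha * delta * E[x]   # credit the TD error in proportion to each trace
            E[x] *= gamma * lam            # all traces decay every timestep
        s = s_next
    return V
```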