• For most real problems, the numbers of states and actions are too large to build state/action value function “tables” that store a value for each state (or state-action pair).
    • Even if we could store them, processing them would be too slow / intractable.
  • Can we build approximate functions for the state/action value functions? Yes - NNs are universal function approximators.
  • We are going to build a parametric function $\hat{v}(s, \mathbf{w})$ with weights $\mathbf{w}$ that approximates $v_\pi(s)$ (and similarly $\hat{q}(s, a, \mathbf{w}) \approx q_\pi(s, a)$ - for a fixed state, this last quantity is an array of values, one for each action $a$).
    • This not only avoids storing a value for every state, it also lets us generalize from seen states to unseen ones.
  • We update the parameters $\mathbf{w}$ using MC or TD learning.

What Function Approximator to Use

  • Possible candidates:
    • NN
    • Linear combinations of features
    • Decision Trees
    • Nearest neighbor
    • Fourier / wavelet bases
  • Properties we seek:
    • Differentiable - so we can use gradient descent to incrementally adjust params
    • Suitable for non-stationary, non-i.i.d. data - because we keep exploring new areas and discovering new rewards, the value function and the policy keep changing.

Gradient Descent Methods

Linear Value Function Approximation

  • Input: the state $s$, and the true value function $v_\pi(s)$ (let’s say we have an oracle for this).
    • We can represent the state by a feature vector $\mathbf{x}(s)$.
  • Output: the estimated value function $\hat{v}(s, \mathbf{w})$.
    • This is then just $\hat{v}(s, \mathbf{w}) = \mathbf{x}(s)^\top \mathbf{w}$ - a simple linear transformation of the state features.
  • Loss function $J(\mathbf{w})$: MSE b/w the true value function and the estimated value function.
    • This is convex for linear approximation, so there are no local minima to deal with.
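In symbols, the objective and the resulting stochastic gradient descent update take the standard form ($\alpha$ is a step-size parameter; for the linear case the gradient is simply the feature vector):

```latex
J(\mathbf{w}) = \mathbb{E}_\pi\!\left[ \left( v_\pi(s) - \hat{v}(s, \mathbf{w}) \right)^2 \right]

\Delta \mathbf{w} = \alpha \left( v_\pi(s) - \hat{v}(s, \mathbf{w}) \right) \nabla_{\mathbf{w}} \hat{v}(s, \mathbf{w})
                  = \alpha \left( v_\pi(s) - \hat{v}(s, \mathbf{w}) \right) \mathbf{x}(s)
```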

Table lookup

is equivalent to linear approximation with a one-hot encoding of states as the feature vector. In this case, we end up just ‘selecting’ a value from the parameter vector - which is exactly a lookup.
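In symbols, with one indicator feature per state $s_1, \dots, s_n$:

```latex
\mathbf{x}(s) = \big( \mathbf{1}(s = s_1), \dots, \mathbf{1}(s = s_n) \big)^\top
\qquad\Rightarrow\qquad
\hat{v}(s, \mathbf{w}) = \mathbf{x}(s)^\top \mathbf{w} = w_s
```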

Working without an Oracle

  • We use MC or TD to substitute a target computed from experience for the (unknown) true value function.
    • So we do supervised learning on the MC returns or TD targets (written out below).
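Concretely, the target substituted for $v_\pi(S_t)$ is:

```latex
\text{MC:} \quad G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots
\qquad\qquad
\text{TD(0):} \quad R_{t+1} + \gamma\, \hat{v}(S_{t+1}, \mathbf{w})
```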

Monte-Carlo with Value Function Approximation

  • MC returns are our training data - the return $G_t$ is an unbiased, noisy sample of the true value $v_\pi(S_t)$.
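The corresponding SGD update (for the linear case $\nabla_{\mathbf{w}} \hat{v}(S_t, \mathbf{w}) = \mathbf{x}(S_t)$):

```latex
\Delta \mathbf{w} = \alpha \left( G_t - \hat{v}(S_t, \mathbf{w}) \right) \nabla_{\mathbf{w}} \hat{v}(S_t, \mathbf{w})
```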

TD Learning with Value Function Approximation

  • The TD target $R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w})$ is a biased sample of the true value $v_\pi(S_t)$.
    • Biased because we use the same approximator to compute the target values.
  • Can still use them as training data.
  • Remember that this is all incremental - we update the approximator at each step of “training data”, and then use the updated weights to generate the next TD targets.
  • Linear TD(0) converges (close) to the global optimum, even though the target is biased.
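A minimal sketch of linear semi-gradient TD(0) prediction, assuming a Gymnasium-style environment and user-supplied `policy(state)` and `features(state)` functions (all names here are illustrative, not from the source):

```python
import numpy as np

def linear_td0(env, policy, features, n_features,
               episodes=500, alpha=0.01, gamma=0.99):
    """Semi-gradient TD(0) prediction with v_hat(s, w) = features(s) . w."""
    w = np.zeros(n_features)
    for _ in range(episodes):
        state, _ = env.reset()
        done = False
        while not done:
            next_state, reward, terminated, truncated, _ = env.step(policy(state))
            done = terminated or truncated
            x = features(state)
            # Bootstrapped TD target: built from the *current* weights, hence biased
            target = reward + (0.0 if terminated else gamma * features(next_state) @ w)
            # For a linear approximator, grad_w v_hat(s, w) is just the feature vector x
            w += alpha * (target - x @ w) * x
            state = next_state
    return w
```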

Control with Value Function Approximation

Action Value Function Approximation

  • Input: the state $s$, the action $a$, and the true action value function $q_\pi(s, a)$ (let’s say we have an oracle for this).
    • We can represent state-action pairs by a feature vector $\mathbf{x}(s, a)$.
  • Output: the estimated action value function $\hat{q}(s, a, \mathbf{w})$.
    • This is then just $\hat{q}(s, a, \mathbf{w}) = \mathbf{x}(s, a)^\top \mathbf{w}$ - a simple linear transformation of the state-action features.
  • Loss function $J(\mathbf{w})$: MSE b/w the true action value function and the estimated action value function.
  • The same MC / TD target machinery used for the state value function works for the action value function too (see the update below).
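For example, a Sarsa-style TD update in the linear case:

```latex
\Delta \mathbf{w} = \alpha \left( R_{t+1} + \gamma\, \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}) - \hat{q}(S_t, A_t, \mathbf{w}) \right) \mathbf{x}(S_t, A_t)
```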

Convergence of Prediction Algorithms

  • TD can blow up (diverge) in certain cases instead of converging.
    • TD does not follow the gradient of any objective function.
  • Gradient TD follows the true gradient of the projected Bellman error and always converges.
| On/Off-Policy | Algorithm   | Table Lookup | Linear | Non-Linear |
|---------------|-------------|--------------|--------|------------|
| On-Policy     | MC          | ✓            | ✓      | ✓          |
|               | TD          | ✓            | ✓      | ✗          |
|               | Gradient TD | ✓            | ✓      | ✓          |
| Off-Policy    | MC          | ✓            | ✓      | ✓          |
|               | TD          | ✓            | ✗      | ✗          |
|               | Gradient TD | ✓            | ✓      | ✓          |

Convergence of Control Algorithms

| Algorithm           | Table Lookup | Linear | Non-Linear |
|---------------------|--------------|--------|------------|
| Monte-Carlo Control | ✓            | (✓)    | ✗          |
| Sarsa               | ✓            | (✓)    | ✗          |
| Q-learning          | ✓            | ✗      | ✗          |
| Gradient Q-learning | ✓            | ✓      | ✗          |

(✓) = chatters (“flirts”) around the near-optimal value function.

Batch Methods

  • Incremental methods are not sample efficient - we don’t learn a whole lot from seeing an experience just once, since we only make small updates.
  • Solution: Experience Replay
    • Store all experience data (⟨state, value⟩ pairs) from the oracle/expert policy.
    • Sample mini-batches from this dataset.
    • Applying SGD updates over mini-batches drawn from the experience replay, in a supervised-learning fashion with an MSE loss, gives us a neat least-squares fit to the “expert/oracle” targets (see the sketch below).
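A minimal sketch of this for the linear case (class and function names are illustrative, not from the source):

```python
import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """Fixed-size store of (state_features, target_value) experience pairs."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state_features, target_value):
        self.buffer.append((state_features, target_value))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        xs, ys = zip(*batch)
        return np.array(xs), np.array(ys)

def batch_sgd_step(w, buffer, batch_size=32, alpha=0.01):
    """One mini-batch SGD step on the MSE between x . w and the stored targets."""
    xs, ys = buffer.sample(batch_size)            # shapes (B, n_features), (B,)
    preds = xs @ w                                # linear value estimates
    grad = -2.0 * xs.T @ (ys - preds) / len(ys)   # gradient of the mean squared error
    return w - alpha * grad
```

Sampling uniformly at random from the buffer, rather than using experiences in the order they arrived, is what de-correlates the updates.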

DQN: Deep Q-Networks

  • Proposed in Mnih et al. (2013).
  • Two key ideas:
    • Experience replay with mini-batch SGD breaks the correlations between successive time steps and also improves sample efficiency, since each sample is seen multiple times.
    • Q-learning targets are computed w.r.t. a network whose parameters were frozen a little while back (say, 1000 steps ago) - this stabilizes training and prevents blow-up. After that many steps, swap in the current network as the new frozen network and continue.
  • Steps:
    • Take an action according to an ε-greedy policy.
    • Store the transition $(s_t, a_t, r_{t+1}, s_{t+1})$ in the replay memory $\mathcal{D}$.
    • Sample a random mini-batch of transitions from $\mathcal{D}$.
    • Compute the Q-learning targets $r + \gamma \max_{a'} Q(s', a'; \mathbf{w}^-)$ w.r.t. the old, fixed network parameters $\mathbf{w}^-$.
    • Do a mini-batch SGD optimization step on the MSE between the network’s Q-values and these targets (see the sketch below).
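A condensed sketch of these steps (PyTorch is used for illustration; the network sizes, hyper-parameters, and state/action dimensions are assumptions, not taken from the paper):

```python
import random
from collections import deque

import torch
import torch.nn as nn
import torch.optim as optim

def make_q_network(n_obs, n_actions):
    # Small MLP stand-in for the paper's convolutional network (illustrative sizes)
    return nn.Sequential(nn.Linear(n_obs, 64), nn.ReLU(), nn.Linear(64, n_actions))

q_net = make_q_network(4, 2)                    # online network, weights w
target_net = make_q_network(4, 2)               # frozen network, weights w-
target_net.load_state_dict(q_net.state_dict())
optimizer = optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=100_000)                  # holds (s, a, r, s', done) tuples
gamma, batch_size, sync_every = 0.99, 32, 1000

def act(state, n_actions=2, epsilon=0.1):
    """ε-greedy action selection from the online network."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        return int(q_net(torch.as_tensor(state, dtype=torch.float32)).argmax())

def train_step(step):
    """One mini-batch SGD step on Q-learning targets from the frozen network."""
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)
    s, a, r, s2, done = (torch.as_tensor(x, dtype=torch.float32) for x in zip(*batch))
    with torch.no_grad():                       # targets use the old, fixed weights w-
        targets = r + gamma * (1.0 - done) * target_net(s2).max(dim=1).values
    q_values = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % sync_every == 0:                  # periodically swap in the new weights
        target_net.load_state_dict(q_net.state_dict())
```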
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013). Playing Atari with Deep Reinforcement Learning. arXiv. https://doi.org/10.48550/arXiv.1312.5602