• For most real problems, the numbers of states and actions are too large to build state/action value function “tables” that store a value for each state (or state-action pair).
    • Even if we could store them, processing them would be too slow / intractable.
  • Can we build approximate functions for the state/action value functions? Yes - NNs are universal function approximators.
  • We are going to build a parametric function $\hat{v}(s, \mathbf{w})$ with weights $\mathbf{w}$ that approximates $v_\pi(s)$ (and similarly $\hat{q}(s, a, \mathbf{w}) \approx q_\pi(s, a)$ - for a fixed state, this last quantity is an array of values, one for each action $a$).
    • This not only avoids storing a value for every state, it also lets us generalize from seen states to unseen ones.
  • We update the parameters $\mathbf{w}$ using MC or TD learning.

What Function Approximator to Use

  • Possible candidates:
    • NN
    • Linear combinations of features
    • Decision Trees
    • Nearest neighbor
    • Fourier / wavelet bases
  • Properties we seek:
    • Differentiable - so we can use gradient descent to incrementally adjust params
    • Suitable for non-stationary, non-i.i.d. data - because we keep exploring new areas and discovering new rewards, the value function and the policy keep changing.

Gradient Descent Methods

Linear Value Function Approximation

  • Input: the state $s$, and the true value function $v_\pi(s)$ (let’s say we have an oracle for this).
    • We can represent the state by a feature vector $\mathbf{x}(s)$.
  • Output: the estimated value function $\hat{v}(s, \mathbf{w})$.
    • This is then just $\hat{v}(s, \mathbf{w}) = \mathbf{x}(s)^\top \mathbf{w}$ - a simple linear transformation of the state features.
  • Loss function $J(\mathbf{w})$: MSE b/w the true value function and the estimated value function.
    • This is convex for linear approximation, so there are no local minima to deal with.
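In symbols, the objective and the resulting stochastic gradient descent update take the standard form ($\alpha$ is a step-size parameter; for the linear case the gradient is simply the feature vector):

```latex
J(\mathbf{w}) = \mathbb{E}_\pi\!\left[ \left( v_\pi(s) - \hat{v}(s, \mathbf{w}) \right)^2 \right]

\Delta \mathbf{w} = \alpha \left( v_\pi(s) - \hat{v}(s, \mathbf{w}) \right) \nabla_{\mathbf{w}} \hat{v}(s, \mathbf{w})
                  = \alpha \left( v_\pi(s) - \hat{v}(s, \mathbf{w}) \right) \mathbf{x}(s)
```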

Table lookup

is equivalent to linear approximation with a one-hot encoding of states as the feature vector. In this case, we end up just ‘selecting’ a value from the parameter vector - which is exactly a lookup.
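In symbols, with one indicator feature per state $s_1, \dots, s_n$:

```latex
\mathbf{x}(s) = \big( \mathbf{1}(s = s_1), \dots, \mathbf{1}(s = s_n) \big)^\top
\qquad\Rightarrow\qquad
\hat{v}(s, \mathbf{w}) = \mathbf{x}(s)^\top \mathbf{w} = w_s
```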

Working without an Oracle

  • We use MC or TD to substitute a target computed from experience for the (unknown) true value function.
    • So we do supervised learning on the MC returns or TD targets (written out below).
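Concretely, the target substituted for $v_\pi(S_t)$ is:

```latex
\text{MC:} \quad G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots
\qquad\qquad
\text{TD(0):} \quad R_{t+1} + \gamma\, \hat{v}(S_{t+1}, \mathbf{w})
```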

Monte-Carlo with Value Function Approximation

  • MC returns are our training data - the return $G_t$ is an unbiased, noisy sample of the true value $v_\pi(S_t)$.
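The corresponding SGD update (for the linear case $\nabla_{\mathbf{w}} \hat{v}(S_t, \mathbf{w}) = \mathbf{x}(S_t)$):

```latex
\Delta \mathbf{w} = \alpha \left( G_t - \hat{v}(S_t, \mathbf{w}) \right) \nabla_{\mathbf{w}} \hat{v}(S_t, \mathbf{w})
```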

TD Learning with Value Function Approximation

  • The TD target $R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w})$ is a biased sample of the true value $v_\pi(S_t)$.
    • Biased because we use the same approximator to compute the target values.
  • Can still use them as training data.
  • Remember that this is all incremental - we update the approximator at each step of “training data”, and then use the updated weights to generate the next TD targets.
  • Linear TD(0) converges (close) to the global optimum, even though the target is biased.
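A minimal sketch of linear semi-gradient TD(0) prediction, assuming a Gymnasium-style environment and user-supplied `policy(state)` and `features(state)` functions (all names here are illustrative, not from the source):

```python
import numpy as np

def linear_td0(env, policy, features, n_features,
               episodes=500, alpha=0.01, gamma=0.99):
    """Semi-gradient TD(0) prediction with v_hat(s, w) = features(s) . w."""
    w = np.zeros(n_features)
    for _ in range(episodes):
        state, _ = env.reset()
        done = False
        while not done:
            next_state, reward, terminated, truncated, _ = env.step(policy(state))
            done = terminated or truncated
            x = features(state)
            # Bootstrapped TD target: built from the *current* weights, hence biased
            target = reward + (0.0 if terminated else gamma * features(next_state) @ w)
            # For a linear approximator, grad_w v_hat(s, w) is just the feature vector x
            w += alpha * (target - x @ w) * x
            state = next_state
    return w
```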

Control with Value Function Approximation

Action Value Function Approximation

  • Input: the state $s$, the action $a$, and the true action value function $q_\pi(s, a)$ (let’s say we have an oracle for this).
    • We can represent state-action pairs by a feature vector $\mathbf{x}(s, a)$.
  • Output: the estimated action value function $\hat{q}(s, a, \mathbf{w})$.
    • This is then just $\hat{q}(s, a, \mathbf{w}) = \mathbf{x}(s, a)^\top \mathbf{w}$ - a simple linear transformation of the state-action features.
  • Loss function $J(\mathbf{w})$: MSE b/w the true action value function and the estimated action value function.
  • The same MC / TD target machinery used for the state value function works for the action value function too (see the update below).
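For example, a Sarsa-style TD update in the linear case:

```latex
\Delta \mathbf{w} = \alpha \left( R_{t+1} + \gamma\, \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}) - \hat{q}(S_t, A_t, \mathbf{w}) \right) \mathbf{x}(S_t, A_t)
```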

Convergence of Prediction Algorithms

  • TD can blow up (diverge) in certain cases instead of converging.
    • TD does not follow the gradient of any objective function.
  • Gradient TD follows the true gradient of the projected Bellman error and always converges.
| On/Off-Policy | Algorithm   | Table Lookup | Linear | Non-Linear |
|---------------|-------------|--------------|--------|------------|
| On-Policy     | MC          | ✓            | ✓      | ✓          |
|               | TD          | ✓            | ✓      | ✗          |
|               | Gradient TD | ✓            | ✓      | ✓          |
| Off-Policy    | MC          | ✓            | ✓      | ✓          |
|               | TD          | ✓            | ✗      | ✗          |
|               | Gradient TD | ✓            | ✓      | ✓          |

Convergence of Control Algorithms

| Algorithm           | Table Lookup | Linear | Non-Linear |
|---------------------|--------------|--------|------------|
| Monte-Carlo Control | ✓            | (✓)    | ✗          |
| Sarsa               | ✓            | (✓)    | ✗          |
| Q-learning          | ✓            | ✗      | ✗          |
| Gradient Q-learning | ✓            | ✓      | ✗          |

(✓) = chatters (“flirts”) around the near-optimal value function.

Batch Methods

  • Incremental methods are not sample efficient - we don’t learn a whole lot from seeing an experience just once, since we only make small updates.
  • Solution: Experience Replay
    • Store all experience data (⟨state, value⟩ pairs) from the oracle/expert policy.
    • Sample mini-batches from this dataset.
    • Applying SGD updates over mini-batches drawn from the experience replay, in a supervised-learning fashion with an MSE loss, gives us a neat least-squares fit to the “expert/oracle” targets (see the sketch below).
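A minimal sketch of this for the linear case (class and function names are illustrative, not from the source):

```python
import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """Fixed-size store of (state_features, target_value) experience pairs."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state_features, target_value):
        self.buffer.append((state_features, target_value))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        xs, ys = zip(*batch)
        return np.array(xs), np.array(ys)

def batch_sgd_step(w, buffer, batch_size=32, alpha=0.01):
    """One mini-batch SGD step on the MSE between x . w and the stored targets."""
    xs, ys = buffer.sample(batch_size)            # shapes (B, n_features), (B,)
    preds = xs @ w                                # linear value estimates
    grad = -2.0 * xs.T @ (ys - preds) / len(ys)   # gradient of the mean squared error
    return w - alpha * grad
```

Sampling uniformly at random from the buffer, rather than using experiences in the order they arrived, is what de-correlates the updates.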

DQN: Deep Q-Networks

  • Proposed in Mnih et al. (2013).
  • Two key ideas:
    • Experience replay with mini-batch SGD breaks the correlations between successive time steps and also improves sample efficiency, since each sample is seen multiple times.
    • Q-learning targets are computed w.r.t. a network whose parameters were frozen a little while back (say, 1000 steps ago) - this stabilizes training and prevents blow-up. After that many steps, swap in the current network as the new frozen network and continue.
  • Steps:
    • Take an action according to an ε-greedy policy.
    • Store the transition $(s_t, a_t, r_{t+1}, s_{t+1})$ in the replay memory $\mathcal{D}$.
    • Sample a random mini-batch of transitions from $\mathcal{D}$.
    • Compute the Q-learning targets $r + \gamma \max_{a'} Q(s', a'; \mathbf{w}^-)$ w.r.t. the old, fixed network parameters $\mathbf{w}^-$.
    • Do a mini-batch SGD optimization step on the MSE between the network’s Q-values and these targets (see the sketch below).
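A condensed sketch of these steps (PyTorch is used for illustration; the network sizes, hyper-parameters, and state/action dimensions are assumptions, not taken from the paper):

```python
import random
from collections import deque

import torch
import torch.nn as nn
import torch.optim as optim

def make_q_network(n_obs, n_actions):
    # Small MLP stand-in for the paper's convolutional network (illustrative sizes)
    return nn.Sequential(nn.Linear(n_obs, 64), nn.ReLU(), nn.Linear(64, n_actions))

q_net = make_q_network(4, 2)                    # online network, weights w
target_net = make_q_network(4, 2)               # frozen network, weights w-
target_net.load_state_dict(q_net.state_dict())
optimizer = optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=100_000)                  # holds (s, a, r, s', done) tuples
gamma, batch_size, sync_every = 0.99, 32, 1000

def act(state, n_actions=2, epsilon=0.1):
    """ε-greedy action selection from the online network."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        return int(q_net(torch.as_tensor(state, dtype=torch.float32)).argmax())

def train_step(step):
    """One mini-batch SGD step on Q-learning targets from the frozen network."""
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)
    s, a, r, s2, done = (torch.as_tensor(x, dtype=torch.float32) for x in zip(*batch))
    with torch.no_grad():                       # targets use the old, fixed weights w-
        targets = r + gamma * (1.0 - done) * target_net(s2).max(dim=1).values
    q_values = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % sync_every == 0:                  # periodically swap in the new weights
        target_net.load_state_dict(q_net.state_dict())
```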
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013). Playing Atari with Deep Reinforcement Learning. arXiv. https://doi.org/10.48550/arXiv.1312.5602