- For most real problems, the number of states and actions is too large to build state/action value function “tables” that store a value for each state.
- Even if we could store them, processing them would be too slow/intractable.
- Can we build approximate functions for state/action value functions? Yes - NNs are universal function approximators.
- We are going to build a parametric function $\hat{v}(s, \mathbf{w})$ with weights $\mathbf{w}$ that approximates $v_\pi(s)$, or $\hat{q}(s, a, \mathbf{w})$ that approximates $q_\pi(s, a)$ (this last one can also be produced as an array of $\hat{q}(s, a, \mathbf{w})$ for all $a$).
- This not only frees us from storing all the values, but also lets us generalize from seen states to unseen states.
- We update the parameters $\mathbf{w}$ with MC or TD learning.
What Function Approximator to Use
- Possible candidates:
- NN
- Linear combinations of features
- Decision Trees
- Nearest neighbor
- Fourier/wavelet bases
- Properties we seek:
- Differentiable - so we can use gradient descent to incrementally adjust params
- Suitable for non-stationary, non-i.i.d. data - because we keep exploring new areas and discovering new rewards, so the value function and policy keep changing.
Gradient Descent Methods
Linear Value Function Approximation
- Input: The state $s$, and the true value function $v_\pi(s)$ (let’s say we have an oracle for this).
- We can represent states by a feature vector $\mathbf{x}(s) = (x_1(s), \ldots, x_n(s))^\top$.
- Output: Estimated value function $\hat{v}(s, \mathbf{w})$.
- $\hat{v}(s, \mathbf{w})$ is then just $\mathbf{x}(s)^\top \mathbf{w} = \sum_j x_j(s)\, w_j$ - a simple linear transformation of the state features.
- Loss function: MSE b/w the true value function and the estimated value function, $J(\mathbf{w}) = \mathbb{E}_\pi\big[\big(v_\pi(S) - \hat{v}(S, \mathbf{w})\big)^2\big]$.
- $J(\mathbf{w})$ is convex for linear approximation, so there are no local minima to deal with. (A minimal SGD sketch follows below.)
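To make the update concrete, here is a minimal sketch of SGD on this MSE loss with a linear approximator. The `features` map, the `true_value` oracle, the step size, and the toy loop are all hypothetical stand-ins for illustration, not a prescribed implementation.

```python
import numpy as np

# Minimal sketch of linear value-function approximation trained by SGD on the
# MSE loss above. `features` and `true_value` (the oracle) are hypothetical.

def sgd_update(w, s, true_value, features, alpha=0.01):
    """One SGD step on (v_pi(s) - x(s)^T w)^2."""
    x = features(s)                  # feature vector x(s)
    error = true_value(s) - x @ w    # prediction error against the oracle
    return w + alpha * error * x     # gradient is -2*error*x; the 2 is folded into alpha

# Toy usage: 4 hand-picked features, a made-up "oracle" value function.
rng = np.random.default_rng(0)
features = lambda s: np.array([1.0, s, s**2, np.sin(s)])
true_value = lambda s: 2.0 * s - 1.0
w = np.zeros(4)
for _ in range(2000):
    w = sgd_update(w, rng.uniform(-1.0, 1.0), true_value, features)
```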
Table lookup
is equivalent to linear approximation with a one-hot encoding of states in the feature vector. In this case, we end up just ‘selecting’ a value from the parameter vector, which is a lookup.
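In symbols, using the notation above: with a one-hot feature vector, the dot product just selects one component of $\mathbf{w}$.

$$
\mathbf{x}(s) = \big(\mathbf{1}(s = s_1), \ldots, \mathbf{1}(s = s_n)\big)^\top
\quad\Longrightarrow\quad
\hat{v}(s_k, \mathbf{w}) = \mathbf{x}(s_k)^\top \mathbf{w} = w_k
$$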
Working without an Oracle
- We use MC or TD to substitute a sampled target (the MC return or the TD target) for the true value function.
- So we do supervised learning on the MC returns or TD targets, as written out below.
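Concretely, the oracle value $v_\pi(S_t)$ in the gradient update is replaced by a sampled target:

$$
\text{MC:}\quad \Delta\mathbf{w} = \alpha\,\big(G_t - \hat{v}(S_t, \mathbf{w})\big)\,\nabla_{\mathbf{w}}\hat{v}(S_t, \mathbf{w})
\qquad
\text{TD(0):}\quad \Delta\mathbf{w} = \alpha\,\big(R_{t+1} + \gamma\,\hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w})\big)\,\nabla_{\mathbf{w}}\hat{v}(S_t, \mathbf{w})
$$

For linear approximation, $\nabla_{\mathbf{w}}\hat{v}(S_t, \mathbf{w}) = \mathbf{x}(S_t)$, so each update nudges the weights along the feature vector.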
Monte-Carlo with Value Function Approximation
- MC returns are our training data - the return $G_t$ is an unbiased, noisy sample of the true value $v_\pi(S_t)$ (a per-episode update is sketched below).
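A minimal sketch of every-visit MC prediction with linear features. The `features(s)` map and the episode format (a list of `(S_t, R_{t+1})` pairs, each state paired with the reward that followed it) are assumptions for illustration.

```python
import numpy as np

# Every-visit Monte-Carlo prediction with a linear value function.
def mc_episode_update(w, episode, features, alpha=0.01, gamma=0.99):
    """Update w once per visited state, using the MC return G_t as the target."""
    G = 0.0
    for s, r in reversed(episode):        # walk backwards to accumulate returns cheaply
        G = r + gamma * G                 # G_t = R_{t+1} + gamma * G_{t+1}
        x = features(s)
        w = w + alpha * (G - x @ w) * x   # SGD step toward the unbiased MC target
    return w
```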
TD Learning with Value Function Approximation
- The TD target $R_{t+1} + \gamma\,\hat{v}(S_{t+1}, \mathbf{w})$ is a biased sample of the true value $v_\pi(S_t)$.
- Biased because we use the same approximator to produce the target values.
- Can still use them as training data.
- Remember that this is all incremental - we update the approximator at each step of “training data”, and then use the updated weights to generate the next TD targets.
- Linear TD(0) converges (close) to the global optimum, even though the target is biased. (A sketch of the semi-gradient update follows.)
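A sketch of one semi-gradient TD(0) step with linear features; the feature vectors are assumed to be precomputed NumPy arrays and the defaults are placeholders. Note that the target uses the current weights but we do not differentiate through it (hence “semi-gradient”).

```python
# One incremental, semi-gradient TD(0) step with linear features.
# x_s and x_next are feature vectors for S_t and S_{t+1}; r is R_{t+1}.
def td0_update(w, x_s, r, x_next, done, alpha=0.01, gamma=0.99):
    target = r + (0.0 if done else gamma * (x_next @ w))  # bootstrapped, biased target
    td_error = target - x_s @ w
    return w + alpha * td_error * x_s  # gradient taken only through v_hat(S_t, w), not the target
```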
Control with Value Function Approximation
Action Value Function Approximation
- Input: The state, the action, and the true action value function $q_\pi(s, a)$ (let’s say we have an oracle for this).
- We can represent state-action pairs by a feature vector $\mathbf{x}(s, a)$.
- Output: Estimated action value function $\hat{q}(s, a, \mathbf{w})$.
- $\hat{q}(s, a, \mathbf{w})$ is then just $\mathbf{x}(s, a)^\top \mathbf{w}$ - a simple linear transformation of the state-action features.
- Loss function: MSE b/w the true action value function and the estimated action value function, $J(\mathbf{w}) = \mathbb{E}_\pi\big[\big(q_\pi(S, A) - \hat{q}(S, A, \mathbf{w})\big)^2\big]$.
- The same machinery as for the state value function (MC and TD targets) works for the action value function too - see the sketch below.
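The same idea for action values, sketched with a SARSA-style target. `features(s, a)` is a hypothetical state-action feature map returning a NumPy array; names and defaults are assumptions for illustration.

```python
# Linear action-value approximation: q_hat(s, a, w) = x(s, a)^T w.
def q_hat(w, s, a, features):
    return features(s, a) @ w

def sarsa_update(w, s, a, r, s_next, a_next, features, alpha=0.01, gamma=0.99):
    """One step toward the on-policy TD target R + gamma * q_hat(S', A', w)."""
    target = r + gamma * q_hat(w, s_next, a_next, features)
    x = features(s, a)
    return w + alpha * (target - x @ w) * x
```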
Convergence of Prediction Algorithms
- TD can blow up (diverge) in certain cases instead of converging - particularly off-policy and/or with non-linear approximation.
- TD does not follow gradient of any objective function.
- Gradient TD follows true gradient of projected Bellman error and always converges.
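For reference (not spelled out in these notes), the objective that Gradient TD descends can be written as the mean-squared projected Bellman error, roughly:

$$
\text{MSPBE}(\mathbf{w}) = \big\lVert\, \Pi\big(T^{\pi}\hat{v}_{\mathbf{w}} - \hat{v}_{\mathbf{w}}\big) \big\rVert_{\mu}^{2}
$$

where $T^{\pi}$ is the Bellman operator, $\Pi$ projects onto the space of representable (e.g. linear) value functions, and $\mu$ weights states by how often they are visited.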
| On/Off-Policy | Algorithm | Table Lookup | Linear | Non-Linear |
|---|---|---|---|---|
| On-Policy | MC | ✓ | ✓ | ✓ |
| On-Policy | TD(0) | ✓ | ✓ | ✗ |
| On-Policy | TD(λ) | ✓ | ✓ | ✗ |
| Off-Policy | MC | ✓ | ✓ | ✓ |
| Off-Policy | TD(0) | ✓ | ✗ | ✗ |
| Off-Policy | TD(λ) | ✓ | ✗ | ✗ |
| On/Off-Policy | Gradient TD | ✓ | ✓ | ✓ |

(✓ = converges, ✗ = may diverge)
Convergence of Control Algorithms
| Algorithm | Table Lookup | Linear | Non-Linear |
|---|---|---|---|
| Monte-Carlo Control | ✓ | (✓) | ✗ |
| Sarsa | ✓ | (✓) | ✗ |
| Q-learning | ✓ | ✗ | ✗ |
| Gradient Q-learning | ✓ | ✓ | ✗ |

(✓) = chatters (hovers) around a near-optimal value function rather than converging to it; ✓ = converges; ✗ = may diverge.
Batch Methods
- Iterative methods are not sample efficient - we don’t learn a whole lot from seeing an experience just once, since we only make small updates.
- Solution: Experience Replay
- Store all experience data ($\langle s, v^\pi \rangle$ pairs) from the oracle/expert policy.
- Sample $\langle s, v^\pi \rangle$ pairs from this dataset.
- Applying SGD updates over batches sampled from the experience replay, in a supervised-learning fashion with an MSE loss, gives us a neat least-squares optimization against the “expert/oracle” targets - a sketch is below.
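A sketch of experience replay for this least-squares prediction setting: stored pairs are (feature vector of $s$, target value). The buffer size, batch size, and step size are arbitrary placeholder choices.

```python
import random
import numpy as np

class ReplayBuffer:
    def __init__(self, capacity=10_000):
        self.data = []
        self.capacity = capacity

    def add(self, x, target):
        if len(self.data) >= self.capacity:
            self.data.pop(0)            # drop the oldest experience
        self.data.append((x, target))

    def sample(self, batch_size):
        xs, targets = zip(*random.sample(self.data, batch_size))
        return np.stack(xs), np.array(targets)

def batch_sgd_step(w, buffer, batch_size=32, alpha=0.01):
    """Mini-batch SGD on the MSE between stored targets and x^T w."""
    X, t = buffer.sample(batch_size)
    errors = t - X @ w                  # per-sample prediction errors
    return w + alpha * X.T @ errors / batch_size
```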
DQN: Deep Q-Networks
- Proposed in Mnih et al. (2013).
- Two key ideas:
- Experience replay and mini-batch SGD break the correlations between samples from successive time steps, and also improve sample efficiency since each sample is seen multiple times.
- Computing Q-learning targets w.r.t. a network whose parameters were frozen a little while back (say, 1000 steps ago) - this stabilizes training and prevents blow-up. After that many steps, copy the current network into the frozen “target” network and continue.
- Steps:
- Take action $a_t$ according to an $\epsilon$-greedy policy.
- Store the transition $(s_t, a_t, r_{t+1}, s_{t+1})$ in replay memory $\mathcal{D}$.
- Sample a random mini-batch of transitions from $\mathcal{D}$.
- Compute Q-learning targets w.r.t. the old, fixed network parameters.
- Do a mini-batch SGD optimization step on the MSE between the network’s Q-values and these targets. (The full loop is sketched below.)
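A sketch of the loop above (replay memory + frozen target network). To keep it self-contained, a linear Q-function over a hypothetical `features(s)` map stands in for the deep network; `env` is assumed to expose `reset() -> s` and `step(a) -> (s_next, reward, done)`, and all hyperparameters are placeholder values rather than those of the paper.

```python
import random
import numpy as np

def dqn_train(env, features, n_actions, steps=10_000, alpha=1e-3, gamma=0.99,
              eps=0.1, batch_size=32, target_sync=1_000, capacity=50_000):
    dim = len(features(env.reset()))
    w = np.zeros((n_actions, dim))       # online Q "network" parameters
    w_target = w.copy()                  # frozen target-network parameters
    replay = []
    s = env.reset()
    for t in range(steps):
        # 1) epsilon-greedy action from the online Q-values
        if random.random() < eps:
            a = random.randrange(n_actions)          # explore
        else:
            a = int(np.argmax(w @ features(s)))      # exploit
        s_next, r, done = env.step(a)
        # 2) store the transition in replay memory
        replay.append((s, a, r, s_next, done))
        if len(replay) > capacity:
            replay.pop(0)
        s = env.reset() if done else s_next
        # 3)-5) sample a mini-batch, compute targets w.r.t. the frozen params, SGD step
        if len(replay) >= batch_size:
            for si, ai, ri, sn, d in random.sample(replay, batch_size):
                xi = features(si)
                target = ri + (0.0 if d else gamma * np.max(w_target @ features(sn)))
                w[ai] += alpha * (target - w[ai] @ xi) * xi
        # 6) periodically refresh the frozen target network
        if (t + 1) % target_sync == 0:
            w_target = w.copy()
    return w
```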
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013). Playing Atari with Deep Reinforcement Learning. arXiv. https://doi.org/10.48550/arXiv.1312.5602