Classic games = games people have been playing for a while, like Chess, Go, Poker, etc.
Why study games?
Most such games have simple rules but deep underlying concepts
A reasonable IQ test for AI
Humans have built up strong strategies for them over thousands of years
Some of them, like poker, have hidden states.
Optimality in Games
What is the optimal policy $\pi^i$ for the $i$-th player?
If all other players fix their policies $\pi^{-i}$
Best response $\pi^i_*(\pi^{-i})$ is the optimal policy against all those other (fixed) policies
In other words, we only need to play as well as possible against the policies the other players are actually using.
Nash Equilibrium is a joint policy for all the players such that
$\pi^i = \pi^i_*(\pi^{-i})$ for every player $i$
The best policy for each player - every policy is the optimal response to all the other players' policies
No rational player would choose to deviate from such a policy: against the others' fixed policies, any new policy can only do the same or worse, so the deviating player gains nothing.
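As a concrete illustration, here is a minimal sketch of best response and the Nash fixed-point check in a one-shot matrix game (rock-paper-scissors; the payoff matrix and helper names are mine, not from the lecture):

```python
import numpy as np

# Row player's payoffs in rock-paper-scissors; zero-sum, so the
# column player's payoff is the negation.
A = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]])

def best_response(payoffs, opponent):
    """Best response to a fixed opponent policy: play the action with the
    highest expected payoff (a pure policy always suffices here)."""
    expected = payoffs @ opponent
    br = np.zeros(len(expected))
    br[np.argmax(expected)] = 1.0
    return br

def is_nash(A, p1, p2, tol=1e-9):
    """Nash check: neither player can gain by unilaterally deviating."""
    v = p1 @ A @ p2            # row player's value under (p1, p2)
    best1 = np.max(A @ p2)     # best deviation value for the row player
    best2 = np.max(-(p1 @ A))  # best deviation value for the column player
    return best1 <= v + tol and best2 <= -v + tol

uniform = np.ones(3) / 3
print(is_nash(A, uniform, uniform))                # True: uniform play is the equilibrium
print(is_nash(A, np.array([1.0, 0, 0]), uniform))  # False: always-rock can be exploited
```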
Single-Agent and Self-Play RL
Best response as the solution to a single-agent RL problem:
All the other players are treated as part of the environment
The game is then reduced to an MDP
The best response is the optimal policy for this MDP
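A sketch of that reduction, assuming a hypothetical turn-based game object with `reset`/`step`/`state`/`done` members (the interface and all names are illustrative, not a specific library):

```python
class OpponentAsEnvironment:
    """Fold a *fixed* opponent policy into the environment so that the
    remaining player faces an ordinary single-agent MDP.
    Assumes game.step(move) returns the reward from the learner's viewpoint."""

    def __init__(self, game, opponent_policy):
        self.game = game
        self.opponent_policy = opponent_policy  # fixed: part of the dynamics

    def reset(self):
        self.game.reset()
        return self.game.state

    def step(self, action):
        # The learner's move...
        reward = self.game.step(action)
        # ...and the opponent's fixed reply together form one MDP transition.
        if not self.game.done:
            reward += self.game.step(self.opponent_policy(self.game.state))
        return self.game.state, reward, self.game.done
```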
A Nash equilibrium is a fixed point of self-play RL
Self-Play
I keep playing games against myself, learning by RL
All the agents' policies keep improving by playing against each other.
We keep iterating and improving policies until we converge to a point where every policy is a best response to all the others
That fixed point is a Nash equilibrium
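One classical self-play scheme that realizes this loop is fictitious play: each player best-responds to the opponent's empirical average policy. A sketch, reusing the illustrative `A` and `best_response` from above:

```python
def fictitious_play(A, iterations=10_000):
    """Self-play by iterated best response against the opponent's empirical
    average policy (fictitious play). In two-player zero-sum games the
    average policies converge to a Nash equilibrium."""
    counts1 = np.ones(A.shape[0])  # action counts, initialised uniform
    counts2 = np.ones(A.shape[1])
    for _ in range(iterations):
        avg1 = counts1 / counts1.sum()
        avg2 = counts2 / counts2.sum()
        counts1 += best_response(A, avg2)     # player 1 improves vs player 2
        counts2 += best_response(-A.T, avg1)  # player 2 improves vs player 1
    return counts1 / counts1.sum(), counts2 / counts2.sum()

p1, p2 = fictitious_play(A)
print(p1.round(2), p2.round(2))  # both drift toward uniform: the fixed point
```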
Two-Player Zero-Sum Game
Two players with alternating turns
A win/point for one player is a loss for the other - the rewards are equal and opposite:
$R^1 + R^2 = 0$
Perfect Information vs. Imperfect Information Games
Perfect Information = fully observed
Chess
Go
Backgammon
…etc.
Imperfect Information = Partially observed, some hidden states
Poker
Scrabble
Minimax
A value function here is the expected total reward given the joint policy of all the players.
$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$
In zero-sum two-player games, we only need one value function: white's value, negated, is black's value.
A minimax value function maximizes white's expected return while minimizing black's.
$v_*(s) = \max_{\pi^1} \min_{\pi^2} v_\pi(s)$
A minimax policy is a joint policy $\pi = \langle \pi^1, \pi^2 \rangle$ that achieves the minimax value
There is a unique minimax value function.
A minimax policy is always a Nash equilibrium.
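A tiny worked instance of the definition (the leaf values are invented for illustration): suppose white picks move $a$ or $b$, black then replies, and the leaves hold white's return. Then

$$v_*(s) = \max\bigl(\min(3, 1),\ \min(0, 5)\bigr) = \max(1, 0) = 1$$

so white plays $a$, guaranteeing a return of at least 1 whatever black does.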
Minimax Search
Build the game tree of all possible continuations down to the leaves, where the terminal rewards live. At each node, choose a child value according to whether the player to move is maximizing or minimizing (zero-sum game: rewards that are negative for one player are good for the other). Propagate the chosen values back up to the root, which yields the optimal play policy.
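A minimal recursive sketch of exactly this procedure, assuming a generic game-state interface (`is_terminal`, `terminal_reward`, `legal_moves`, `play` are illustrative names, with rewards from the maximizing player's perspective):

```python
def minimax(state, maximizing):
    """Exhaustive minimax: recurse to the leaves where the rewards live,
    then back values up, maximizing or minimizing by whose turn it is."""
    if state.is_terminal():
        return state.terminal_reward()  # reward from white's perspective
    values = [minimax(state.play(move), not maximizing)
              for move in state.legal_moves()]
    return max(values) if maximizing else min(values)
```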
Problem: the search tree grows exponentially, making exhaustive search impractical for any realistic game of interest
Solution: build a value function approximator (aka a heuristic) to estimate minimax values: run minimax search only to a fixed depth, evaluate the frontier positions with the approximator, then back those values up to the root, as in the sketch below.
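Only the base case changes in the depth-limited version: at the cutoff, the heuristic stands in for the true minimax value (`evaluate` is a placeholder for whatever approximator is used, e.g. a hand-crafted evaluation in classic chess engines):

```python
def minimax_depth_limited(state, depth, maximizing, evaluate):
    """Search to a fixed depth, substitute the value-function approximation
    at the frontier, and back those estimates up as if they were exact."""
    if state.is_terminal():
        return state.terminal_reward()
    if depth == 0:
        return evaluate(state)  # heuristic leaf evaluation
    values = [minimax_depth_limited(state.play(move), depth - 1,
                                    not maximizing, evaluate)
              for move in state.legal_moves()]
    return max(values) if maximizing else min(values)
```

In practice this is usually combined with alpha-beta pruning, which skips branches that provably cannot affect the backed-up value.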