Classic games = games people have been playing for a while, like Chess, Go, Poker, etc.
Why study games?
Most such games have simple rules but deep underlying concepts
A reasonable IQ test for AI
Humans have built up strong strategies for them over thousands of years
Some of them, like poker, have hidden states.
Optimality in Games
What is the optimal policy $\pi^i$ for the $i$-th player?
If all other players fix their policies $\pi^{-i}$
Best response $\pi^i_*(\pi^{-i})$ is the optimal policy against all those other (fixed) policies
In other words, we only need to play as well as possible against the policies the other players are actually using.
Nash Equilibrium is a joint policy for all the players such that
$\pi^i = \pi^i_*(\pi^{-i})$ for every player $i$
The best policy for each player - every policy is the optimal response to all the other players' policies
No rational player would choose to deviate from such a policy: against the others' fixed policies, any new policy can only do the same or worse, so the deviating player gains nothing.
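As a concrete illustration, here is a minimal sketch of best response and the Nash fixed-point check in a one-shot matrix game (rock-paper-scissors; the payoff matrix and helper names are mine, not from the lecture):

```python
import numpy as np

# Row player's payoffs in rock-paper-scissors; zero-sum, so the
# column player's payoff is the negation.
A = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]])

def best_response(payoffs, opponent):
    """Best response to a fixed opponent policy: play the action with the
    highest expected payoff (a pure policy always suffices here)."""
    expected = payoffs @ opponent
    br = np.zeros(len(expected))
    br[np.argmax(expected)] = 1.0
    return br

def is_nash(A, p1, p2, tol=1e-9):
    """Nash check: neither player can gain by unilaterally deviating."""
    v = p1 @ A @ p2            # row player's value under (p1, p2)
    best1 = np.max(A @ p2)     # best deviation value for the row player
    best2 = np.max(-(p1 @ A))  # best deviation value for the column player
    return best1 <= v + tol and best2 <= -v + tol

uniform = np.ones(3) / 3
print(is_nash(A, uniform, uniform))                # True: uniform play is the equilibrium
print(is_nash(A, np.array([1.0, 0, 0]), uniform))  # False: always-rock can be exploited
```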
Single-Agent and Self-Play RL
Best response as the solution to a single-agent RL problem:
All the other players are treated as part of the environment
The game is then reduced to an MDP
The best response is the optimal policy for this MDP
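A sketch of that reduction, assuming a hypothetical turn-based game object with `reset`/`step`/`state`/`done` members (the interface and all names are illustrative, not a specific library):

```python
class OpponentAsEnvironment:
    """Fold a *fixed* opponent policy into the environment so that the
    remaining player faces an ordinary single-agent MDP.
    Assumes game.step(move) returns the reward from the learner's viewpoint."""

    def __init__(self, game, opponent_policy):
        self.game = game
        self.opponent_policy = opponent_policy  # fixed: part of the dynamics

    def reset(self):
        self.game.reset()
        return self.game.state

    def step(self, action):
        # The learner's move...
        reward = self.game.step(action)
        # ...and the opponent's fixed reply together form one MDP transition.
        if not self.game.done:
            reward += self.game.step(self.opponent_policy(self.game.state))
        return self.game.state, reward, self.game.done
```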
A Nash equilibrium is a fixed point of self-play RL
Self-Play
I keep playing games against myself, learning by RL
All the agents' policies keep improving by playing against each other.
We keep iterating and improving policies until we converge to a point where every policy is a best response to all the others
That fixed point is a Nash equilibrium
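One classical self-play scheme that realizes this loop is fictitious play: each player best-responds to the opponent's empirical average policy. A sketch, reusing the illustrative `A` and `best_response` from above:

```python
def fictitious_play(A, iterations=10_000):
    """Self-play by iterated best response against the opponent's empirical
    average policy (fictitious play). In two-player zero-sum games the
    average policies converge to a Nash equilibrium."""
    counts1 = np.ones(A.shape[0])  # action counts, initialised uniform
    counts2 = np.ones(A.shape[1])
    for _ in range(iterations):
        avg1 = counts1 / counts1.sum()
        avg2 = counts2 / counts2.sum()
        counts1 += best_response(A, avg2)     # player 1 improves vs player 2
        counts2 += best_response(-A.T, avg1)  # player 2 improves vs player 1
    return counts1 / counts1.sum(), counts2 / counts2.sum()

p1, p2 = fictitious_play(A)
print(p1.round(2), p2.round(2))  # both drift toward uniform: the fixed point
```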
Two-Player Zero-Sum Game
Two players with alternating turns
A win/point for one player is a loss for the other - the rewards are equal and opposite:
$R^1 + R^2 = 0$
Perfect Information vs. Imperfect Information Games
Perfect Information = fully observed
Chess
Go
Backgammon
…etc.
Imperfect Information = Partially observed, some hidden states
Poker
Scrabble
Minimax
A value function here is the expected total reward given the joint policy of all the players.
$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$
In zero-sum two-player games, we only need one value function: white's value, negated, is black's value.
A minimax value function maximizes white's expected return while minimizing black's.
$v_*(s) = \max_{\pi^1} \min_{\pi^2} v_\pi(s)$
A minimax policy is a joint policy $\pi = \langle \pi^1, \pi^2 \rangle$ that achieves the minimax value
There is a unique minimax value function.
A minimax policy is always a Nash equilibrium.
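A tiny worked instance of the definition (the leaf values are invented for illustration): suppose white picks move $a$ or $b$, black then replies, and the leaves hold white's return. Then

$$v_*(s) = \max\bigl(\min(3, 1),\ \min(0, 5)\bigr) = \max(1, 0) = 1$$

so white plays $a$, guaranteeing a return of at least 1 whatever black does.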
Minimax Search
Build the game tree of all possible continuations down to the leaves, where the terminal rewards live. At each node, choose a child value according to whether the player to move is maximizing or minimizing (zero-sum game: rewards that are negative for one player are good for the other). Propagate the chosen values back up to the root, which yields the optimal play policy.
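A minimal recursive sketch of exactly this procedure, assuming a generic game-state interface (`is_terminal`, `terminal_reward`, `legal_moves`, `play` are illustrative names, with rewards from the maximizing player's perspective):

```python
def minimax(state, maximizing):
    """Exhaustive minimax: recurse to the leaves where the rewards live,
    then back values up, maximizing or minimizing by whose turn it is."""
    if state.is_terminal():
        return state.terminal_reward()  # reward from white's perspective
    values = [minimax(state.play(move), not maximizing)
              for move in state.legal_moves()]
    return max(values) if maximizing else min(values)
```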
Problem: the search tree grows exponentially, making exhaustive search impractical for any realistic game of interest
Solution: build a value function approximator (aka a heuristic) to estimate minimax values: run minimax search only to a fixed depth, evaluate the frontier positions with the approximator, then back those values up to the root, as in the sketch below.
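Only the base case changes in the depth-limited version: at the cutoff, the heuristic stands in for the true minimax value (`evaluate` is a placeholder for whatever approximator is used, e.g. a hand-crafted evaluation in classic chess engines):

```python
def minimax_depth_limited(state, depth, maximizing, evaluate):
    """Search to a fixed depth, substitute the value-function approximation
    at the frontier, and back those estimates up as if they were exact."""
    if state.is_terminal():
        return state.terminal_reward()
    if depth == 0:
        return evaluate(state)  # heuristic leaf evaluation
    values = [minimax_depth_limited(state.play(move), depth - 1,
                                    not maximizing, evaluate)
              for move in state.legal_moves()]
    return max(values) if maximizing else min(values)
```

In practice this is usually combined with alpha-beta pruning, which skips branches that provably cannot affect the backed-up value.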