- MDP:
- Environment is fully observable - the current state completely characterizes the process
- Review the Markov Property:
- The state captures all the relevant information from the history: $\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \dots, S_t]$
- State transition matrix $\mathcal{P}$:
- Probability of transitioning from any state $s$ to any other state $s'$: $\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$
- Each row of $\mathcal{P}$ sums to 1 since it covers all possible transitions out of that state
Markov Process
A Markov Process (or Markov Chain) is a tuple $\langle \mathcal{S}, \mathcal{P} \rangle$
- $\mathcal{S}$ is a (finite) set of states
- $\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$
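To make the definition concrete, here is a minimal sketch that samples a trajectory from a Markov chain given $\mathcal{S}$ and $\mathcal{P}$; the three weather states, their probabilities, and the NumPy representation are illustrative assumptions, not part of the notes:

```python
import numpy as np

# Hypothetical 3-state chain; states and probabilities are made up for illustration.
states = ["Sunny", "Rainy", "Cloudy"]
P = np.array([
    [0.8, 0.1, 0.1],   # row s: P[s, s'] = Pr(S_{t+1} = s' | S_t = s)
    [0.3, 0.5, 0.2],
    [0.4, 0.3, 0.3],
])
assert np.allclose(P.sum(axis=1), 1.0)  # each row sums to 1

def sample_trajectory(P, start=0, steps=10, seed=0):
    """Sample a state sequence by repeatedly drawing S_{t+1} from row S_t of P."""
    rng = np.random.default_rng(seed)
    s, path = start, [start]
    for _ in range(steps):
        s = rng.choice(len(P), p=P[s])  # Markov property: only the current state matters
        path.append(s)
    return path

print([states[s] for s in sample_trajectory(P)])
```

The `assert` is just the row-sum property noted above: every row is a full distribution over next states.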
- Markov Reward Process: Markov process with value judgements - now we are also talking about the rewards, not just probabilities of transition.
Markov Reward Process
A Markov Reward Process is a tuple $\langle \mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma \rangle$
- $\mathcal{S}$ is a finite set of states
- $\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$
- $\mathcal{R}$ is a reward function, $\mathcal{R}_s = \mathbb{E}[R_{t+1} \mid S_t = s]$
- $\gamma$ is a discount factor, $\gamma \in [0, 1]$
- The return (or the goal): $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
- Total discounted reward from time step $t$ (rewards are given when you exit a state - hence the first subscript is $t+1$)
- Why discount:
- Keeps the return mathematically bounded (with rewards bounded by $R_{\max}$ and $\gamma < 1$, the return is at most $R_{\max}/(1-\gamma)$)
- Humans prefer immediate rewards to far-away rewards - we do hyperbolic discounting
- Discounting generalizes the undiscounted case - you can set $\gamma = 1$ if that is mathematically feasible (e.g., all sequences terminate)
- Value function:
- Expected return of the given state over all possible paths from that state, i.e., $v(s) = \mathbb{E}[G_t \mid S_t = s]$ (estimated by sampling in the sketch below)
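As a sanity check on the return and value-function definitions, here is a rough sketch that estimates $v(s)$ by averaging sampled discounted returns in a small hypothetical MRP; the transition matrix, rewards, and $\gamma$ are made up for illustration:

```python
import numpy as np

# Hypothetical MRP: a 3-state transition matrix plus an invented reward per state.
P = np.array([[0.8, 0.1, 0.1],
              [0.3, 0.5, 0.2],
              [0.4, 0.3, 0.3]])
R = np.array([1.0, -2.0, 0.5])   # R[s] = expected reward received on exiting state s
gamma = 0.9

def sampled_return(P, R, gamma, start, horizon=200, rng=None):
    """One sampled discounted return G_t = sum_k gamma^k R_{t+k+1} from `start` (truncated)."""
    if rng is None:
        rng = np.random.default_rng()
    s, G, discount = start, 0.0, 1.0
    for _ in range(horizon):
        G += discount * R[s]               # reward for exiting state s
        discount *= gamma
        s = rng.choice(len(P), p=P[s])     # step to the next state
    return G

def mc_value(P, R, gamma, start, episodes=5000, seed=0):
    """Monte Carlo estimate of v(start) = E[G_t | S_t = start]."""
    rng = np.random.default_rng(seed)
    return np.mean([sampled_return(P, R, gamma, start, rng=rng) for _ in range(episodes)])

print(mc_value(P, R, gamma, start=0))
```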
Bellman equation
Gives us a recursive definition of the value function: the value of the current state is the sum of:
- the immediate reward from this state, $R_{t+1}$
- the discounted value (times $\gamma$) of the successor state, i.e., $v(s) = \mathbb{E}[R_{t+1} + \gamma\, v(S_{t+1}) \mid S_t = s]$
which leads to the recursive Bellman equation:
$v(s) = \mathcal{R}_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}\, v(s')$
which can be expressed more concisely in vector form, with $v$ and $\mathcal{R}$ being column vectors with each entry corresponding to a state:
- $v = \mathcal{R} + \gamma \mathcal{P} v$
Solving the Bellman equation:
- It is a linear equation and can be solved directly (solution: $v = (I - \gamma \mathcal{P})^{-1} \mathcal{R}$; see the code sketch below), but this needs a matrix inversion with complexity $O(n^3)$ in the number of states $n$
- Solve using
- DP
- Monte Carlo evaluation
- Temporal Difference learning
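For small state spaces the direct solution can be computed with a single linear solve; a minimal sketch, reusing the hypothetical MRP values from the previous sketch:

```python
import numpy as np

# Hypothetical MRP from the earlier sketch.
P = np.array([[0.8, 0.1, 0.1],
              [0.3, 0.5, 0.2],
              [0.4, 0.3, 0.3]])
R = np.array([1.0, -2.0, 0.5])
gamma = 0.9

# Direct solution of v = R + gamma * P v  =>  (I - gamma * P) v = R.
# np.linalg.solve avoids forming the explicit inverse but is still O(n^3) in the number of states.
v = np.linalg.solve(np.eye(len(P)) - gamma * P, R)
print(v)  # v[s] = expected discounted return starting from state s
```

Since the rows of $\mathcal{P}$ sum to 1, $I - \gamma \mathcal{P}$ is invertible for any $\gamma < 1$, so the solve is always well defined.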
Markov Decision Process: Markov reward process + actions
Markov Decision Process
A Markov Decision Process is a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$
- $\mathcal{S}$ is a finite set of states
- $\mathcal{A}$ is a finite set of actions
- $\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}^a_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$
- $\mathcal{R}$ is a reward function, $\mathcal{R}^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$
- $\gamma$ is a discount factor, $\gamma \in [0, 1]$.
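One convenient way to represent such a finite MDP in code is one transition matrix per action plus a state-action reward table; the 2-state, 2-action numbers below are purely hypothetical and are reused in the later sketches:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (all numbers made up for illustration).
# P[a, s, s'] = Pr(S_{t+1} = s' | S_t = s, A_t = a); each row of each P[a] sums to 1.
P = np.array([
    [[0.7, 0.3], [0.2, 0.8]],   # transitions under action 0
    [[0.9, 0.1], [0.6, 0.4]],   # transitions under action 1
])
# R[s, a] = expected immediate reward for taking action a in state s.
R = np.array([[1.0, 0.0],
              [2.0, -1.0]])
gamma = 0.9

assert np.allclose(P.sum(axis=2), 1.0)  # every row of every per-action matrix sums to 1
```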
- Note that MDP state transition probabilities now depend on the action taken as well as the current state.
- Actions are usually stochastic - an agent in state $s$ takes action $a$ with probability $\pi(a \mid s)$, which brings us to the notion of a policy:
- A policy is a distribution over actions given the state (or, for a deterministic policy, a mapping from states to actions)
- Note that the policy depends on the current state and is independent of the time (in other words, it is stationary).
- MDP policies are Markovian - the action only depends on the current state, not the history.
A policy $\pi$ is a distribution over actions given states: $\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$
- Policy determines actions (given the state), which in turn determines the next state and reward.
- State transition probabilities and expected rewards are now functions of the policy - average the per-action transition probability (or reward) over the probability of taking each action in the current state (see the sketch below)
- Specifically, $\mathcal{P}^\pi_{ss'} = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, \mathcal{P}^a_{ss'}$ and $\mathcal{R}^\pi_s = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, \mathcal{R}^a_s$
- The state value function now depends on the policy: $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$.
- We also define a new kind of value function - one over the actions.
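Concretely, fixing a policy averages out the action choice and reduces the MDP to an MRP; a small sketch using the hypothetical MDP arrays above together with a uniform-random policy (an assumption for illustration):

```python
import numpy as np

# Toy MDP from the earlier sketch (hypothetical values).
P = np.array([[[0.7, 0.3], [0.2, 0.8]],
              [[0.9, 0.1], [0.6, 0.4]]])   # P[a, s, s']
R = np.array([[1.0, 0.0], [2.0, -1.0]])    # R[s, a]

# Uniform-random policy: pi[s, a] = Pr(A_t = a | S_t = s).
pi = np.full((2, 2), 0.5)

# P^pi_{ss'} = sum_a pi(a|s) P^a_{ss'}   and   R^pi_s = sum_a pi(a|s) R^a_s
P_pi = np.einsum("sa,asx->sx", pi, P)
R_pi = (pi * R).sum(axis=1)

print(P_pi)  # induced MRP transition matrix (rows still sum to 1)
print(R_pi)  # induced MRP reward vector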
Action Value Function
$q_\pi(s, a)$ is the expected return starting from state $s$, taking action $a$, and then following policy $\pi$: $q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$.
- Action value function helps us determine the best action to take given a state.
Bellman Expectation equation
Just as we did earlier for the value function in an MRP, we can decompose the state-value and action-value functions for an MDP like so:
$v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s]$
$q_\pi(s, a) = \mathbb{E}_\pi[R_{t+1} + \gamma\, q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$
We can write the action value function as
$q_\pi(s, a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'}\, v_\pi(s')$
and the state value function as
$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, q_\pi(s, a)$
These functions are now recursively defined in terms of each other. Substituting and rewriting them as their own recursive functions, we get:
$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'}\, v_\pi(s') \right)$
$q_\pi(s, a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \sum_{a' \in \mathcal{A}} \pi(a' \mid s')\, q_\pi(s', a')$
Once again, writing them in vector form using our new definitions of $\mathcal{P}^\pi$ and $\mathcal{R}^\pi$ gives:
$v_\pi = \mathcal{R}^\pi + \gamma \mathcal{P}^\pi v_\pi$
which again has a direct solution: $v_\pi = (I - \gamma \mathcal{P}^\pi)^{-1} \mathcal{R}^\pi$
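Under the same toy MDP and uniform-random policy as in the earlier sketches (illustrative assumptions), the vector-form equation can be solved directly for $v_\pi$, and $q_\pi$ recovered from it; a rough sketch:

```python
import numpy as np

# Toy MDP and uniform-random policy from the earlier sketches (hypothetical values).
P = np.array([[[0.7, 0.3], [0.2, 0.8]],
              [[0.9, 0.1], [0.6, 0.4]]])   # P[a, s, s']
R = np.array([[1.0, 0.0], [2.0, -1.0]])    # R[s, a]
pi = np.full((2, 2), 0.5)
gamma = 0.9

P_pi = np.einsum("sa,asx->sx", pi, P)      # P^pi
R_pi = (pi * R).sum(axis=1)                # R^pi

# Direct solution of v_pi = R^pi + gamma * P^pi v_pi.
v_pi = np.linalg.solve(np.eye(len(R_pi)) - gamma * P_pi, R_pi)

# q_pi(s, a) = R^a_s + gamma * sum_{s'} P^a_{ss'} v_pi(s')
q_pi = R + gamma * np.einsum("asx,x->sa", P, v_pi)

# Consistency check of the mutual recursion: v_pi(s) = sum_a pi(a|s) q_pi(s, a).
assert np.allclose(v_pi, (pi * q_pi).sum(axis=1))
print(v_pi, q_pi)
```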
- The basic idea remains that value functions are recursive: you collect the immediate reward and then repeat the same calculation over all the actions you might take from there and all the states you might arrive at.
- But does it tell us the best way to behave?
- Optimal policies:
- The optimal state value function $v_*(s) = \max_\pi v_\pi(s)$ is the maximum value over all policies. It is the maximum possible reward you can extract from the system.
- Similarly, the optimal action value function is $q_*(s, a) = \max_\pi q_\pi(s, a)$.
- Once you have these, you basically have the optimal policy, which is: in every state, take the optimal action according to the optimal action value function.
- An optimal policy can be found by maximizing over $q_*(s, a)$: $\pi_*(a \mid s) = 1$ if $a = \operatorname*{argmax}_{a \in \mathcal{A}} q_*(s, a)$, and $0$ otherwise.
- A deterministic optimal policy always exists for every MDP.
- All optimal policies extract the optimal state value from the system and achieve the optimal action value.
Bellman optimality equation
(We basically take the per-policy expectation equations, substitute $v_\pi$ with $v_*$ and $q_\pi$ with $q_*$, and replace the average over the policy with a max over actions:)
$v_*(s) = \max_a q_*(s, a)$
$q_*(s, a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'}\, v_*(s')$
which gives
$v_*(s) = \max_a \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'}\, v_*(s') \right)$
and
$q_*(s, a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \max_{a'} q_*(s', a')$
- The earlier equations were linear and could be solved as a matrix linear system because they involved a summation instead of a max. With the max, these equations are non-linear, have no closed-form solution in general, and need iterative solution methods (see the value-iteration sketch after this list). Some methods to solve them include:
- Value iteration
- Policy iteration
- Q-learning
- Sarsa
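As one example from the list above, here is a minimal value-iteration sketch on the same hypothetical toy MDP used earlier (all numbers illustrative); it repeatedly applies the Bellman optimality backup and then reads off a greedy deterministic policy:

```python
import numpy as np

# Toy MDP from the earlier sketches: P[a, s, s'], R[s, a], discount gamma.
P = np.array([[[0.7, 0.3], [0.2, 0.8]],
              [[0.9, 0.1], [0.6, 0.4]]])
R = np.array([[1.0, 0.0], [2.0, -1.0]])
gamma = 0.9

def value_iteration(P, R, gamma, tol=1e-8):
    """Iterate v(s) <- max_a [ R^a_s + gamma * sum_s' P^a_{ss'} v(s') ] until convergence."""
    v = np.zeros(R.shape[0])
    while True:
        q = R + gamma * np.einsum("asx,x->sa", P, v)   # q[s, a] = one-step lookahead
        v_new = q.max(axis=1)                          # Bellman optimality backup
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=1)             # optimal values and greedy policy
        v = v_new

v_star, pi_star = value_iteration(P, R, gamma)
print(v_star)   # v_*(s)
print(pi_star)  # deterministic optimal action per state
```

Because the backup is a $\gamma$-contraction, the iteration converges to $v_*$ for $\gamma < 1$, and acting greedily with respect to the resulting $q_*$ gives a deterministic optimal policy, as noted above.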
- Extensions to MDPs:
- Infinite and continuous MDPs
- Partially observable MDPs (POMDPs)
- Undiscounted, average-reward MDPs