- MDP:
- Environment is fully observable - the current state completely characterizes the process
- Review the Markov Property:
- The state captures all the relevant information from the history: $\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \dots, S_t]$
- State transition matrix $\mathcal{P}$:
- Probability of transitioning from any state $s$ to any other state $s'$: $\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$
- Each row of $\mathcal{P}$ sums to 1 since it covers all possible transitions out of that state
Markov Process
A Markov Process (or Markov Chain) is a tuple $\langle \mathcal{S}, \mathcal{P} \rangle$
- $\mathcal{S}$ is a (finite) set of states
- $\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$
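To make the definition concrete, here is a minimal sketch that samples a trajectory from a Markov chain given $\mathcal{S}$ and $\mathcal{P}$; the three weather states, their probabilities, and the NumPy representation are illustrative assumptions, not part of the notes:

```python
import numpy as np

# Hypothetical 3-state chain; states and probabilities are made up for illustration.
states = ["Sunny", "Rainy", "Cloudy"]
P = np.array([
    [0.8, 0.1, 0.1],   # row s: P[s, s'] = Pr(S_{t+1} = s' | S_t = s)
    [0.3, 0.5, 0.2],
    [0.4, 0.3, 0.3],
])
assert np.allclose(P.sum(axis=1), 1.0)  # each row sums to 1

def sample_trajectory(P, start=0, steps=10, seed=0):
    """Sample a state sequence by repeatedly drawing S_{t+1} from row S_t of P."""
    rng = np.random.default_rng(seed)
    s, path = start, [start]
    for _ in range(steps):
        s = rng.choice(len(P), p=P[s])  # Markov property: only the current state matters
        path.append(s)
    return path

print([states[s] for s in sample_trajectory(P)])
```

The `assert` is just the row-sum property noted above: every row is a full distribution over next states.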
- Markov Reward Process: Markov process with value judgements - now we are also talking about the rewards, not just probabilities of transition.
Markov Reward Process
A Markov Reward Process is a tuple $\langle \mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma \rangle$
- $\mathcal{S}$ is a finite set of states
- $\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$
- $\mathcal{R}$ is a reward function, $\mathcal{R}_s = \mathbb{E}[R_{t+1} \mid S_t = s]$
- $\gamma$ is a discount factor, $\gamma \in [0, 1]$
- The return (or the goal): $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
- Total discounted reward from time step $t$ (rewards are given when you exit a state - hence the first subscript is $t+1$)
- Why discount:
- Keeps the return mathematically bounded (with rewards bounded by $R_{\max}$ and $\gamma < 1$, the return is at most $R_{\max}/(1-\gamma)$)
- Humans prefer immediate rewards to far-away rewards - we do hyperbolic discounting
- Discounting generalizes the undiscounted case - you can set $\gamma = 1$ if that is mathematically feasible (e.g., all sequences terminate)
- Value function:
- Expected return of the given state over all possible paths from that state, i.e., $v(s) = \mathbb{E}[G_t \mid S_t = s]$ (estimated by sampling in the sketch below)
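As a sanity check on the return and value-function definitions, here is a rough sketch that estimates $v(s)$ by averaging sampled discounted returns in a small hypothetical MRP; the transition matrix, rewards, and $\gamma$ are made up for illustration:

```python
import numpy as np

# Hypothetical MRP: a 3-state transition matrix plus an invented reward per state.
P = np.array([[0.8, 0.1, 0.1],
              [0.3, 0.5, 0.2],
              [0.4, 0.3, 0.3]])
R = np.array([1.0, -2.0, 0.5])   # R[s] = expected reward received on exiting state s
gamma = 0.9

def sampled_return(P, R, gamma, start, horizon=200, rng=None):
    """One sampled discounted return G_t = sum_k gamma^k R_{t+k+1} from `start` (truncated)."""
    if rng is None:
        rng = np.random.default_rng()
    s, G, discount = start, 0.0, 1.0
    for _ in range(horizon):
        G += discount * R[s]               # reward for exiting state s
        discount *= gamma
        s = rng.choice(len(P), p=P[s])     # step to the next state
    return G

def mc_value(P, R, gamma, start, episodes=5000, seed=0):
    """Monte Carlo estimate of v(start) = E[G_t | S_t = start]."""
    rng = np.random.default_rng(seed)
    return np.mean([sampled_return(P, R, gamma, start, rng=rng) for _ in range(episodes)])

print(mc_value(P, R, gamma, start=0))
```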
Bellman equation
Gives us a recursive definition of the value function: the value of the current state is the sum of:
- the immediate reward from this state, $R_{t+1}$
- the discounted value (times $\gamma$) of the successor state, i.e., $v(s) = \mathbb{E}[R_{t+1} + \gamma\, v(S_{t+1}) \mid S_t = s]$
which leads to the recursive Bellman equation:
$v(s) = \mathcal{R}_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}\, v(s')$
which can be expressed more concisely in vector form, with $v$ and $\mathcal{R}$ being column vectors with each entry corresponding to a state:
- $v = \mathcal{R} + \gamma \mathcal{P} v$
Solving the Bellman equation:
- It is a linear equation and can be solved directly (solution: $v = (I - \gamma \mathcal{P})^{-1} \mathcal{R}$; see the code sketch below), but this needs a matrix inversion with complexity $O(n^3)$ in the number of states $n$
- Solve using
- DP
- Monte Carlo evaluation
- Temporal Difference learning
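For small state spaces the direct solution can be computed with a single linear solve; a minimal sketch, reusing the hypothetical MRP values from the previous sketch:

```python
import numpy as np

# Hypothetical MRP from the earlier sketch.
P = np.array([[0.8, 0.1, 0.1],
              [0.3, 0.5, 0.2],
              [0.4, 0.3, 0.3]])
R = np.array([1.0, -2.0, 0.5])
gamma = 0.9

# Direct solution of v = R + gamma * P v  =>  (I - gamma * P) v = R.
# np.linalg.solve avoids forming the explicit inverse but is still O(n^3) in the number of states.
v = np.linalg.solve(np.eye(len(P)) - gamma * P, R)
print(v)  # v[s] = expected discounted return starting from state s
```

Since the rows of $\mathcal{P}$ sum to 1, $I - \gamma \mathcal{P}$ is invertible for any $\gamma < 1$, so the solve is always well defined.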
Markov Decision Process: Markov reward process + actions
Markov Decision Process
A Markov Decision Process is a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$
- $\mathcal{S}$ is a finite set of states
- $\mathcal{A}$ is a finite set of actions
- $\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}^a_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$
- $\mathcal{R}$ is a reward function, $\mathcal{R}^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$
- $\gamma$ is a discount factor, $\gamma \in [0, 1]$.
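One convenient way to represent such a finite MDP in code is one transition matrix per action plus a state-action reward table; the 2-state, 2-action numbers below are purely hypothetical and are reused in the later sketches:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (all numbers made up for illustration).
# P[a, s, s'] = Pr(S_{t+1} = s' | S_t = s, A_t = a); each row of each P[a] sums to 1.
P = np.array([
    [[0.7, 0.3], [0.2, 0.8]],   # transitions under action 0
    [[0.9, 0.1], [0.6, 0.4]],   # transitions under action 1
])
# R[s, a] = expected immediate reward for taking action a in state s.
R = np.array([[1.0, 0.0],
              [2.0, -1.0]])
gamma = 0.9

assert np.allclose(P.sum(axis=2), 1.0)  # every row of every per-action matrix sums to 1
```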
- Note that MDP state transition probabilities now depend on the action taken as well as the current state.
- Actions are usually stochastic - an agent in state $s$ takes action $a$ with probability $\pi(a \mid s)$, which brings us to the notion of a policy:
- A policy is a distribution over actions given the state (or, for a deterministic policy, a mapping from states to actions)
- Note that the policy depends on the current state and is independent of the time (in other words, it is stationary).
- MDP policies are Markovian - the action only depends on the current state, not the history.
A policy $\pi$ is a distribution over actions given states: $\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$
- Policy determines actions (given the state), which in turn determines the next state and reward.
- State transition probabilities and expected rewards are now functions of the policy - average the per-action transition probability (or reward) over the probability of taking each action in the current state (see the sketch below)
- Specifically, $\mathcal{P}^\pi_{ss'} = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, \mathcal{P}^a_{ss'}$ and $\mathcal{R}^\pi_s = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, \mathcal{R}^a_s$
- The state value function now depends on the policy: $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$.
- We also define a new kind of value function - one over the actions.
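Concretely, fixing a policy averages out the action choice and reduces the MDP to an MRP; a small sketch using the hypothetical MDP arrays above together with a uniform-random policy (an assumption for illustration):

```python
import numpy as np

# Toy MDP from the earlier sketch (hypothetical values).
P = np.array([[[0.7, 0.3], [0.2, 0.8]],
              [[0.9, 0.1], [0.6, 0.4]]])   # P[a, s, s']
R = np.array([[1.0, 0.0], [2.0, -1.0]])    # R[s, a]

# Uniform-random policy: pi[s, a] = Pr(A_t = a | S_t = s).
pi = np.full((2, 2), 0.5)

# P^pi_{ss'} = sum_a pi(a|s) P^a_{ss'}   and   R^pi_s = sum_a pi(a|s) R^a_s
P_pi = np.einsum("sa,asx->sx", pi, P)
R_pi = (pi * R).sum(axis=1)

print(P_pi)  # induced MRP transition matrix (rows still sum to 1)
print(R_pi)  # induced MRP reward vector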
Action Value Function
$q_\pi(s, a)$ is the expected return starting from state $s$, taking action $a$, and then following policy $\pi$: $q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$.
- Action value function helps us determine the best action to take given a state.
Bellman Expectation equation
Just as we did earlier for the value function in an MRP, we can decompose the state-value and action-value functions for an MDP like so:
$v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s]$
$q_\pi(s, a) = \mathbb{E}_\pi[R_{t+1} + \gamma\, q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$
We can write the action value function as
$q_\pi(s, a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'}\, v_\pi(s')$
and the state value function as
$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, q_\pi(s, a)$
These functions are now recursively defined in terms of each other. Substituting and rewriting them as their own recursive functions, we get:
$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'}\, v_\pi(s') \right)$
$q_\pi(s, a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \sum_{a' \in \mathcal{A}} \pi(a' \mid s')\, q_\pi(s', a')$
Once again, writing them in vector form using our new definitions of $\mathcal{P}^\pi$ and $\mathcal{R}^\pi$ gives:
$v_\pi = \mathcal{R}^\pi + \gamma \mathcal{P}^\pi v_\pi$
which again has a direct solution: $v_\pi = (I - \gamma \mathcal{P}^\pi)^{-1} \mathcal{R}^\pi$
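Under the same toy MDP and uniform-random policy as in the earlier sketches (illustrative assumptions), the vector-form equation can be solved directly for $v_\pi$, and $q_\pi$ recovered from it; a rough sketch:

```python
import numpy as np

# Toy MDP and uniform-random policy from the earlier sketches (hypothetical values).
P = np.array([[[0.7, 0.3], [0.2, 0.8]],
              [[0.9, 0.1], [0.6, 0.4]]])   # P[a, s, s']
R = np.array([[1.0, 0.0], [2.0, -1.0]])    # R[s, a]
pi = np.full((2, 2), 0.5)
gamma = 0.9

P_pi = np.einsum("sa,asx->sx", pi, P)      # P^pi
R_pi = (pi * R).sum(axis=1)                # R^pi

# Direct solution of v_pi = R^pi + gamma * P^pi v_pi.
v_pi = np.linalg.solve(np.eye(len(R_pi)) - gamma * P_pi, R_pi)

# q_pi(s, a) = R^a_s + gamma * sum_{s'} P^a_{ss'} v_pi(s')
q_pi = R + gamma * np.einsum("asx,x->sa", P, v_pi)

# Consistency check of the mutual recursion: v_pi(s) = sum_a pi(a|s) q_pi(s, a).
assert np.allclose(v_pi, (pi * q_pi).sum(axis=1))
print(v_pi, q_pi)
```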
- The basic idea remains that value functions are recursive: you collect the immediate reward and then repeat the same calculation over all the actions you might take from there and all the states you might arrive at.
- But does it tell us the best way to behave?
- Optimal policies:
- The optimal state value function $v_*(s) = \max_\pi v_\pi(s)$ is the maximum value over all policies. It is the maximum possible reward you can extract from the system.
- Similarly, the optimal action value function is $q_*(s, a) = \max_\pi q_\pi(s, a)$.
- Once you have these, you basically have the optimal policy, which is: in every state, take the optimal action according to the optimal action value function.
- An optimal policy can be found by maximizing over $q_*(s, a)$: $\pi_*(a \mid s) = 1$ if $a = \operatorname*{argmax}_{a \in \mathcal{A}} q_*(s, a)$, and $0$ otherwise.
- A deterministic optimal policy always exists for every MDP.
- All optimal policies extract the optimal state value from the system and achieve the optimal action value.
Bellman optimality equation
(We basically take the per-policy expectation equations, substitute $v_\pi$ with $v_*$ and $q_\pi$ with $q_*$, and replace the average over the policy with a max over actions:)
$v_*(s) = \max_a q_*(s, a)$
$q_*(s, a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'}\, v_*(s')$
which gives
$v_*(s) = \max_a \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'}\, v_*(s') \right)$
and
$q_*(s, a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \max_{a'} q_*(s', a')$
- The earlier equations were linear and could be solved as a matrix linear system because they involved a summation instead of a max. With the max, these equations are non-linear, have no closed-form solution in general, and need iterative solution methods (see the value-iteration sketch after this list). Some methods to solve them include:
- Value iteration
- Policy iteration
- Q-learning
- Sarsa
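As one example from the list above, here is a minimal value-iteration sketch on the same hypothetical toy MDP used earlier (all numbers illustrative); it repeatedly applies the Bellman optimality backup and then reads off a greedy deterministic policy:

```python
import numpy as np

# Toy MDP from the earlier sketches: P[a, s, s'], R[s, a], discount gamma.
P = np.array([[[0.7, 0.3], [0.2, 0.8]],
              [[0.9, 0.1], [0.6, 0.4]]])
R = np.array([[1.0, 0.0], [2.0, -1.0]])
gamma = 0.9

def value_iteration(P, R, gamma, tol=1e-8):
    """Iterate v(s) <- max_a [ R^a_s + gamma * sum_s' P^a_{ss'} v(s') ] until convergence."""
    v = np.zeros(R.shape[0])
    while True:
        q = R + gamma * np.einsum("asx,x->sa", P, v)   # q[s, a] = one-step lookahead
        v_new = q.max(axis=1)                          # Bellman optimality backup
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=1)             # optimal values and greedy policy
        v = v_new

v_star, pi_star = value_iteration(P, R, gamma)
print(v_star)   # v_*(s)
print(pi_star)  # deterministic optimal action per state
```

Because the backup is a $\gamma$-contraction, the iteration converges to $v_*$ for $\gamma < 1$, and acting greedily with respect to the resulting $q_*$ gives a deterministic optimal policy, as noted above.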
- Extensions to MDPs:
- Infinite and continuous MDPs
- Partially observable MDPs (POMDPs)
- Undiscounted, average-reward MDPs