- Reinforcement learning:
- No supervisor - only reward signal
    - Delayed feedback - not instant: the agent makes a decision now and may only find out later whether it was good or bad
    - Time matters - data is not i.i.d. but sequential (a time series)
    - The agent's actions change the data it receives at later timesteps
- Reward:
- scalar feedback
    - like a loss with the sign flipped - it indicates how well (rather than how badly) the agent is doing
    - Reward hypothesis: all goals can be described as the maximization of expected cumulative reward, so the agent's objective is to select actions that maximize it
- Greedy may not work - taking a smaller reward now might lead to greater rewards later
- At each step (see the loop sketch after this list):
- The agent:
- Take an action
- Receive an observation
- Get a scalar reward
- The environment:
        - Receives the action
        - Emits an observation
        - Emits a scalar reward
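A bare-bones sketch of that interaction loop. This is my own illustration: the `agent`/`env` objects and their `act`, `reset`, and `step` methods are assumed interfaces, not anything defined in these notes.

```python
def run_episode(agent, env, max_steps=1000):
    """Run one episode of the agent-environment loop described above.
    Assumes `env.reset()` returns an initial observation and `env.step(action)`
    returns (observation, reward, done); `agent.act(...)` returns an action."""
    observation, reward = env.reset(), 0.0
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(observation, reward)        # agent: take an action
        observation, reward, done = env.step(action)   # env: give observation and scalar reward
        total_reward += reward
        if done:
            break
    return total_reward
```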
- History:
    - The sequence of observations, actions, and rewards seen so far: H_t = O_1, R_1, A_1, ..., A_{t-1}, O_t, R_t
- State:
    - An internal representation of all the information about the world that determines what happens next
    - It's a function of the history: S_t = f(H_t)
    - Environment state:
        - Private - not available to the agent
        - Completely determines what happens next - the next observation and reward
    - Agent state:
        - The agent's model of the environment's state
        - Determines how the agent chooses its actions
Information State
An information state contains all useful information from the history.
- An information state is Markov, i.e., the current state captures all the information needed to decide what to do next (formal definition below)
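In the usual notation (not spelled out in these notes), the Markov property is:

```latex
% A state S_t is Markov if and only if the future is independent of the past given S_t
\mathbb{P}\left[S_{t+1} \mid S_t\right] = \mathbb{P}\left[S_{t+1} \mid S_1, \ldots, S_t\right]
```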
- Observability of the environment:
    - Fully observable: the agent directly observes the environment state, i.e. O_t = S_t^a = S_t^e (observation = agent state = environment state)
- Formally, this is Markov Decision Process or MDP.
- Partially observable: Agent doesn’t observe the entire environment state - only some part of it - just a camera input or just the public cards or just part of the road in a 1m radius circle
- This is Partially Observable MDP or POMDP
        - Agent must construct its own state representation of the world, e.g., from the history of observations (a toy sketch follows this list)
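A toy illustration of constructing an agent state under partial observability - here simply the last k observations. This is my own sketch; recurrent networks or beliefs over states are common alternatives.

```python
from collections import deque

class LastKObservations:
    """Builds an agent state for a partially observed environment by
    keeping only the most recent k observations (one simple heuristic)."""

    def __init__(self, k: int):
        self.buffer = deque(maxlen=k)

    def update(self, observation):
        """Fold in the newest observation and return the current agent state."""
        self.buffer.append(observation)
        return tuple(self.buffer)  # hashable, so it can index a table of values
```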
- Some Taxonomy:
- Policy: how the agent picks its actions
    - Value function: how good each state is - the expected total future reward starting from that state
- Model: Agent’s representation of the environment
- Policy:
- Determines agent’s behavior
- Map from state to action
    - Deterministic: a = π(s)
    - Stochastic: π(a|s) = P[A_t = a | S_t = s], i.e., a distribution over actions at each given state (toy example below)
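A minimal sketch of both kinds of policy over made-up states and actions (the names are illustrative only, not from the lecture):

```python
import random

# Deterministic policy: a plain lookup, a = pi(s)
deterministic_pi = {"s0": "left", "s1": "right"}

# Stochastic policy: a distribution over actions for each state, pi(a|s)
stochastic_pi = {
    "s0": {"left": 0.8, "right": 0.2},
    "s1": {"left": 0.1, "right": 0.9},
}

def sample_action(state):
    """Sample an action from the stochastic policy at the given state."""
    actions, probs = zip(*stochastic_pi[state].items())
    return random.choices(actions, weights=probs, k=1)[0]
```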
- Value function:
- Prediction of expected total future reward of a state
- How good is a state
    - The value of a state depends on how the agent behaves, so the value function is defined with respect to the agent's policy, written v_π (see below)
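Written out in the standard form (the discount factor γ is not introduced in these notes; it weights later rewards less):

```latex
v_{\pi}(s) = \mathbb{E}_{\pi}\!\left[\, R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \cdots \;\middle|\; S_t = s \,\right]
```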
- Model:
- Of the environment
    - Predicts what the environment will do - what observation and reward we will get if we take an action
    - So you can have two models (written in standard notation after this list):
- A state transition model to determine what state you’ll be in
- A reward model to predict the reward you get when you take an action in a certain state
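In the standard MDP notation, the two parts of the model are:

```latex
% State-transition model: probability of moving to state s' from s under action a
\mathcal{P}^{a}_{ss'} = \mathbb{P}\left[ S_{t+1} = s' \mid S_t = s,\, A_t = a \right]

% Reward model: expected immediate reward for taking action a in state s
\mathcal{R}^{a}_{s} = \mathbb{E}\left[ R_{t+1} \mid S_t = s,\, A_t = a \right]
```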
- Kinds of RL agents:
    - Value based: learns a value function; the policy is implicit - act greedily with respect to the values (pick the action leading to the highest-value state)
    - Policy based: learns an explicit policy for how to behave in each state, without necessarily learning a value function
    - Actor Critic: learns both a policy (the actor) and a value function (the critic)
Model Free vs Model Based
- Model Free:
    - Learns a policy and/or a value function directly from experience
    - No explicit model of how the environment works (see the tabular sketch below)
- Model Based:
    - First builds a model of how the environment works, then uses it to derive a policy and/or value function
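For concreteness, a minimal tabular sketch of a value-based, model-free agent. The specific update rule used here (one-step Q-learning) is not covered in this section; it is only meant to show a value table with an implicit greedy policy and no model of the environment.

```python
import random
from collections import defaultdict

class TabularValueAgent:
    """Value-based and model-free: keeps a table of action values, derives an
    implicit (epsilon-greedy) policy from it, and never models the environment."""

    def __init__(self, actions, alpha=0.1, gamma=0.99, epsilon=0.1):
        self.actions = list(actions)
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.q = defaultdict(float)  # (state, action) -> estimated value

    def act(self, state):
        # Mostly exploit the current value estimates, occasionally explore.
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        # One-step temporal-difference update toward reward + discounted best next value.
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])
```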
- Learning and Planning:
- RL: environment is unknown, we interact with it, and improve our policy
- Planning: A model of the environment is known, we perform computations on the model without interaction and improve a policy
- Exploitation vs Exploration:
- Exploration: give up some reward you know you’ll get to find out more about the environment
- Exploitation: Use what you already know to maximize reward
- Prediction vs Control
- Prediction: Evaluate the future (reward) given a policy
- Control: Find the optimal policy to get the most value
    - Typically need to solve prediction as a sub-problem in order to solve control (in symbols below)
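In symbols (standard definitions, reusing v_π from above): prediction evaluates a fixed policy π by computing v_π, while control searches over policies for the best achievable value and a policy π* that attains it:

```latex
v_{*}(s) = \max_{\pi} v_{\pi}(s)
```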