- Reinforcement learning:
- No supervisor - only reward signal
    - Delayed feedback - not instant: the agent makes a decision now and may only find out later whether it was good or bad
    - Time matters - data is not i.i.d. but sequential (a time series)
    - The agent's actions change the data it receives at later timesteps
- Reward:
- scalar feedback
    - like a loss with the sign flipped - it indicates how well (rather than how badly) the agent is doing
    - Reward hypothesis: all goals can be described as the maximization of expected cumulative reward, so the agent's objective is to select actions that maximize it
- Greedy may not work - taking a smaller reward now might lead to greater rewards later
- At each step (see the loop sketch after this list):
- The agent:
- Take an action
- Receive an observation
- Get a scalar reward
- The environment:
        - Receives the action
        - Emits an observation
        - Emits a scalar reward
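A bare-bones sketch of that interaction loop. This is my own illustration: the `agent`/`env` objects and their `act`, `reset`, and `step` methods are assumed interfaces, not anything defined in these notes.

```python
def run_episode(agent, env, max_steps=1000):
    """Run one episode of the agent-environment loop described above.
    Assumes `env.reset()` returns an initial observation and `env.step(action)`
    returns (observation, reward, done); `agent.act(...)` returns an action."""
    observation, reward = env.reset(), 0.0
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(observation, reward)        # agent: take an action
        observation, reward, done = env.step(action)   # env: give observation and scalar reward
        total_reward += reward
        if done:
            break
    return total_reward
```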
- History:
    - The sequence of observations, actions, and rewards seen so far: H_t = O_1, R_1, A_1, ..., A_{t-1}, O_t, R_t
- State:
    - An internal representation of all the information about the world that determines what happens next
    - It's a function of the history: S_t = f(H_t)
    - Environment state:
        - Private - not available to the agent
        - Completely determines what happens next - the next observation and reward
    - Agent state:
        - The agent's model of the environment's state
        - Determines how the agent chooses its actions
Information State
An information state contains all useful information from the history.
- An information state is Markov, i.e., the current state captures all the information needed to decide what to do next (formal definition below)
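In the usual notation (not spelled out in these notes), the Markov property is:

```latex
% A state S_t is Markov if and only if the future is independent of the past given S_t
\mathbb{P}\left[S_{t+1} \mid S_t\right] = \mathbb{P}\left[S_{t+1} \mid S_1, \ldots, S_t\right]
```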
- Observability of the environment:
    - Fully observable: the agent directly observes the environment state, i.e. O_t = S_t^a = S_t^e (observation = agent state = environment state)
- Formally, this is Markov Decision Process or MDP.
- Partially observable: Agent doesn’t observe the entire environment state - only some part of it - just a camera input or just the public cards or just part of the road in a 1m radius circle
- This is Partially Observable MDP or POMDP
        - Agent must construct its own state representation of the world, e.g., from the history of observations (a toy sketch follows this list)
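A toy illustration of constructing an agent state under partial observability - here simply the last k observations. This is my own sketch; recurrent networks or beliefs over states are common alternatives.

```python
from collections import deque

class LastKObservations:
    """Builds an agent state for a partially observed environment by
    keeping only the most recent k observations (one simple heuristic)."""

    def __init__(self, k: int):
        self.buffer = deque(maxlen=k)

    def update(self, observation):
        """Fold in the newest observation and return the current agent state."""
        self.buffer.append(observation)
        return tuple(self.buffer)  # hashable, so it can index a table of values
```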
- Some Taxonomy:
- Policy: how the agent picks its actions
    - Value function: how good each state is - the expected total future reward starting from that state
- Model: Agent’s representation of the environment
- Policy:
- Determines agent’s behavior
- Map from state to action
    - Deterministic: a = π(s)
    - Stochastic: π(a|s) = P[A_t = a | S_t = s], i.e., a distribution over actions at each given state (toy example below)
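A minimal sketch of both kinds of policy over made-up states and actions (the names are illustrative only, not from the lecture):

```python
import random

# Deterministic policy: a plain lookup, a = pi(s)
deterministic_pi = {"s0": "left", "s1": "right"}

# Stochastic policy: a distribution over actions for each state, pi(a|s)
stochastic_pi = {
    "s0": {"left": 0.8, "right": 0.2},
    "s1": {"left": 0.1, "right": 0.9},
}

def sample_action(state):
    """Sample an action from the stochastic policy at the given state."""
    actions, probs = zip(*stochastic_pi[state].items())
    return random.choices(actions, weights=probs, k=1)[0]
```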
- Value function:
- Prediction of expected total future reward of a state
- How good is a state
    - The value of a state depends on how the agent behaves, so the value function is defined with respect to the agent's policy, written v_π (see below)
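Written out in the standard form (the discount factor γ is not introduced in these notes; it weights later rewards less):

```latex
v_{\pi}(s) = \mathbb{E}_{\pi}\!\left[\, R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \cdots \;\middle|\; S_t = s \,\right]
```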
- Model:
- Of the environment
    - Predicts what the environment will do - what observation and reward we will get if we take an action
    - So you can have two models (written in standard notation after this list):
- A state transition model to determine what state you’ll be in
- A reward model to predict the reward you get when you take an action in a certain state
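In the standard MDP notation, the two parts of the model are:

```latex
% State-transition model: probability of moving to state s' from s under action a
\mathcal{P}^{a}_{ss'} = \mathbb{P}\left[ S_{t+1} = s' \mid S_t = s,\, A_t = a \right]

% Reward model: expected immediate reward for taking action a in state s
\mathcal{R}^{a}_{s} = \mathbb{E}\left[ R_{t+1} \mid S_t = s,\, A_t = a \right]
```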
- Kinds of RL agents:
    - Value based: learns a value function; the policy is implicit - act greedily with respect to the values (pick the action leading to the highest-value state)
    - Policy based: learns an explicit policy for how to behave in each state, without necessarily learning a value function
    - Actor Critic: learns both a policy (the actor) and a value function (the critic)
Model Free vs Model Based
- Model Free:
    - Learns a policy and/or a value function directly from experience
    - No explicit model of how the environment works (see the tabular sketch below)
- Model Based:
    - First builds a model of how the environment works, then uses it to derive a policy and/or value function
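For concreteness, a minimal tabular sketch of a value-based, model-free agent. The specific update rule used here (one-step Q-learning) is not covered in this section; it is only meant to show a value table with an implicit greedy policy and no model of the environment.

```python
import random
from collections import defaultdict

class TabularValueAgent:
    """Value-based and model-free: keeps a table of action values, derives an
    implicit (epsilon-greedy) policy from it, and never models the environment."""

    def __init__(self, actions, alpha=0.1, gamma=0.99, epsilon=0.1):
        self.actions = list(actions)
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.q = defaultdict(float)  # (state, action) -> estimated value

    def act(self, state):
        # Mostly exploit the current value estimates, occasionally explore.
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        # One-step temporal-difference update toward reward + discounted best next value.
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])
```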
- Learning and Planning:
- RL: environment is unknown, we interact with it, and improve our policy
- Planning: A model of the environment is known, we perform computations on the model without interaction and improve a policy
- Exploitation vs Exploration:
- Exploration: give up some reward you know you’ll get to find out more about the environment
- Exploitation: Use what you already know to maximize reward
- Prediction vs Control
- Prediction: Evaluate the future (reward) given a policy
- Control: Find the optimal policy to get the most value
    - Typically need to solve prediction as a sub-problem in order to solve control (in symbols below)
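In symbols (standard definitions, reusing v_π from above): prediction evaluates a fixed policy π by computing v_π, while control searches over policies for the best achievable value and a policy π* that attains it:

```latex
v_{*}(s) = \max_{\pi} v_{\pi}(s)
```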