• Reinforcement learning:
    • No supervisor - only reward signal
    • Delayed feedback - not instantaneous. The agent makes a decision and may only find out later whether it was good or bad
    • Time matters - data is not iid but sequential - time series
    • Agent’s actions change data it receives in the next timesteps
  • Reward:
    • scalar feedback
    • analogous to a (negative) loss - indicates how well, rather than how badly, the agent is doing
    • Reward hypothesis: all goals can be described as maximization of expected cumulative reward, so the agent's job is to select actions that maximize cumulative reward
    • Greedy behavior may not work - taking a smaller reward now might lead to greater rewards later
  • At each step (a toy interaction loop is sketched in code after this list):
    • The agent:
      • Takes an action
      • Receives an observation
      • Gets a scalar reward
    • The environment:
      • Receives the action
      • Emits an observation
      • Emits a scalar reward
  • History:
    • The sequence of observations, actions, and rewards up to time t: H_t = O_1, R_1, A_1, ..., A_{t-1}, O_t, R_t
  • State:
    • Internal representation of all the information about the world that determines what happens next
    • It's a function of the history: S_t = f(H_t)
    • Environment state
      • Private - not available to agent
      • Completely determines what happens next - the observation and the reward
    • Agent state:
      • The agent's own representation of the environment's state (some function of the history)
      • determines how the agent chooses its actions
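
A minimal sketch of this loop, assuming a toy Environment with a private state and an Agent whose state is a function of its history (class names, states, and rewards here are invented for illustration, not from the lecture):

```python
import random

class Environment:
    """Toy environment: its internal state is private to it;
    the agent only ever sees an observation and a scalar reward."""
    def __init__(self):
        self.state = 0  # environment state, hidden from the agent

    def step(self, action):
        # Receive an action, update the private state,
        # then emit an observation and a scalar reward.
        self.state += action
        observation = self.state > 0           # only a partial view of the state
        reward = 1.0 if self.state == 3 else 0.0
        return observation, reward

class Agent:
    """Keeps a history and derives its own agent state from it."""
    def __init__(self):
        self.history = []                      # (action, observation, reward) tuples

    def agent_state(self):
        # Agent state S_t = f(H_t); here simply the last few history entries.
        return tuple(self.history[-3:])

    def act(self):
        # Choose an action from the agent state (uniformly random for the sketch).
        return random.choice([-1, +1])

env, agent = Environment(), Agent()
for t in range(10):
    a = agent.act()                  # agent takes an action
    o, r = env.step(a)               # environment receives it, gives observation + reward
    agent.history.append((a, o, r))  # history grows; agent state is derived from it
    print(t, a, o, r)
```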

Information State

An information state contains all useful information from the history.

  • Information states are Markov, i.e., P[S_{t+1} | S_t] = P[S_{t+1} | S_1, ..., S_t]: the state captures all the information needed to make the next decision
  • Observability of the environment:
    • Fully observable: the agent directly observes the environment state, i.e., O_t = S_t^a = S_t^e
      • Formally, this is a Markov Decision Process (MDP).
    • Partially observable: the agent observes only part of the environment state - e.g., only a camera input, only the public cards, or only the road within a small radius
      • This is a Partially Observable MDP (POMDP)
      • Agent must construct its own state representation of the world.
  • Some Taxonomy:
    • Policy: how the agent picks its actions
    • Value function: how good is each state - overall expected reward from a state till the end of time
    • Model: Agent’s representation of the environment
  • Policy:
    • Determines agent’s behavior
    • Map from state to action
    • Deterministic: a = π(s)
    • Stochastic: π(a|s) = P[A_t = a | S_t = s], i.e., a distribution over actions at each given state (both forms are sketched in code after this list)
  • Value function:
    • Prediction of expected total future reward of a state
    • How good is a state
    • The value function depends on how the agent behaves, so it is defined with respect to the agent's policy (a lookup-table sketch of a value function and a model follows this list)
  • Model:
    • Of the environment
    • Predicts what the environment will do - what observation and reward we will get if we take an action
    • So a model typically has two parts:
      • A state-transition model to predict what state you'll be in
      • A reward model to predict the reward you get when you take an action in a certain state
  • Kinds of RL agents:
    • Value based: learn a value function - the policy is implicit (act greedily, picking the action that leads to the highest-value state)
    • Policy based: represent the policy directly - how to behave in each state - without an explicit value function
    • Actor-Critic: builds both a policy and a value function
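
A sketch of the two policy forms as plain Python lookup tables (the states "s1"/"s2", actions, and probabilities are invented for illustration):

```python
import random

# Deterministic policy: a direct map from state to action, a = pi(s).
deterministic_pi = {"s1": "left", "s2": "right"}

def act_deterministic(state):
    return deterministic_pi[state]

# Stochastic policy: a distribution over actions for each state,
# pi(a | s) = P[A_t = a | S_t = s].
stochastic_pi = {
    "s1": {"left": 0.8, "right": 0.2},
    "s2": {"left": 0.1, "right": 0.9},
}

def act_stochastic(state):
    actions, probs = zip(*stochastic_pi[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(act_deterministic("s1"), act_stochastic("s2"))
```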
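
And a similarly toy sketch of a value function plus the two-part model (state-transition model and reward model); the numbers are arbitrary placeholders:

```python
# Value function: expected total future (discounted) reward from each state
# under some policy, here just stored as a lookup table v_pi(s).
value = {"s1": 0.4, "s2": 1.7}  # illustrative numbers only

# Model of the environment, split into two parts:
#   a state-transition model P(s' | s, a) and a reward model R(s, a).
transition_model = {
    ("s1", "right"): {"s2": 0.9, "s1": 0.1},
    ("s2", "right"): {"s2": 1.0},
}
reward_model = {
    ("s1", "right"): 0.0,
    ("s2", "right"): 1.0,
}

def one_step_lookahead(state, action, gamma=0.9):
    # Immediate predicted reward plus discounted value of where the model
    # says we end up - the kind of quantity a model-based agent plans with.
    r = reward_model[(state, action)]
    return r + gamma * sum(p * value[s2]
                           for s2, p in transition_model[(state, action)].items())

print(one_step_lookahead("s1", "right"))
```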

Model Free vs Model Based

  • Model Free:
    • Learns a policy and/or a value function directly from experience
    • No direct modeling of how the environment works
  • Model Based:
    • First build a model of how the environment works, then derive a policy or value function by planning with that model
  • Learning and Planning:
    • RL: environment is unknown, we interact with it, and improve our policy
    • Planning: A model of the environment is known, we perform computations on the model without interaction and improve a policy
  • Exploitation vs Exploration (an ε-greedy sketch follows this list):
    • Exploration: give up some reward you know you’ll get to find out more about the environment
    • Exploitation: Use what you already know to maximize reward
  • Prediction vs Control
    • Prediction: Evaluate the future (reward) given a policy
    • Control: Find the optimal policy to get the most value
    • Typically need to solve the prediction problem in order to solve control (a small policy-evaluation sketch follows)
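
One common (though not the only) way to trade off exploration and exploitation is ε-greedy action selection; the action values below are invented:

```python
import random

q = {"a1": 0.2, "a2": 0.5, "a3": 0.1}   # estimated value of each action so far

def epsilon_greedy(q_values, epsilon=0.1):
    # Exploration: with probability epsilon, give up some known reward
    # and try a random action to learn more about the environment.
    if random.random() < epsilon:
        return random.choice(list(q_values))
    # Exploitation: otherwise use what we already know and act greedily.
    return max(q_values, key=q_values.get)

print(epsilon_greedy(q))
```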
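
A minimal illustration of prediction feeding into control, on an invented two-state chain: iterative policy evaluation estimates v_pi for a fixed uniform-random policy (prediction), and one greedy step against those values is the simplest control improvement:

```python
gamma = 0.9
# Deterministic toy dynamics: P[(state, action)] = (next_state, reward).
P = {
    (0, "stay"): (0, 0.0), (0, "go"): (1, 0.0),
    (1, "stay"): (1, 1.0), (1, "go"): (0, 0.0),
}
v = {0: 0.0, 1: 0.0}

# Prediction: evaluate the uniform-random policy by iterating the
# Bellman expectation backup until the values settle.
for _ in range(100):
    v = {s: sum(0.5 * (r + gamma * v[s2])
                for a in ("stay", "go")
                for s2, r in [P[(s, a)]])
         for s in (0, 1)}

# Control (one greedy improvement step): pick, in each state, the action
# that looks best under the predicted values.
greedy = {s: max(("stay", "go"),
                 key=lambda a: P[(s, a)][1] + gamma * v[P[(s, a)][0]])
          for s in (0, 1)}
print(v, greedy)
```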