Lecture notes Reinforcement Learning
Lecture 1: Introduction
Reinforcement Learning: learning to make decisions by interaction.
RL is different from (un)supervised learning:
- Agent sequentially interacts with the environment
- Agent observes a reward and the state of the environment
- Goal-directed: maximize the cumulative reward
Learning is based on the reward hypothesis: any goal can be formalized as the outcome of
maximizing a cumulative reward.
Science and framework of learning to make decisions from interaction:
- It requires us to think about
o Time
o Consequences of actions
o Actively gathering experience
o Predicting the future
- At each step t
o Agent takes an action a_t
o Receives a reward r_t (observed by agent)
o Environment: state transitions to s_t (observed by agent)
o Goal: maximize cumulative reward
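A minimal sketch of this interaction loop in Python (the env.reset()/env.step() interface and the RandomAgent below are illustrative assumptions, not part of the lecture):

```python
import random

class RandomAgent:
    """Agent that picks an action uniformly at random."""
    def __init__(self, actions):
        self.actions = actions

    def act(self, state):
        return random.choice(self.actions)

def run_episode(env, agent, max_steps=100):
    """Run one episode and return the cumulative reward."""
    state = env.reset()                         # initial state s_0
    total_reward = 0.0
    for t in range(max_steps):
        action = agent.act(state)               # agent takes action a_t
        state, reward, done = env.step(action)  # environment returns reward and next state
        total_reward += reward                  # goal: maximize this cumulative sum
        if done:
            break
    return total_reward
```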
A policy defines the agent’s way of behaving: π(a|s).
Deterministic: in each state, the policy always selects the same action.
Stochastic: the policy gives a probability distribution π(a|s) over actions, so different actions can be selected in the same state.
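A small illustration of the two cases (state and action names below are made up):

```python
import random

# Deterministic policy: each state maps to exactly one action.
deterministic_policy = {"s0": "left", "s1": "right"}

# Stochastic policy: each state maps to a distribution over actions, i.e. pi(a|s).
stochastic_policy = {
    "s0": {"left": 0.9, "right": 0.1},
    "s1": {"left": 0.5, "right": 0.5},
}

def sample_action(policy, state):
    """Draw an action according to pi(a|s)."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(deterministic_policy["s0"])               # always "left"
print(sample_action(stochastic_policy, "s0"))   # "left" about 90% of the time
```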
Value function: v_π(s)
- Describes how good a state is
- Depends on the state
- Depends on the policy
- Expected total reward!
A state might always yield a low immediate reward, but still have a high value because it is
regularly followed by other states that yield high rewards.
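A toy numerical illustration of this point (the chain, its states and its rewards are made up for this example):

```python
gamma = 0.9
reward = {"low": 0.0, "high": 10.0}          # immediate reward for being in each state
next_state = {"low": "high", "high": "end"}  # deterministic transitions; "end" is terminal

def value(state):
    """Discounted return collected from 'state' until the terminal state."""
    v, discount = 0.0, 1.0
    while state != "end":
        v += discount * reward[state]
        discount *= gamma
        state = next_state[state]
    return v

print(value("low"))   # 0 + 0.9 * 10 = 9.0 -> high value despite zero immediate reward
print(value("high"))  # 10.0
```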
A model describes/mimics the behaviour of the environment. A model can:
- assist in estimating value functions
- make plans for a course of action
- Model-Based RL: learns/uses a model to make decisions
- Model-Free RL: learns to make decisions solely based on interactions with the environment
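A minimal sketch of the idea that a model predicts the environment’s response and can be used for planning (the toy dynamics and function names below are assumptions for illustration only):

```python
def model(state, action):
    """Learned/known approximation of the environment: predicts next state and reward."""
    next_state = state + action                 # toy deterministic dynamics on integers
    reward = 1.0 if next_state == 0 else 0.0    # toy reward: reaching state 0 pays off
    return next_state, reward

def one_step_plan(state, actions=(-1, +1)):
    """Model-based flavour: look ahead with the model and pick the best action."""
    return max(actions, key=lambda a: model(state, a)[1])

print(one_step_plan(1))   # -1, because the model predicts that reaching state 0 yields reward
```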
History of RL
- RL stems from 3 different domains: trial-and-error, optimal control, temporal difference (TD)
o Trial-and-error
Law of effect: if an association is followed by a “satisfying state of affairs” it will be strengthened, and if it is followed by an “annoying state of affairs” it will be weakened
Clear parallel with evolution by natural selection
Selection involves no guidance or supervision: it depends only on the consequences of actions or adaptations and whether these serve the organism’s biological needs
Credit assignment: process of attributing or assigning credit or responsibility
to past actions for the rewards received at a later time
o Optimal Control
Control goal: drive a system from a state to a desired state
Objective: minimize time/energy
Optimal control generates a sequence of control inputs to realize the goal.
Model: s_{t+1} = f(s_t, u_t)
Goal: find a control policy that optimizes an objective O(s_t, u_t)
Class of methods:
Optimal return function – Bellman Equation
Dynamic programming
Markovian Decision Processes (MDPs)
o Policy Iteration Method (to solve MDPs)
o Temporal Difference Learning
Bootstrapping from existing estimates of the value function
Re-estimate the value functions little by little
Has a neural basis in the brain
Fundamental aspect of TD: the calculation of reward prediction errors (i.e., the
difference between expected and actually received rewards); see the sketch after this list.
o In 1989, modern RL begins
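To make bootstrapping and reward prediction errors concrete, here is a minimal tabular TD(0) update (variable names, step size and the example transition are illustrative, not from the lecture):

```python
from collections import defaultdict

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """Nudge V[s] toward the bootstrapped target r + gamma * V[s_next]."""
    delta = r + gamma * V[s_next] - V[s]   # reward prediction error
    V[s] += alpha * delta                  # re-estimate the value little by little
    return delta

V = defaultdict(float)                     # value estimates, initialized to 0
# One observed transition: from state "A", received reward 1.0, ended up in state "B".
print(td0_update(V, "A", 1.0, "B"))        # delta = 1.0 on the first visit
print(V["A"])                              # 0.1
```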
Lecture 2: Markovian Decision Processes
Markov Processes
- Formalization of sequential decision-making, where the choice of actions influences:
o immediate rewards
o subsequent situations/states
o subsequent rewards
- Markov property: P[S_{t+1} | S_1, …, S_t] = P[S_{t+1} | S_t]
- M = ⟨S, P⟩
o P is a transition matrix
o S is a state space (finite or infinite, we usually use finite)
- Markov Reward Process
o M = ⟨S, P, R, γ⟩
R is the reward function
γ (gamma) is the discount factor (between 0 and 1)
Trades off later rewards against earlier ones
o Rewards only depend on the current state!
- Markov Decision Processes
o M = ⟨S, A, P, R, γ⟩
A is the action space
o At each step t=1,2,3 , … , T
Agent receives some representation of the environment's state
Based on that information, agent selects an action
Action modifies system’s state and generates a reward
o p(s', r | s, a) = P[S_{t+1} = s', R_{t+1} = r | S_t = s, A_t = a]
p : S × R × S × A → [0, 1]
fully defines the dynamics of the MDP
o State-transition probabilities
p(s' | s, a) = P[S_t = s' | S_{t-1} = s, A_{t-1} = a] = Σ_{r∈R} p(s', r | s, a)
o Expected reward for state-action pairs
r(s, a) = E[R_t | S_{t-1} = s, A_{t-1} = a] = Σ_{r∈R} r Σ_{s'∈S} p(s', r | s, a)
For state-action-next-state triples: r(s, a, s') = Σ_{r∈R} r · p(s', r | s, a) / p(s' | s, a)
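A sketch of how the four-argument dynamics p(s', r | s, a) determine the derived quantities above, for a made-up two-state MDP (all states, actions and numbers are illustrative):

```python
# p[(s, a)] maps (next_state, reward) -> probability
p = {
    ("s0", "go"): {("s0", 0.0): 0.2, ("s1", 1.0): 0.8},
    ("s1", "go"): {("s1", 0.0): 1.0},
}

def transition_prob(s_next, s, a):
    """p(s' | s, a) = sum over r of p(s', r | s, a)"""
    return sum(prob for (sp, r), prob in p[(s, a)].items() if sp == s_next)

def expected_reward(s, a):
    """r(s, a) = sum over r of r * sum over s' of p(s', r | s, a)"""
    return sum(r * prob for (sp, r), prob in p[(s, a)].items())

print(transition_prob("s1", "s0", "go"))  # 0.8
print(expected_reward("s0", "go"))        # 0.2 * 0.0 + 0.8 * 1.0 = 0.8
```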
Goals, Rewards & Returns
- Goal/objective
o Is formalized as a reward signal from the environment to the agent
o What the agent wants to maximize over time (not the immediate reward, but the cumulative reward)
- Rewards
o Formalize the objective: what we want to achieve (not how)
o Customize rewards such that maximizing them allows the agent to achieve an
objective
- Two types of tasks
o Episodic tasks (terminating tasks)
Agent interacts with the environment for a finite number of steps
Agent-env interaction breaks naturally into sub-sequences
Have a clear terminal state
Time of termination T is usually a random variable that can vary from episode
to episode
o Continuing tasks (non-terminating tasks)
Agent interacts with the environment for an infinite number of steps
Agent-env interaction doesn’t break naturally into sub-sequences
No terminal state
- Returns
o Episodic tasks: G_t = R_{t+1} + R_{t+2} + R_{t+3} + … + R_T
T: time of termination
o Continuing tasks: G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + …
Discount factor γ
0 ≤ γ ≤ 1
Determines the present value of future rewards
γ = 0 – Myopic agent: maximizes only the immediate reward
γ = 1 – Farsighted agent: takes future rewards into account
o Most Markov decision processes are discounted. Why?
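A worked example of the return formulas above and the effect of γ (the reward sequence is made up for illustration):

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [1.0, 1.0, 1.0, 10.0]
print(discounted_return(rewards, gamma=1.0))  # 13.0: farsighted, all rewards count equally
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5 + 0.25 + 1.25 = 3.0
print(discounted_return(rewards, gamma=0.0))  # 1.0: myopic, only the immediate reward counts
```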