Class notes

Reinforcement learning lecture material

15 views 0 purchase

Course
Reinforcement Learning (SOWBKI258)

Institution
Radboud Universiteit Nijmegen (RU)

Summary of all lectures. This is all exam material

[Show more]

Preview 3 out of 22 pages

View example

Uploaded on October 19, 2024
Number of pages 22
Written in 2023/2024
Type Class notes
Professor(s) Yuzhen qin
Contains All classes

learning
agents
reinforcement learning

Institution
Radboud Universiteit Nijmegen (RU)
Education
Artificial Intelligence
Course
Reinforcement Learning (SOWBKI258)

donjaschipper

Member since 2 months 3 documents sold

$5.92

Add to cart

Add to wishlist

100% satisfaction guarantee
Immediately available after payment
Both online and in PDF
No strings attached

Lecture notes Reinforcement Learning
Lecture 1: Introduction
Reinforcement Learning: learning to make decisions by interaction.

RL is different from (un)supervised learning:

- Agent sequentially interacts with the environment
- Agent observes a reward and the state of the environment
- Goal-directed: maximize the cumulative reward

Learning based on the reward hypothesis: any goal objective can be formalized as the outcome of
maximizing a cumulative reward.

Science and framework of learning to make decisions from interaction:

- It requires us to think about
o Time
o Consequences of actions
o Actively gathering experience
o Predicting about future
- At each step t
o Agent takes an action at
o Receives a reward rt (observed by agent)
o Environment: state transitions to st (observed by agent)
o Goal: maximize cumulative reward

A policy defines the agent’s way of behaving: π (a∨s).

Deterministic: it states what the next state is going to be for every action.
Stochastic: multiple states can be reached by an action.

Value function: v π (s)

- Describe how good a state is
- Depend on the state
- Depend on the policy
- Expected total reward!
A state might always yield a low immediate reward, but still have a high value because it is
regularly followed by other states that yield high rewards.

A model describes/mimics the behaviour of the environment. A model can:

- assist in estimating value functions
- make plans of a course of actions
- Model-Based RL: learns/uses a model to make decisions
- Model-Free RL: learns to make decisions solely based on interactions with the environment

History of RL
- RL stems from 3 different domains: trial-and-error, optimal control, temporal difference (TD)
o Trial-and-error

,  Law of effect: if an association is followed by a “satisfying state of affairs” it
will be strengthened and if it is followed by an “annoying state of affairs” it
will be weakened
 Clear parallel with evolution by natural selection
 Selection involves no guidance or supervision: depends only on the
consequences of actions or adaptation and whether these serve the
organism’s biological needs
 Credit assignment: process of attributing or assigning credit or responsibility
to past actions for the rewards received at a later time
o Optimal Control
 Control goal: drive a system from a state to a desired state
 Objective: minimize time/energy
 Optimal control generate a sequence of control inputs to realize our goal.
 Model: st +1=f (st ,ut )
 Goal - find control policy: optimize O( st ,ut )
 Class of methods:
 Optimal return function – Bellman Equation
 Dynamical programming
 Markovian Decision Processes (MDPs)
o Policy Iteration Method (to solve MDPs)
o Temporal Difference Learning
 Bootstrapping from existing estimates of the value function
 Re-estimate the value functions little by little
 Has a neural basis in the brain
 Fundamental aspect of TD: the calculation of reward prediction errors (i.e.,
difference between expected and actually received rewards).
o In 1989 Modern RL begins

Lecture 2: Markovian Decision Processes
Markov Processes
- Formalization of sequential decision-making, where the choice of actions influence:
o immediate rewards
o subsequent situations/states
o subsequent rewards
- Markov property: P [ S t +1|S 1 , … , S t ]=P [ S t+1|St ]
- M =⟨S , P ⟩
o P is a transition matrix
o S is a state space (finite or infinite, we usually use finite)
- Markov Reward Process
o M =⟨S , P , R , γ ⟩
 R is reward function
 Gamma is the discount factor (between 0 and 1)
 Trades of later rewards to earlier ones
o Rewards only depend on the current action!
- Markov Decision Processes
o M =⟨S , A , P , R , γ ⟩

,  A is the action space
o At each step t=1,2,3 , … , T
 Agent receives some representation of the environment's state
 Based on that information, agent selects an action
 Action modifies system’s state and generates a reward
o p ( s ' , r|s , a )=P [ S t +1=s' , R t+1 =r|S t =s , A t=a ¿
 p : S × R × S × A → [0, 1]
 fully defines the dynamics of the MDP
o State-transition probabilities
 p ( s '|s , a )=P [ St =s'|S t −1 =s , A t−1=a ]= ∑ p( s ' , r∨s , a)
r∈ R
o Expected reward for state-action pairs

'
p( s , r∨s , a)
r ( s , a )=E [ Rt|S t −1 =s , A t−1=a ]= ∑ r ∑ p (s ' , r ∨s ,a)=∑ r
r∈ R s' ∈S r ∈R p (s ' ∨s , a)
Goals, Rewards & Returns
- Goal/objective
o Is formalized as a reward signal from the environment to the agent
o What the agents wants to maximize over time (not immediate reward, but
cumulative reward)
- Rewards
o Formalized the objective: what we want to achieve (not how)
o Customize rewards such that maximizing them allows the agent to achieve an
objective
- Two types of tasks
o Episodic tasks (terminating tasks)
 Agent interacts with the environment for a finite number of steps
 Agent-env interaction breaks naturally into sub-sequences
 Have a clear terminal state
 Time of termination T is usually a random variable that can vary from episode
to episode
o Continuing tasks (non-terminating tasks)
 Agent interacts with the environment for an infinite number of steps
 Agent-env interaction doesn’t break naturally into sub-sequences
 No terminal state
- Returns
o Episodic tasks: Gt =Rt +1 + Rt +2+ R t +3+ …+ RT
 T: time of termination
o Continuing tasks: Gt =Rt +1 +γ R t+2 + y 2 Rt +3 +…
 Discount factor ƴ
 0≤γ≤1
 Determines the present value of future rewards
 γ=0 – Miopic agent: maximize only immediate reward
 γ=1 – Farsighted agent: take future reward into account
o Most Markov decision processes are discounted. Why?

The benefits of buying summaries with Stuvia:

Guaranteed quality through customer reviews

Stuvia customers have reviewed more than 700,000 summaries. This how you know that you are buying the best documents.

Quick and easy check-out

You can quickly pay through credit card or Stuvia-credit for the summaries. There is no membership needed.

Focus on what matters

Your fellow students write the study notes themselves, which is why the documents are always reliable and up-to-date. This ensures you quickly get to the core!

Frequently asked questions

What do I get when I buy this document?

You get a PDF, available immediately after your purchase. The purchased document is accessible anytime, anywhere and indefinitely through your profile.

Satisfaction guarantee: how does it work?

Our satisfaction guarantee ensures that you always find a study document that suits you well. You fill out a form, and our customer service team takes care of the rest.

Who am I buying these notes from?

Stuvia is a marketplace, so you are not buying this document from us, but from seller donjaschipper. Stuvia facilitates payment to the seller.

Will I be stuck with a subscription?

No, you only buy these notes for $5.92. You're not tied to anything after your purchase.

Can Stuvia be trusted?

4.6 stars on Google & Trustpilot (+1000 reviews)

51683 documents were sold in the last 30 days

Founded in 2010, the go-to place to buy study notes for 15 years now

Start selling

Popular Universities in the United States

Popular books

Find notes and summaries for these qualifications

Seller