Lecture notes Reinforcement Learning
Lecture 1: Introduction
Reinforcement Learning: learning to make decisions by interaction.

RL is different from (un)supervised learning:

- Agent sequentially interacts with the environment
- Agent observes a reward and the state of the environment
- Goal-directed: maximize the cumulative reward

Learning is based on the reward hypothesis: any goal or objective can be formalized as the outcome of maximizing a cumulative reward.

Science and framework of learning to make decisions from interaction:

- It requires us to think about
o Time
o Consequences of actions
o Actively gathering experience
o Predicting the future
- At each step t (see the sketch after this list)
o Agent takes an action a_t
o Receives a reward r_t (observed by agent)
o Environment: state transitions to s_t (observed by agent)
o Goal: maximize cumulative reward
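
A minimal Python sketch of this interaction loop; the two-state `ToyEnv` below is a made-up example, not something from the lectures:

```python
import random

class ToyEnv:
    """Hypothetical two-state environment, used only to illustrate the loop."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 1 moves the agent to state 1, which pays a higher reward.
        self.state = 1 if action == 1 else 0
        reward = 1.0 if self.state == 1 else 0.0
        return self.state, reward

env = ToyEnv()
state = env.reset()
cumulative_reward = 0.0
for t in range(10):                   # at each step t ...
    action = random.choice([0, 1])    # the agent takes an action a_t (here: at random)
    state, reward = env.step(action)  # the environment transitions and emits a reward r_t
    cumulative_reward += reward       # the goal is to maximize this cumulative reward
print(cumulative_reward)
```

A learning agent would replace the random action choice with a policy that is improved from the observed rewards.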

A policy defines the agent’s way of behaving: π(a|s).

Deterministic policy: it selects one fixed action for every state.
Stochastic policy: it specifies a probability distribution over actions, so different actions can be selected in the same state (see the sketch below).
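
The distinction can be made concrete with two toy policies for a hypothetical two-action problem (the state and action names are illustrative assumptions, not from the notes):

```python
import random

actions = ["left", "right"]

def deterministic_policy(state):
    # pi(s) = a: exactly one action per state.
    return "right" if state == "goal_is_right" else "left"

def stochastic_policy(state):
    # pi(a|s): a probability distribution over actions, so different
    # actions can be selected in the same state.
    probs = {"left": 0.2, "right": 0.8} if state == "goal_is_right" else {"left": 0.8, "right": 0.2}
    return random.choices(actions, weights=[probs[a] for a in actions])[0]

state = "goal_is_right"
print(deterministic_policy(state))                   # always "right"
print([stochastic_policy(state) for _ in range(5)])  # mostly "right", sometimes "left"
```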

Value function: v_π(s)

- Describes how good a state is
- Depends on the state
- Depends on the policy
- Equals the expected total reward!

A state might always yield a low immediate reward, but still have a high value because it is regularly followed by other states that yield high rewards (see the sketch below).
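
A minimal sketch of this point, assuming a made-up two-state chain in which state A pays no immediate reward but always leads to B, which pays +10 and then terminates:

```python
gamma = 0.9
reward = {"A": 0.0, "B": 10.0}
next_state = {"A": "B", "B": None}  # deterministic transitions, for simplicity

def value(state):
    # Total discounted reward collected along the (deterministic) chain.
    v, discount = 0.0, 1.0
    while state is not None:
        v += discount * reward[state]
        discount *= gamma
        state = next_state[state]
    return v

print(value("A"))  # 9.0: high value despite a zero immediate reward
print(value("B"))  # 10.0
```

With γ = 0.9, v(A) = 9 even though A itself pays nothing, precisely because A is always followed by the high-reward state B.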

A model describes/mimics the behaviour of the environment. A model can:

- assist in estimating value functions
- make plans for a course of action

Two classes of methods follow from this (contrasted in the sketch below):

- Model-Based RL: learns/uses a model to make decisions
- Model-Free RL: learns to make decisions solely based on interactions with the environment
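
A rough sketch of the contrast; both the hand-written model and the Q-learning-style update below are illustrative assumptions, not the algorithms covered in the lectures:

```python
gamma = 0.9

# Model-based: a (given or learned) model of the environment lets the agent
# plan by one-step lookahead before acting.
model = {("s0", "a0"): ("s1", 0.0),    # (next state, reward) for each state-action pair
         ("s0", "a1"): ("s2", 1.0)}
V = {"s0": 0.0, "s1": 5.0, "s2": 1.0}  # some current value estimates

def plan(state, actions=("a0", "a1")):
    # Pick the action whose predicted reward plus discounted next-state value is largest.
    return max(actions, key=lambda a: model[(state, a)][1] + gamma * V[model[(state, a)][0]])

# Model-free: no model at all; action values are updated directly from
# experienced transitions (s, a, r, s'), Q-learning style.
alpha = 0.1
Q = {("s0", "a0"): 0.0, ("s0", "a1"): 0.0}

def update(s, a, r, s_next, actions=("a0", "a1")):
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

print(plan("s0"))       # "a0": the model predicts s1, whose value outweighs the +1 reward of a1
update("s0", "a1", 1.0, "s2")
print(Q[("s0", "a1")])  # 0.1: the estimate is nudged toward the experienced outcome
```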

History of RL
- RL stems from 3 different domains: trial-and-error, optimal control, temporal difference (TD)
o Trial-and-error

 Law of effect: if an association is followed by a “satisfying state of affairs” it will be strengthened, and if it is followed by an “annoying state of affairs” it will be weakened
 Clear parallel with evolution by natural selection
 Selection involves no guidance or supervision: depends only on the
consequences of actions or adaptation and whether these serve the
organism’s biological needs
 Credit assignment: process of attributing or assigning credit or responsibility
to past actions for the rewards received at a later time
o Optimal Control
 Control goal: drive a system from a state to a desired state
 Objective: minimize time/energy
 Optimal control generates a sequence of control inputs to realize our goal.
 Model: s_{t+1} = f(s_t, u_t)
 Goal – find a control policy that optimizes the objective O(s_t, u_t)
 Class of methods:
 Optimal return function – Bellman Equation
 Dynamic programming
 Markovian Decision Processes (MDPs)
o Policy Iteration Method (to solve MDPs)
o Temporal Difference Learning
 Bootstrapping from existing estimates of the value function
 Re-estimate the value functions little by little
 Has a neural basis in the brain
 Fundamental aspect of TD: the calculation of reward prediction errors (i.e., the difference between expected and actually received rewards); see the TD(0) sketch after this list
o In 1989, modern RL begins
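
A minimal TD(0) sketch of this reward-prediction-error idea; the step size, discount factor, and the repeated A → B transition are illustrative assumptions:

```python
gamma, alpha = 0.9, 0.1
V = {"A": 0.0, "B": 0.0}                     # current value estimates

def td0_update(s, r, s_next):
    target = r + gamma * V.get(s_next, 0.0)  # bootstrap from the existing estimate of s_next
    delta = target - V[s]                    # TD error = reward prediction error
    V[s] += alpha * delta                    # re-estimate the value a little at a time

# Repeatedly observing the transition A -> B with reward +1 pulls V["A"] toward 1.0.
for _ in range(100):
    td0_update("A", 1.0, "B")
print(round(V["A"], 3))
```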

Lecture 2: Markovian Decision Processes
Markov Processes
- Formalization of sequential decision-making, where the choice of actions influences:
o immediate rewards
o subsequent situations/states
o subsequent rewards
- Markov property: P[S_{t+1} | S_1, …, S_t] = P[S_{t+1} | S_t]
- M = ⟨S, P⟩
o P is a transition matrix
o S is a state space (finite or infinite, we usually use finite)
- Markov Reward Process
o M = ⟨S, P, R, γ⟩
 R is the reward function
 γ is the discount factor (between 0 and 1)
 Trades off later rewards against earlier ones
o Rewards only depend on the current state!
- Markov Decision Processes
o M = ⟨S, A, P, R, γ⟩

 A is the action space
o At each step t = 1, 2, 3, …, T
 Agent receives some representation of the environment's state
 Based on that information, agent selects an action
 Action modifies system’s state and generates a reward
o p(s', r | s, a) = P[S_{t+1} = s', R_{t+1} = r | S_t = s, A_t = a]
 p : S × R × S × A → [0, 1]
 fully defines the dynamics of the MDP
o State-transition probabilities
 p(s' | s, a) = P[S_t = s' | S_{t-1} = s, A_{t-1} = a] = ∑_{r∈R} p(s', r | s, a)
o Expected reward for state-action pairs (see the tabular sketch after this list)
 r(s, a) = E[R_t | S_{t-1} = s, A_{t-1} = a] = ∑_{r∈R} r ∑_{s'∈S} p(s', r | s, a)
o Expected reward for state-action-next-state triples
 r(s, a, s') = E[R_t | S_{t-1} = s, A_{t-1} = a, S_t = s'] = ∑_{r∈R} r · p(s', r | s, a) / p(s' | s, a)
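
A small tabular sketch of these definitions, assuming a made-up two-state, two-action MDP (the states s0, s1 and actions a0, a1 are hypothetical):

```python
# Joint dynamics p(s', r | s, a): for each (s, a), a distribution over (s', r) pairs.
p = {
    ("s0", "a0"): {("s0", 0.0): 0.7, ("s1", 1.0): 0.3},
    ("s0", "a1"): {("s1", 1.0): 1.0},
    ("s1", "a0"): {("s0", 0.0): 1.0},
    ("s1", "a1"): {("s1", 2.0): 1.0},
}

def transition_prob(s_next, s, a):
    # p(s' | s, a) = sum over r of p(s', r | s, a)
    return sum(prob for (sp, r), prob in p[(s, a)].items() if sp == s_next)

def expected_reward(s, a):
    # r(s, a) = sum over r of r * sum over s' of p(s', r | s, a)
    return sum(r * prob for (sp, r), prob in p[(s, a)].items())

print(transition_prob("s1", "s0", "a0"))  # 0.3
print(expected_reward("s0", "a0"))        # 0.3
```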
Goals, Rewards & Returns
- Goal/objective
o Is formalized as a reward signal from the environment to the agent
o What the agents wants to maximize over time (not immediate reward, but
cumulative reward)
- Rewards
o Formalize the objective: what we want to achieve (not how)
o Customize rewards such that maximizing them allows the agent to achieve an
objective
- Two types of tasks
o Episodic tasks (terminating tasks)
 Agent interacts with the environment for a finite number of steps
 Agent-env interaction breaks naturally into sub-sequences
 Have a clear terminal state
 Time of termination T is usually a random variable that can vary from episode
to episode
o Continuing tasks (non-terminating tasks)
 Agent interacts with the environment for an infinite number of steps
 Agent-env interaction doesn’t break naturally into sub-sequences
 No terminal state
- Returns
o Episodic tasks: G_t = R_{t+1} + R_{t+2} + R_{t+3} + … + R_T
 T: time of termination
o Continuing tasks: G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + …
 Discount factor γ
 0 ≤ γ ≤ 1
 Determines the present value of future rewards
 γ = 0 – Myopic agent: maximizes only the immediate reward
 γ = 1 – Farsighted agent: takes future rewards into account
o Most Markov decision processes are discounted. Why?
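
A short sketch that computes both kinds of return; the reward sequence is arbitrary and only meant to show the effect of γ:

```python
def episodic_return(rewards):
    # G_t = R_{t+1} + R_{t+2} + ... + R_T
    return sum(rewards)

def discounted_return(rewards, gamma):
    # G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ...
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [1.0, 1.0, 1.0, 1.0]
print(episodic_return(rewards))           # 4.0
print(discounted_return(rewards, 0.0))    # 1.0   -> myopic agent: only the first reward counts
print(discounted_return(rewards, 1.0))    # 4.0   -> farsighted agent: all rewards count equally
print(discounted_return(rewards, 0.9))    # 3.439 -> intermediate: later rewards are worth less
```

For continuing tasks, γ < 1 also keeps the infinite sum finite, which is one reason most Markov decision processes are discounted.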
