Exam (worked solutions)

Markov Decision Processes Finals V2


Course
Markov Decision Processes Fin V2

Institution
Markov Decision Processes Fin V2


Uploaded on 30 October 2024
Number of pages 14
Written in 2024/2025
Type Exam (worked solutions)
Contains Questions and answers

Seller CertifiedGrades

A Markov Process is a process in which future states do not depend on the history before the current
state. ✔️✔️True, Markov means that you don't have to condition on anything past the most recent state.
A Markov Decision Process is a set of states satisfying the Markov property, together with actions, a
transition model, and rewards.

Decaying Reward encourages the agent to end the game quickly instead of running around and
gathering more reward. ✔️✔️True, as reward decays, reward collected later contributes less to the total
for the episode, so the agent is encouraged to maximize total reward by ending the game quickly.
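
The effect of decay can be checked with a few lines of arithmetic. The sketch below assumes a discount factor of 0.9 and two made-up episodes that earn the same goal reward after 2 and after 5 steps; the function name and the reward lists are illustrative only.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a reward sequence."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total

# Hypothetical episodes: the same goal reward of 10, reached sooner or later.
short_episode = [0, 10]
long_episode = [0, 0, 0, 0, 10]
gamma = 0.9  # assumed discount factor

short_return = discounted_return(short_episode, gamma)  # 0.9**1 * 10
long_return = discounted_return(long_episode, gamma)    # 0.9**4 * 10
print(short_return, long_return)
```

Under these assumptions the shorter episode has the larger discounted return, which is the incentive the card describes.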

R(s) and R(s,a) are equivalent. ✔️✔️True, an MDP written with one form can be rewritten with the
other; it just happens that it's easier to think about one vs. the other in certain situations.

Reinforcement Learning is harder to compute than a simple MDP. ✔️✔️True, for an MDP with a known
model you can just use the Bellman Equations, but Reinforcement Learning requires that you make
observations and then summarize those observations as values.

An optimal policy is the best possible sequence of actions for an MDP. ✔️✔️True, with a single caveat.
The optimal policy maximizes expected cumulative reward by taking, in each state, the argmax over
actions of the immediate reward plus the value of the resulting state. But MDPs are memoryless, so a
policy is a mapping from states to actions and there is no concept of a fixed "sequence".

Temporal Difference Learning is the difference in reward you see on subsequent time steps.
✔️✔️False, Temporal Difference Learning is based on the difference in value estimates on successive
time steps.
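
As a minimal sketch of what "difference in value estimates" means, the TD(0) rule moves V(s) toward r + gamma * V(s') by a fraction alpha of the gap. The state names, reward, and parameter values below are assumptions chosen only to make the arithmetic easy to follow.

```python
def td0_update(values, state, reward, next_state, alpha, gamma):
    """Apply one TD(0) update in place and return the TD error."""
    # TD error: gap between the bootstrapped target and the current estimate.
    td_error = reward + gamma * values[next_state] - values[state]
    values[state] += alpha * td_error
    return td_error

# Hypothetical value table and a single observed transition A -> B.
values = {"A": 0.0, "B": 1.0}
delta = td0_update(values, "A", reward=0.5, next_state="B", alpha=0.1, gamma=0.9)
print(delta, values["A"])
```

The update depends on the estimate for the next state, not only on the observed reward, which is why the card's statement is false.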

RL falls generally into 3 different categories: Model-Based, Value-Based, and Policy-Based. ✔️✔️True,
Model-Based learns a model of the transitions and rewards and then essentially uses the Bellman
Equations to solve the problem, Value-Based learns values directly (for example with Temporal
Difference Learning), and Policy-Based searches directly over policies instead of deriving the policy
from learned values.

TD Learning is defined by Incremental Estimates that are Outcome Based. ✔️✔️True, TD Learning
thinks of learning in terms of "episodes", which it uses to update its value estimates incrementally
from observed outcomes rather than relying on a predefined model.

For a learning rate to guarantee convergence, the sum of the learning rates must be infinite, and the sum
of the squared learning rates must be finite. ✔️✔️True, these are the standard step-size conditions from
stochastic approximation (the Robbins-Monro conditions). For example, a learning rate of 1/t at step t
satisfies both.
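
A quick numerical illustration, assuming the schedule alpha_t = 1/t: the partial sums of alpha_t keep growing (the harmonic series diverges), while the partial sums of alpha_t squared level off below pi^2/6. The helper name and the cut-off points are arbitrary choices for this sketch.

```python
import math

def partial_sums(n_terms):
    """Return (sum of 1/t, sum of 1/t**2) for t = 1..n_terms."""
    sum_rates = sum(1.0 / t for t in range(1, n_terms + 1))
    sum_squares = sum(1.0 / (t * t) for t in range(1, n_terms + 1))
    return sum_rates, sum_squares

sum_small, sq_small = partial_sums(1_000)
sum_large, sq_large = partial_sums(100_000)

# The first sum keeps growing with more terms; the second barely moves.
print(sum_small, sum_large)
print(sq_small, sq_large, math.pi ** 2 / 6)
```

A finite computation cannot prove divergence, so this only shows the trend the two conditions describe.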

All of the TD learning methods have drawbacks: TD(1) is inefficient because it requires too much data and
has high variance, while TD(0) gives the maximum likelihood estimate but is hard to calculate for long
episodes. ✔️✔️True, this is why we use TD(λ), which has many of the benefits of TD(0) but is much more
performant. Empirically, values of λ between 0.3 and 0.7 seem to perform best.

To control learning, you simply have the operator choose actions in addition to learning. ✔️✔️True,
states are experienced as observations during learning, so the operator can influence learning.

Q-Learning converges. ✔️✔️True, the Bellman operator is a contraction mapping, and when the learning
rates are chosen so that their sum is infinite but the sum of their squares is finite, and every
state-action pair keeps being visited, it converges to Q*.
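
A minimal tabular sketch of this behaviour, under stated assumptions: a hypothetical three-state chain with deterministic moves, a purely random behaviour policy, and a constant learning rate of 0.5. A constant rate does not meet the summability condition, but it suffices here because the environment is deterministic. All names and constants are illustrative.

```python
import random

GAMMA = 0.9
ALPHA = 0.5       # constant step size; adequate only because moves are deterministic
N_STATES = 3      # states 0 and 1 are non-terminal, state 2 is the terminal goal
ACTIONS = (0, 1)  # 0 = left, 1 = right

def step(state, action):
    """Deterministic chain: reaching state 2 pays 1 and ends the episode."""
    next_state = max(state - 1, 0) if action == 0 else state + 1
    reward = 1.0 if next_state == 2 else 0.0
    return next_state, reward, next_state == 2

def q_learning(episodes, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = rng.choice(ACTIONS)  # explore uniformly at random
            next_state, reward, done = step(state, action)
            # Off-policy target: bootstrap from the best action in the next state.
            target = reward if done else reward + GAMMA * max(q[next_state])
            q[state][action] += ALPHA * (target - q[state][action])
            state = next_state
    return q

q = q_learning(episodes=500)
print(q[0], q[1])
```

For this chain Q* can be worked out by hand: 1 for moving right from state 1, 0.9 for moving right from state 0, and 0.81 for either left move, so the learned table can be compared against those values.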

As long as the update operators for Q-learning or Value-iteration are non-expansions, they will
converge. ✔️✔️True, there are expansions that will converge, but only non-expansions are
guaranteed to converge independent of their starting values.

A convex combination will converge. ✔️✔️False, it must be a fixed convex combination to converge. If
the weights can change, as with Boltzmann exploration, then it is not guaranteed to converge.

In Greedy Policies, the difference between the true value and the current value of the policy is less than
some epsilon value for exploration. ✔️✔️True

This bound serves as a good check for how long to run value iteration before we can be pretty confident
that we have the optimal policy. ✔️✔️True
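
A sketch of value iteration with such a stopping check, assuming a made-up two-state MDP and a threshold epsilon on the largest change between sweeps. A commonly cited form of the bound says that once the change falls below epsilon, the greedy policy's value is within 2 * epsilon * gamma / (1 - gamma) of optimal; the transition table and names below are hypothetical.

```python
GAMMA = 0.9

# Hypothetical 2-state MDP: P[s][a] is a list of (probability, next_state, reward).
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}

def backup(values, state, action):
    """Expected reward plus discounted next-state value for one action."""
    return sum(p * (r + GAMMA * values[s2]) for p, s2, r in P[state][action])

def value_iteration(epsilon):
    values = {s: 0.0 for s in P}
    while True:
        new_values = {s: max(backup(values, s, a) for a in P[s]) for s in P}
        change = max(abs(new_values[s] - values[s]) for s in P)
        values = new_values
        if change < epsilon:  # stop once a full sweep barely changes anything
            return values

def greedy_policy(values):
    """Pick, in each state, the action with the best one-step backup."""
    return {s: max(P[s], key=lambda a: backup(values, s, a)) for s in P}

values = value_iteration(epsilon=1e-8)
policy = greedy_policy(values)
print(values, policy)
```

In this example the greedy policy usually stops changing long before the values themselves settle, which is why a threshold on the change is a practical stopping rule.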

For a set of linear equations, the solution can be found in polynomial time. ✔️✔️True
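
One place this shows up in MDPs is policy evaluation: for a fixed policy the Bellman equations are linear in V, so (I - gamma * P) V = R can be solved directly. The sketch below uses plain Gaussian elimination, which takes on the order of n^3 operations, on a hypothetical two-state policy; the matrix, rewards, and function name are assumptions for illustration.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting, O(n^3)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix copy
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x

GAMMA = 0.9
# Hypothetical fixed policy: state-to-state transition matrix and expected rewards.
P_PI = [[0.2, 0.8], [0.0, 1.0]]
R_PI = [0.8, 2.0]

# Policy evaluation: (I - gamma * P_pi) V = R_pi
A = [[(1.0 if i == j else 0.0) - GAMMA * P_PI[i][j] for j in range(2)]
     for i in range(2)]
V = solve_linear_system(A, R_PI)
print(V)
```

The solver assumes the system is non-singular, which holds for policy evaluation whenever gamma is below 1.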
