Markov Decision Processes & Q-Learning
Q: What is a Markov Decision Process (MDP)? ✔️✔️A: An MDP is a mathematical framework used to
describe an environment in decision making where outcomes are partly random and partly under the
control of a decision maker.
Q: How does Q-learning work? ✔️✔️A: Q-learning is a model-free reinforcement learning algorithm
that learns the value of an action in a particular state by using Q-values, which are estimates of the
optimal action values.
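For reference, the heart of the algorithm is the standard tabular update rule (revisited with a code sketch at the end of these cards):

Q(s, a) \leftarrow Q(s, a) + \alpha \big[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \big]

where α is the learning rate, γ the discount factor, r the observed reward, and s' the next state, all defined in the cards below.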
Q: What is the role of the transition probability in an MDP? ✔️✔️A: The transition probability is the
probability that a particular action in a state will lead to a subsequent state. It is a key component in
defining the dynamics of an MDP.
Q: Define the reward function in the context of MDPs. ✔️✔️A: The reward function assigns a score to
each action at a particular state, which represents the immediate gain from that action, guiding the
agent toward its goal.
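To make the transition probabilities and the reward function concrete, here is a minimal Python sketch of a two-state toy MDP; the state names, actions, probabilities, and rewards are invented for illustration and do not come from any particular example in these cards.

# P[(state, action)] lists (next_state, probability) pairs;
# R[(state, action)] is the immediate reward for that action.
# All values below are made-up toy numbers.
P = {
    ('s0', 'right'): [('s1', 0.8), ('s0', 0.2)],
    ('s0', 'left'):  [('s0', 1.0)],
    ('s1', 'right'): [('s1', 1.0)],
    ('s1', 'left'):  [('s0', 1.0)],
}
R = {
    ('s0', 'right'): 0.0,
    ('s0', 'left'):  0.0,
    ('s1', 'right'): 1.0,
    ('s1', 'left'):  0.0,
}

Note that each list of probabilities sums to 1, as a transition distribution must.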
Q: What does 'policy' refer to in MDPs? ✔️✔️A: A policy is a strategy or rule that defines the choice
of action based on the current state. It maps states to actions; an optimal policy is one whose mapping maximizes the long-term reward.
Q: Explain the Bellman equation. ✔️✔️A: The Bellman equation provides a recursive decomposition
for the value function of a policy. It expresses the value of a state as the sum of the immediate reward
and the discounted value of the next state.
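Written out for the state-value function of a policy π, the Bellman equation reads:

V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a) \big[ R(s, a, s') + \gamma V^{\pi}(s') \big]

with transition probabilities P, reward function R, and discount factor γ as defined in the surrounding cards.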
Q: What is an episodic task in the context of reinforcement learning? ✔️✔️A: An episodic task is a task
that has a clear ending, at which point the agent resets to a starting state or a random state. Each
episode ends with a terminal state.
Q: How does temporal difference (TD) learning relate to Q-learning? ✔️✔️A: Q-learning is a TD
method: the agent learns directly from raw experience without a model of the environment's
dynamics, updating estimates partly on the basis of other learned estimates (bootstrapping).
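For comparison, the simplest TD method, TD(0) for state values, uses the same bootstrapped-target idea as Q-learning:

V(s) \leftarrow V(s) + \alpha \big[ r + \gamma V(s') - V(s) \big]

where the target r + γV(s') itself contains a learned estimate, which is what updating estimates based on other estimates means.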
Q: What is the exploration-exploitation trade-off in Q-learning? ✔️✔️A: The exploration-exploitation
trade-off involves choosing whether to explore the environment to find better rewards in the future or
to exploit known rewards to maximize immediate gain.
Q: What are value functions in the context of MDPs? ✔️✔️A: Value functions estimate how good it is
for an agent to be in a given state, considering the amount of reward the agent expects to accumulate in
the future.
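Formally, the state-value function of a policy π is the expected discounted return starting from state s:

V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\middle|\, s_{0} = s \right]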
Q: Describe the Q-value or action-value function. ✔️✔️A: The Q-value function provides the value of
taking an action in a given state under a specific policy, predicting expected future rewards.
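Analogously, the Q-value function additionally conditions on the first action taken:

Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\middle|\, s_{0} = s,\; a_{0} = a \right]

so that following π from the first action onward recovers the state-value function above.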
Q: What is the difference between model-based and model-free reinforcement learning? ✔️✔️A:
Model-based methods require knowledge of the environment's model (transitions and rewards),
whereas model-free methods, like Q-learning, do not use such knowledge and learn policies directly
from interactions with the environment.
Q: Explain the significance of the discount factor in reinforcement learning. ✔️✔️A: The discount
factor, denoted as gamma (γ), determines the present value of future rewards; a lower value places
more emphasis on immediate rewards, while a higher value favors long-term rewards.
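As a quick worked example: with γ = 0.9, a reward of 1 received three steps in the future is worth 0.9³ = 0.729 today, whereas with γ = 0.5 it is worth only 0.5³ = 0.125, so the lower γ discounts distant rewards much more heavily.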
Q: What does it mean for an MDP to be 'solved'? ✔️✔️A: Solving an MDP means finding an optimal
policy that maximizes the expected return from all states, typically through methods like value iteration
or policy iteration.
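As a concrete illustration, here is a minimal value-iteration sketch in Python, written against the hypothetical P and R tables from the toy MDP above; gamma and the convergence tolerance are arbitrary illustrative choices, not prescribed values.

# A minimal value-iteration sketch over the toy P and R tables above.
def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-6):
    V = {s: 0.0 for s in states}          # start with all-zero values
    while True:
        delta = 0.0
        for s in states:
            # Back up each action's expected value, then keep the best.
            q = [R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)])
                 for a in actions]
            best = max(q)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:                   # stop once values stabilize
            return V

Calling value_iteration(['s0', 's1'], ['left', 'right'], P, R) on the toy tables returns the optimal state values, from which a greedy (optimal) policy can be read off by one-step lookahead.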
Q: How does the ε-greedy strategy mitigate the exploration-exploitation dilemma? ✔️✔️A: The ε-
greedy strategy involves choosing a random action with probability ε (exploration) and the best-known
action with probability 1-ε (exploitation), balancing the two approaches.
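A minimal sketch of ε-greedy action selection in Python, assuming the same hypothetical Q-table layout (a dict keyed by (state, action) pairs) used in the other sketches; the default epsilon is an arbitrary illustrative value.

import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    # With probability epsilon, explore: pick a uniformly random action.
    if random.random() < epsilon:
        return random.choice(actions)
    # Otherwise exploit: pick the action with the highest Q-value.
    return max(actions, key=lambda a: Q[(state, a)])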
Q: What is the role of the learning rate in Q-learning? ✔️✔️A: The learning rate, or alpha (α),
determines the extent to which new information overrides old information. A higher learning rate
means that newer information is considered more heavily.
Q: Describe how the update rule in Q-learning adjusts the Q-values. ✔️✔️A: In Q-learning, the update
rule adjusts Q-values based on the difference between the estimated Q-value and the observed reward
plus the discounted maximum future Q-value, refining the policy to better predict optimal actions.
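Putting the cards together, here is a minimal sketch of that update in Python, using the same hypothetical dict-based Q-table as the sketches above; the alpha and gamma defaults are illustrative.

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    # TD target: observed reward plus the discounted best next Q-value.
    # (For a terminal s_next, the target would reduce to just r.)
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    # Move the current estimate toward the target by a step of size alpha.
    Q[(s, a)] += alpha * (target - Q[(s, a)])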