Markov Decision Processes & Q-Learning
Q: What is a Markov Decision Process (MDP)? ✔️✔️A: An MDP is a mathematical framework used to
describe an environment in decision making where outcomes are partly random and partly under the
control of a decision maker.
Q: How does Q-learning work? ✔️✔️A: Q-learning is a model-free reinforcement learning algorithm
that learns the value of an action in a particular state by using Q-values, which are estimates of the
optimal action values.
Q: What is the role of the transition probability in an MDP? ✔️✔️A: The transition probability is the
probability that a particular action in a state will lead to a subsequent state. It is a key component in
defining the dynamics of an MDP.
Q: Define the reward function in the context of MDPs. ✔️✔️A: The reward function assigns a score to
each action at a particular state, which represents the immediate gain from that action, guiding the
agent toward its goal.
Q: What does 'policy' refer to in MDPs? ✔️✔️A: A policy is a strategy or rule that defines the choice
of action based on the current state, i.e. a mapping from states to actions. An optimal policy maps each
state to an action that maximizes the long-term reward.
Q: Explain the Bellman equation. ✔️✔️A: The Bellman equation provides a recursive decomposition
for the value function of a policy. It expresses the value of a state as the sum of the immediate reward
and the discounted value of the next state.
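In standard notation (the symbols below are conventional, not spelled out in this summary), the Bellman
equation for the state-value function of a policy π reads:

```latex
V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\left[ R(s, a, s') + \gamma V^{\pi}(s') \right]
```

The bracketed term is the immediate reward plus the discounted value of the successor state, averaged
over the policy's action choices and the transition probabilities.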
Q: What is an episodic task in the context of reinforcement learning? ✔️✔️A: An episodic task is a task
that has a clear ending, at which point the agent resets to a starting state or a random state. Each
episode ends with a terminal state.
Q: How does temporal difference (TD) learning relate to Q-learning? ✔️✔️A: Q-learning is a form of
TD learning: the agent learns directly from raw experience without a model of the environment's
dynamics, updating its estimates partly on the basis of other learned estimates (bootstrapping).
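As a concrete instance, the tabular TD(0) update for a state-value estimate, with step size α and
discount factor γ (standard form, stated here for illustration):

```latex
V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]
```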
Q: What is the exploration-exploitation trade-off in Q-learning? ✔️✔️A: The exploration-exploitation
trade-off involves choosing whether to explore the environment to find better rewards in the future or
to exploit known rewards to maximize immediate gain.
Q: What are value functions in the context of MDPs? ✔️✔️A: Value functions estimate how good it is
for an agent to be in a given state, considering the amount of reward the agent expects to accumulate in
the future.
Q: Describe the Q-value or action-value function. ✔️✔️A: The Q-value function provides the value of
taking an action in a given state under a specific policy, predicting expected future rewards.
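In conventional notation (an assumed convention, not defined in this summary), the Q-value is the
expected discounted return from taking action a in state s and following policy π afterwards:

```latex
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s,\, a_t = a \right]
```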
Q: What is the difference between model-based and model-free reinforcement learning? ✔️✔️A:
Model-based methods require knowledge of the environment's model (transitions and rewards),
whereas model-free methods, like Q-learning, do not use such knowledge and learn policies directly
from interactions with the environment.
Q: Explain the significance of the discount factor in reinforcement learning. ✔️✔️A: The discount
factor, denoted as gamma (γ), determines the present value of future rewards; a lower value places
more emphasis on immediate rewards, while a higher value favors long-term rewards.
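A quick worked example of the discount factor's effect: with a constant reward of 1 per step, the
discounted return is a geometric series, so

```latex
G_t = \sum_{k=0}^{\infty} \gamma^{k} = \frac{1}{1 - \gamma}, \qquad \gamma = 0.9 \Rightarrow G_t = 10, \quad \gamma = 0.5 \Rightarrow G_t = 2
```

Higher γ therefore lets rewards far in the future contribute substantially to the return.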
Q: What does it mean for an MDP to be 'solved'? ✔️✔️A: Solving an MDP means finding an optimal
policy that maximizes the expected return from all states, typically through methods like value iteration
or policy iteration.
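A minimal value-iteration sketch in Python, assuming the MDP is given as hypothetical NumPy arrays
P (transition probabilities) and R (expected rewards); the names and shapes are illustrative, not from
any particular library:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """P[s, a, s2] = transition probability; R[s, a] = expected reward.
    Returns the optimal state values and a greedy policy."""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Bellman optimality backup: Q(s, a) = R(s, a) + gamma * E[V(s')]
        Q = R + gamma * (P @ V)        # shape: (n_states, n_actions)
        V_new = Q.max(axis=1)          # act greedily in every state
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new
```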
Q: How does the ε-greedy strategy mitigate the exploration-exploitation dilemma? ✔️✔️A: The ε-
greedy strategy involves choosing a random action with probability ε (exploration) and the best-known
action with probability 1-ε (exploitation), balancing the two approaches.
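A minimal ε-greedy selector in Python, assuming a tabular Q array indexed as Q[state, action] (the
names are illustrative):

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon, rng=np.random.default_rng()):
    """With probability epsilon pick a random action (explore);
    otherwise pick the best-known action (exploit)."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))  # explore
    return int(np.argmax(Q[state]))           # exploit
```

Decaying ε over time is a common refinement: explore heavily early on, then exploit more as the
Q-values become reliable.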
Q: What is the role of the learning rate in Q-learning? ✔️✔️A: The learning rate, or alpha (α),
determines the extent to which new information overrides old information. A higher learning rate
means that newer information is considered more heavily.
Q: Describe how the update rule in Q-learning adjusts the Q-values. ✔️✔️A: In Q-learning, the update
rule adjusts Q-values based on the difference between the estimated Q-value and the observed reward
plus the discounted maximum future Q-value, refining the policy to better predict optimal actions.
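A sketch of that update rule in tabular Python, using the learning rate α and discount γ from the
earlier cards (array and variable names are assumptions for illustration):

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One Q-learning step on a tabular Q (e.g. a NumPy array)."""
    td_target = r + gamma * Q[s_next].max()  # reward + discounted max future value
    td_error = td_target - Q[s, a]           # the gap the rule corrects
    Q[s, a] += alpha * td_error              # alpha weighs new vs. old information
    return Q
```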