Machine learning summary
Chapter 5
Probability is a way to measure how likely something is to happen. There are two kinds of
probability.
The first kind is called objective probability. Here, the probability of an event is the relative frequency with which it happens when we repeat the same experiment many times. For example, if we roll a fair die many times, the number 6 comes up about one time in every six rolls, so the probability of rolling a 6 is one in six. This probability is the same for everyone, no matter what they think or know.
The second kind is called subjective probability. Here, probability expresses how strongly we believe something, given what we know. For example, if we are not sure whether it will rain tomorrow, we might say there is a 50% chance it will rain and a 50% chance it won't. This probability can differ from person to person, because it depends on what each of them knows or believes.
So, objective probability is about what happens in real life, while subjective probability is
about what we think might happen.
Sample space is a way to talk about all the possible outcomes of a situation. For example, if
we flip a coin, the sample space is the set of all possible outcomes: heads or tails. There are
two types of sample space: discrete and continuous.
A discrete sample space is one with a countable number of possible outcomes (often just a finite number). For example, when rolling a die, the sample space is the numbers 1 through 6.
On the other hand, a continuous sample space has uncountably many possible outcomes, described by real numbers, such as the height of a person or the temperature outside.
The event space is the set of all events we can ask about, where an event is a subset of the sample space. For example, if we roll a die, one event is getting an odd number ({1, 3, 5}); another is getting a number greater than 4 ({5, 6}).
Lastly, the power set of a set is the set of all its subsets. For example, the power set of a set of three numbers contains all 2^3 = 8 possible subsets, including the empty set and the set containing all three numbers. The power set is useful in probability because, for a discrete sample space, we can take it as the event space: it lets us talk about every possible event that could happen within the sample space.
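As a small sketch in Python (the variable names are just made up for this illustration), we can list the sample space of a die, build its power set as the event space, and pick out one event:

    from itertools import combinations

    # Discrete sample space of a single die roll.
    sample_space = {1, 2, 3, 4, 5, 6}

    def power_set(s):
        """All subsets of s, from the empty set up to s itself."""
        items = list(s)
        return [set(c) for r in range(len(items) + 1)
                for c in combinations(items, r)]

    event_space = power_set(sample_space)  # 2**6 = 64 possible events
    odd = {1, 3, 5}                        # the event "the roll is odd"
    greater_than_4 = {5, 6}                # the event "the roll is > 4"

    print(len(event_space))                # 64
    print(odd in event_space)              # True: it's one of the events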
The inversion problem is when we want to figure out the probability of a hidden cause, given
an observable event. For example, if we see a car accident, we might want to know what
caused the accident, even if we didn't see the cause directly.
Bayes' rule is a mathematical formula that helps us solve the inversion problem. It says that
the probability of a cause given an effect is equal to the probability of the effect given the
cause, times the prior probability of the cause, divided by the prior probability of the effect.
In simpler terms, Bayes' rule helps us update our beliefs about the probability of a cause
based on new information. For example, if we hear that it was raining at the time of the car
accident, we might update our belief about the probability that wet roads caused the
accident.
So, the inversion problem is when we want to figure out the cause of an observable event,
and Bayes' rule is a way to help us do that by using probabilities and updating our beliefs
based on new information.
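As a hedged sketch of how that update works, with made-up numbers for the car-accident example, Bayes' rule says P(cause | effect) = P(effect | cause) * P(cause) / P(effect):

    # Made-up numbers: "cause" = wet road, "effect" = car accident.
    p_cause = 0.2                      # prior: P(wet road)
    p_effect_given_cause = 0.05        # likelihood: P(accident | wet road)
    p_effect_given_not_cause = 0.01    # P(accident | dry road)

    # Prior probability of the effect, by summing over both cases.
    p_effect = (p_effect_given_cause * p_cause
                + p_effect_given_not_cause * (1 - p_cause))

    # Bayes' rule: posterior probability of the cause given the effect.
    p_cause_given_effect = p_effect_given_cause * p_cause / p_effect
    print(round(p_cause_given_effect, 3))  # about 0.556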
Theta (θ) = the parameters of a model
The maximum likelihood principle is used by frequentists to estimate the value of a
parameter based on observed data, while Bayesians use probability distributions to express
their beliefs about the parameter before and after seeing the data. The prior distribution
represents beliefs about the parameter before seeing the data, and the posterior distribution
represents beliefs about the parameter after seeing the data.
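A minimal sketch of the difference for coin flips, assuming a Beta prior (a standard but not the only possible choice): the frequentist reports the single theta that maximizes the likelihood, while the Bayesian reports a whole posterior distribution over theta.

    heads, tails = 7, 3                 # observed data

    # Frequentist: maximum likelihood estimate of theta = P(heads).
    theta_ml = heads / (heads + tails)  # 0.7

    # Bayesian: start from a Beta(a, b) prior over theta and update it.
    a_prior, b_prior = 2, 2             # prior belief: theta is probably near 0.5
    a_post = a_prior + heads            # posterior is Beta(a + heads, b + tails)
    b_post = b_prior + tails

    posterior_mean = a_post / (a_post + b_post)
    print(theta_ml, posterior_mean)     # 0.7 vs. about 0.643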
MAP (maximum a posteriori) estimation is a way to estimate a parameter that combines the frequentist and Bayesian views. It's like a game where you want to guess something based on clues: you use the clues you have (the data) together with what you already believed (the prior) to make your guess. The result is a single number, like the frequentist estimate, but it is pulled toward the prior, so it usually lies somewhere between what the data alone would suggest and what we believed beforehand. This estimate is often better than just using the frequentist method (the maximum likelihood principle). To compute it, MAP multiplies the likelihood by the prior and picks the parameter value that maximizes that product.
Probabilistic classifiers are like teachers who not only give you an answer, but also tell you
how sure they are about it. They can give you a probability for each possible answer. For
example, if they think an email is either spam or not spam, they can tell you the probability
that it is spam (let's say 0.1) and the probability that it's not spam (let's say 0.9). We can use
these probabilities to do different things. We can use them to make a ranking of the possible
answers, or to see how sure the classifier is about its answer. If we don't need the
probabilities, we can just use the answer with the highest probability and treat it like a
regular classifier.
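A tiny sketch of those two ways of using the output (the probabilities are made up): keep the probabilities to rank emails, or just take the most probable class.

    # Made-up output of a probabilistic spam classifier for three emails.
    predictions = [
        {"spam": 0.10, "ham": 0.90},
        {"spam": 0.85, "ham": 0.15},
        {"spam": 0.55, "ham": 0.45},
    ]

    # Use the probabilities: rank emails from most to least likely spam.
    ranked = sorted(predictions, key=lambda p: p["spam"], reverse=True)

    # Or throw the probabilities away and act like a regular classifier.
    labels = [max(p, key=p.get) for p in predictions]
    print(labels)  # ['ham', 'spam', 'spam']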
For example, say we want to build a system that can recognize images of cats and dogs. Here's how the two approaches would work:
Discriminative approach: We would train a machine learning model to directly predict the probability of each class (cat or dog) given the features of the image X, such as the color, texture, and shape; in other words, it models P(Y|X) directly. Once the model is trained, we can use it to classify new images as either cats or dogs based on the probabilities it outputs.
Generative approach: We would train two probability models, one for cats and one for dogs. These models would learn the probability distribution of the features within each class, P(X|Y), and the prior probability of each class, P(Y). Given a new image, we would use Bayes' rule to calculate the probability of the image belonging to each class, P(Y|X). We can also use the generative models to generate new examples of cats or dogs by sampling from the learned distributions.
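A rough sketch of how the generative side could look for a single made-up feature x (say, one number summarizing the image), assuming Gaussian class models and using numpy/scipy; a discriminative model such as logistic regression would instead fit P(Y|X) directly from labeled examples.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)

    # Made-up training data: one numeric feature per image.
    x_dog = rng.normal(2.0, 1.0, 100)   # feature values of dog images
    x_cat = rng.normal(4.0, 1.0, 100)   # feature values of cat images

    # Generative approach: model P(X | Y) per class plus the class prior P(Y),
    # then use Bayes' rule to turn them into P(Y | X).
    p_cat, p_dog = 0.5, 0.5
    cat_dist = norm(x_cat.mean(), x_cat.std())
    dog_dist = norm(x_dog.mean(), x_dog.std())

    def p_cat_given_x(x):
        numerator = cat_dist.pdf(x) * p_cat
        denominator = numerator + dog_dist.pdf(x) * p_dog
        return numerator / denominator

    print(round(p_cat_given_x(3.5), 2))     # probability that this image is a cat
    print(cat_dist.rvs(3, random_state=0))  # the same model can generate new "cat" features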
Naive Bayes is a simple but effective way to figure out if an email is spam or not. It assumes
that each thing they look at in the email (like if it has certain words or if it came from a
certain person) is not related to any of the other things they look at. They just figure out how
often each thing shows up in spam emails and in non-spam emails, and use that to guess if
the new email is spam or not. If one of the feature values in the new email never appears in the training data for a class, its estimated probability for that class is zero, the whole product becomes zero, and Naive Bayes can no longer give a sensible answer for that email.
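A minimal count-based sketch of the idea, with made-up toy data: each feature is treated as independent given the class, so we just multiply per-feature probabilities.

    # Toy training data: each email is a dict of binary features plus a label.
    emails = [
        ({"has_link": True,  "known_sender": False}, "spam"),
        ({"has_link": True,  "known_sender": False}, "spam"),
        ({"has_link": False, "known_sender": True},  "ham"),
        ({"has_link": False, "known_sender": True},  "ham"),
        ({"has_link": True,  "known_sender": True},  "ham"),
    ]

    def naive_bayes_score(features, label):
        """P(label) times the product of P(feature value | label)."""
        in_class = [f for f, y in emails if y == label]
        score = len(in_class) / len(emails)              # class prior
        for name, value in features.items():
            matches = sum(1 for f in in_class if f[name] == value)
            score *= matches / len(in_class)             # can be zero!
        return score

    new_email = {"has_link": True, "known_sender": False}
    print(naive_bayes_score(new_email, "spam"))  # 0.4
    print(naive_bayes_score(new_email, "ham"))   # 0.0 -> the zero-count problem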
Laplace smoothing is a way to avoid zero probabilities when we estimate probabilities from limited data. We add fake (pseudo-)instances to the data set that between them cover all possible values of the features, so every feature value gets a non-zero count for every class. For example, if we are classifying emails as spam or ham and no spam email in the data has the value "F" for some feature, we add fake spam emails covering both "T" and "F" so we can still give that combination of feature values a small, non-zero probability.
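In terms of counts, adding those pseudo-instances amounts to adding a small pseudo-count (here 1, the classic "add-one" version) to every feature-value count, so nothing is ever exactly zero. Continuing the Naive Bayes sketch above:

    def smoothed_probability(matches, class_size, num_values=2):
        """Add-one (Laplace) smoothing for a feature with num_values possible values."""
        return (matches + 1) / (class_size + num_values)

    # Before: 0 ham emails out of 3 had known_sender == False -> probability 0.
    print(smoothed_probability(0, 3))  # 0.2 instead of 0.0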
Imagine you want to send a secret message to your friend using only 0's and 1's, giving each word its own code. If one code word is the beginning of another, for example "0" for "yes" and "01" for "no", your friend can't tell where one code word ends and the next begins. A prefix-free code is a special way of writing your message in which no code word is the beginning (prefix) of another code word. This way, your friend can understand the message by just reading the 0's and 1's in order, without any extra signs to separate them.
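A small sketch of why the prefix-free property makes decoding unambiguous (the code table here is just an example):

    # A prefix-free code: no code word is the start of another code word.
    code = {"0": "yes", "10": "no", "11": "maybe"}

    def decode(bits):
        """Read bits left to right; emit a word as soon as a code word matches."""
        message, buffer = [], ""
        for bit in bits:
            buffer += bit
            if buffer in code:          # unambiguous thanks to the prefix property
                message.append(code[buffer])
                buffer = ""
        return message

    print(decode("0110100"))  # ['yes', 'maybe', 'yes', 'no', 'yes']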
Arithmetic coding is a way to turn data into a code that takes up less space. It works by giving shorter codes to outcomes that happen more often. We do this by modeling the probability of each outcome and then using that model to build the code. The important point is that the code for an outcome with probability p is never more than about one bit longer than the ideal length of -log2(p) bits, so we can usually ignore that one bit in our calculations. In the end, every code gives us a distribution and every distribution gives us a code.
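A hedged sketch of that length relationship (not a full arithmetic coder, and the probabilities are made up): an outcome with probability p needs about -log2(p) bits, and the real code is at most about one bit longer, so frequent outcomes get short codes.

    import math

    # Made-up model: probabilities of the symbols we want to transmit.
    model = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

    for symbol, p in model.items():
        ideal = -math.log2(p)          # ideal code length in bits
        actual = math.ceil(ideal)      # a real code needs a whole number of bits
        print(symbol, p, ideal, actual)

    # The sequence "aab" has probability 0.5 * 0.5 * 0.25 = 0.0625,
    # so it can be sent in roughly -log2(0.0625) = 4 bits (plus about one extra bit).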
MDL (Minimum Description Length) Principle = A model that can compress data has learned something about the data, and the better the compression, the more we've learned. MDL balances model complexity against fit by counting the bits needed to store the model plus the bits needed to store the data given the model. It says that compression and learning are strongly related.
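A toy sketch of that two-part idea (all bit counts here are invented for illustration): we compare models by the bits needed to describe the model plus the bits needed to describe the data given the model.

    # Invented bit counts for two candidate models of the same data set.
    models = {
        "simple model":  {"model_bits": 100, "data_given_model_bits": 900},
        "complex model": {"model_bits": 800, "data_given_model_bits": 350},
    }

    for name, cost in models.items():
        total = cost["model_bits"] + cost["data_given_model_bits"]
        print(name, total)  # MDL prefers the smallest total description length

    # Here the simple model wins (1000 bits vs. 1150), even though the complex
    # model compresses the data itself better.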
Think of it like a game of communication between a sender and a receiver. The sender sees
some data and comes up with a scheme to send it to the receiver. The sender and the
receiver are allowed to come up with any scheme they like before seeing the data. But
afterwards, the data must be sent using the scheme and in a way that is perfectly decodable
by the receiver without further communication. We assume that there is some language to