Machine Learning Summary
LECTURE 1: INTRODUCTION
Supervised learning: we are given explicit examples of inputs and their outputs, and we learn to predict the output for an unseen input
Unsupervised learning: only inputs are provided, and we find any pattern that explains something about the data
Classification: assigning a class to each example
• The examples in the dataset are instances
• The things we measure for the examples are the features
• The thing we are trying to learn is the label (target value)
• The dataset is fed to a learning algorithm, which produces a classifier as a model that tries to predict the correct class
• Examples of classifiers: the linear, decision-tree and nearest-neighbour classifiers
Feature space: every feature is represented by an axis and every instance is a point in that space
Model space: every parameter of the model is represented by an axis, and the points are possible models (e.g. lines in the feature space)
• We want to search the model space for a model that fits the data well; for this we use a loss function
• Loss function: measures the performance of the model on the data, the lower the better
• The loss function takes the model as its parameter and treats the data as a constant
Variations of classification:
• Features are either numerical or categorical
• Binary classification: 2 classes
• Multi-class classification: more than 2 classes
• Multi-label classification: none, some or all labels may be true
• Class probabilities: the classifier reports a probability for each class
Basic recipe: abstract your problem to a standard task, choose your instances and their features,
choose your model class and then search for a good model.
Regression: assigning a number to each example. A good loss function for regression is the mean-squared-error (MSE) loss: loss(f) = (1/n) Σ_i (f(x_i) − t_i)²
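A minimal numpy sketch of this loss; the example model f(x) = 2x + 1 and the data are made up for illustration:

```python
import numpy as np

def mse_loss(f, xs, ts):
    """Mean-squared-error loss: average squared difference between
    predictions f(x_i) and targets t_i."""
    predictions = np.array([f(x) for x in xs])
    return np.mean((predictions - np.array(ts)) ** 2)

# example: the model f(x) = 2x + 1 on three instances
print(mse_loss(lambda x: 2 * x + 1, xs=[0, 1, 2], ts=[1, 3, 6]))  # -> 0.333...
```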
Clustering: we ask the learner to split the instances into a number of clusters. The number of
clusters is usually given beforehand by the user. An example is k-means clustering.
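A minimal scikit-learn sketch of k-means; the random 2-D data and the choice of k = 3 clusters are made up for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(100, 2)              # 100 instances with 2 features each

# the number of clusters is given beforehand by the user
kmeans = KMeans(n_clusters=3, n_init=10)
cluster_labels = kmeans.fit_predict(X)  # cluster index (0, 1 or 2) per instance
print(cluster_labels[:10])
```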
Density estimation: predict a number for each instance, and that number expresses how likely
the model thinks the given instance is.
Generative modelling: building a model from which you can sample new examples
Self-supervised learning: refers to different ways in which a large unlabelled dataset can be used to train a model such that little or no annotation is required.
When the model is fitting details of the dataset that are random noise, the model is overfitting. When a model overfits, we sometimes say that it is memorising the data when it should be generalising.
LECTURE 2: LINEAR MODELS AND SEARCH
Linear regression: f_{w,b}(x) = w_1 x_1 + w_2 x_2 + ... + w_n x_n + b = wᵀx + b = Σ_i w_i x_i + b
• Each feature gets its own weight w_i and we add a bias b
• MSE loss: loss_{X,T}(w, b) = (1/n) Σ_j (wᵀx_j + b − t_j)², other variations are possible too
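A numpy sketch of these two definitions; the weights, bias and data below are made-up values:

```python
import numpy as np

def predict(X, w, b):
    """Linear regression: f_{w,b}(x) = w^T x + b for every row of X."""
    return X @ w + b

def mse_loss(X, T, w, b):
    """MSE loss of the model (w, b) on the dataset (X, T)."""
    return np.mean((predict(X, w, b) - T) ** 2)

X = np.array([[1.0, 2.0], [2.0, 0.0], [0.0, 1.0]])  # 3 instances, 2 features
T = np.array([5.0, 3.0, 2.0])                       # target values
w, b = np.array([1.0, 1.5]), 0.5
print(mse_loss(X, T, w, b))
```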
The loss function maps each point in the model space to a loss value. If we plot these values, we get the loss surface or loss landscape. To minimise the loss, we need to find the brightest point in the loss surface (plotted as a heat map) or, equivalently, the lowest point in the loss landscape. Important: we want to minimise the loss on the test data, while seeing only the training data. Optimise: p̂ = argmin_p loss_{X,T}(p)
One way to find the optimal parameters is random search: take a small step to a nearby point; if the loss goes up we move back to our previous point, and if it goes down we stay at the new point, and we repeat the process (see the sketch after this list).
• This works well because our problem is convex (bowl-shaped), which implies that any point that looks like a minimum locally is a global minimum.
• If the loss is not convex and there are multiple local minima, random search can get stuck at a local minimum and thus never reach the optimum.
• Simulated annealing: if the next point chosen isn't better than the current one, we still pick it, but only with some small probability. This means that whenever the search gets stuck in a local minimum, it still has some probability of escaping and finding the global minimum.
• Note that this doesn't mean that local minima are always bad solutions.
• Parallel search is another option: run random search a couple of times independently; if you're lucky, one of these runs may start off close enough to the global minimum. For simulated annealing, doing multiple runs makes less sense.
• An example of a population method is an evolutionary algorithm: e.g. start with a
population of 50 models, and compute the loss for each, then kill the worst 50% and keep
the best 50%. Then create a new population by randomly pairing up parents from the best
50%, and taking the point halfway between the two parents, with a little noise added. We
take these as the new population and repeat the process.
• All these search methods are instances of black box optimisation: they are simple and only need to be able to compute the loss, but they can require many iterations; they also work for discrete model spaces.
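A minimal sketch of random search, with an optional acceptance probability that turns it into a crude form of simulated annealing; the toy convex loss, step size, escape probability and iteration count are made-up values:

```python
import numpy as np

def random_search(loss, p, steps=1000, step_size=0.1, escape_prob=0.0):
    """Black-box search: try a nearby point, keep it if the loss goes down.
    With escape_prob > 0 a worse point is still accepted with that small
    probability, which lets the search escape local minima."""
    current_loss = loss(p)
    for _ in range(steps):
        candidate = p + np.random.randn(*p.shape) * step_size  # small step
        candidate_loss = loss(candidate)
        if candidate_loss < current_loss or np.random.rand() < escape_prob:
            p, current_loss = candidate, candidate_loss        # stay at new point
    return p

# toy convex loss: squared distance to (3, -1), so the local minimum is global
loss = lambda p: (p[0] - 3.0) ** 2 + (p[1] + 1.0) ** 2
print(random_search(loss, p=np.zeros(2)))                    # plain random search
print(random_search(loss, p=np.zeros(2), escape_prob=0.01))  # annealing-style
```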
Gradient descent: using calculus we can find the direction in which the loss drops most quickly (the direction of steepest descent); this direction is the opposite of the gradient (the direction of steepest ascent). Gradient descent takes small steps in this direction in order to find the minimum of a function.
Gradient: ∇f(x, y) = (∂f/∂x, ∂f/∂y)
Gradient descent algorithm: p ← p − η∇loss(p)
• Pick a random starting point p, compute the gradient, subtract it from the current choice and repeat the process. We use the learning rate η to scale down the step size
• For 1-D linear regression f(x) = wx + b with MSE loss, the update (next guess ← current guess − η · gradient) works out to:
  w ← w − η · (2/n) Σ_i (w x_i + b − t_i) x_i
  b ← b − η · (2/n) Σ_i (w x_i + b − t_i)
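A numpy sketch of this update rule for 1-D linear regression; the data, learning rate and number of epochs are made-up values:

```python
import numpy as np

def gradient_descent(x, t, lr=0.05, epochs=1000):
    """Gradient descent on the MSE loss of the model f(x) = w*x + b."""
    w, b = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        residual = w * x + b - t                 # (w x_i + b - t_i)
        grad_w = (2 / n) * np.sum(residual * x)  # d loss / d w
        grad_b = (2 / n) * np.sum(residual)      # d loss / d b
        w -= lr * grad_w                         # p <- p - eta * gradient
        b -= lr * grad_b
    return w, b

x = np.array([0.0, 1.0, 2.0, 3.0])
t = 2.0 * x + 1.0                # data generated with w = 2, b = 1
print(gradient_descent(x, t))    # should approach (2.0, 1.0)
```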
+ gradient descent is very fast, uses little memory, is very accurate and is the backbone of 99% of ML
- it only works for continuous model spaces with smooth loss functions for which we can work out the gradient, and it does not escape local minima
Linear classifier: a classifier whose decision boundary is always a line (or hyperplane) in feature space. We can use the line wᵀx + b = 0, where w is perpendicular to the decision boundary.
• 3 common loss functions: least squares loss, log loss / cross entropy, SVM loss
• Least-squares loss: loss(w, b) = Σ_{i∈pos} (wᵀx_i + b − 1)² + Σ_{i∈neg} (wᵀx_i + b + 1)², where the positive class is coded as +1 and the negative class as −1
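A numpy sketch of this least-squares classification loss, coding the positive class as +1 and the negative class as −1; the example instances, weights and bias are made up:

```python
import numpy as np

def least_squares_loss(w, b, X_pos, X_neg):
    """Least-squares loss for the linear classifier w^T x + b:
    positive instances should score +1, negative instances -1."""
    pos = np.sum((X_pos @ w + b - 1.0) ** 2)
    neg = np.sum((X_neg @ w + b + 1.0) ** 2)
    return pos + neg

X_pos = np.array([[2.0, 2.0], [3.0, 1.0]])    # instances of the positive class
X_neg = np.array([[-1.0, 0.0], [0.0, -2.0]])  # instances of the negative class
print(least_squares_loss(np.array([0.5, 0.5]), -0.5, X_pos, X_neg))
```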
LECTURE 3: MODEL EVALUATION
Binary classification: a positive class and a negative class, where the classifier is often a detector for the positive class.
• Error: proportion of misclassifications
• Accuracy: proportion of correct classifications
Never judge the performance of your model on the training data: withhold some data (the test data) on which you can test the performance of the model trained on the training data.
• The proportion between test and training data is not important, the absolute size of the test set is
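A scikit-learn sketch of withholding test data; the toy dataset and the logistic-regression model are stand-ins for whatever model you are evaluating:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=5, random_state=0)

# withhold 20% of the data as test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LogisticRegression().fit(X_train, y_train)  # train on training data only
print(model.score(X_test, y_test))                  # accuracy on withheld test data
```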
Hyperparameter: a parameter you choose before you start running the algorithm; it is used to control the learning process.
Only run once on the test data, because the more you use the test data, the less reliable your conclusions become. Reusing test data causes you to pick the wrong model and inflates your performance estimate.
You can also include a validation set, which is ideally the same size as the test set (or a bit smaller)
• During model & hyperparameter selection you train on the training set and test on the validation set
• For the final run you train on the training and validation sets and test on the test set
Cross-validation: split the training data into 5 folds and, for each specific choice of hyperparameters that you want to test, do 5 runs, each with a different fold as validation data. Then average the scores of these runs.
• Useful when you would be left with little training data after splitting off test and validation sets
• Can be costly because you need to train 5 times as many classifiers
• After selecting your hyperparameters with cross-validation, you still test once on the test data
• For cross-validation on time-sensitive data, you have to slice your dataset so that you start with a small set and add a little more data each time, with the most recent slice as the validation set
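A scikit-learn sketch of 5-fold cross-validation for one specific hyperparameter choice; the toy dataset and the SVM model are stand-ins:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=5, random_state=0)

# 5 runs, each with a different fold as validation data; average the scores
scores = cross_val_score(SVC(C=1.0), X, y, cv=5)
print(scores.mean())

# for time-sensitive data, a growing-window splitter can be used instead, e.g.
# cross_val_score(SVC(C=1.0), X, y, cv=TimeSeriesSplit(n_splits=5))
```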
Which hyperparameters to try?
• Trial-and-error: most often used (use intuition)
• Grid search: define a finite set of values per hyperparameter, try all combinations
• Random search
For automated search, it is often better to try random samples in the hyperparameter space than to exhaustively check a grid. If one parameter turns out not to be important and another does, a 3×3 grid restricts us to only three samples of the important parameter, at the cost of training nine different models. If we randomise the samples, we get nine different values of each parameter at the same cost.
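A sketch of random hyperparameter search with nine samples (the same budget as a 3×3 grid); the SVM model and the parameter ranges are made-up choices:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
rng = np.random.default_rng(0)

best_score, best_params = -np.inf, None
for _ in range(9):  # nine random samples give nine distinct values per parameter
    params = {"C": 10 ** rng.uniform(-2, 2), "gamma": 10 ** rng.uniform(-3, 1)}
    score = cross_val_score(SVC(**params), X, y, cv=5).mean()
    if score > best_score:
        best_score, best_params = score, params

print(best_params, best_score)
```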
We can observe the sample accuracy of classifier C on the test sample, but we can't observe the true accuracy of C; therefore we may want to compute a confidence interval: an interval around our sample accuracy that captures a given proportion of the probability mass (e.g. 95%).
• We can define precisely what distribution to expect for the value of the sample accuracy if we keep the classifier and the true accuracy fixed, but resample the test data
• The larger the test set, the smaller the confidence interval. Avoid small test sets.
We can observe the estimated CI but not the true CI.
• Don’t say: the probability that the true mean is in this CI is 95%
• Do say: if we repeat the experiment many times, computing the CI each time, the true mean would be inside the interval in 95% of those experiments.
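A sketch of the usual normal-approximation 95% confidence interval for a sample accuracy; the accuracy value and the test set sizes are made-up numbers:

```python
import math

def accuracy_confidence_interval(accuracy, n_test, z=1.96):
    """95% CI for the sample accuracy, using a normal approximation to the
    distribution of the accuracy over resampled test sets of size n_test."""
    standard_error = math.sqrt(accuracy * (1 - accuracy) / n_test)
    return accuracy - z * standard_error, accuracy + z * standard_error

print(accuracy_confidence_interval(0.85, n_test=100))    # wide interval
print(accuracy_confidence_interval(0.85, n_test=10000))  # much narrower
```
The larger test set shrinks the interval by a factor of ten, which is the "avoid small test sets" advice in numbers.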
If we sample more data, the estimate of our standard deviation (a measure of spread, like the variance) becomes more accurate. The error bars don't get smaller, they just get closer to their correct size. The standard error and the confidence interval are indicators of how confident we are about our estimate of the value the bar represents. For these, the more data we have, the smaller they get.