Machine Learning Week 1:
Lecture 1: Introduction
Deductive reasoning: discrete, unambiguous, known rules
Inductive reasoning: ambiguous, experimental, unknown rules -> the problem cannot easily be
captured by a few explicit rules.
Machine learning: provides systems with the ability to automatically learn and improve from
experience without being explicitly programmed (used inside other software, in analytics/data
mining, and in science/statistics)
What makes a good ML problem?
● can’t solve it explicitly, approximate solutions are fine
● limited reliability, predictability, and interpretability are fine
● plenty of examples to learn from
- e.g. recommending a movie, predicting driving time
Abstract tasks generalize the problem of learning
Online learning: acting and learning at the same time
Reinforcement learning: online learning in a world based on delayed feedback
Offline learning: separate learning and acting - take a fixed dataset of examples, train a
model to learn from these examples, test the model to see if it works
Solve ML problems in general by using a few abstract tasks like classification, regression or
clustering.
1. Supervised: explicit examples of input and output (labeled data); learn to predict the output
for an unseen input (e.g. classifying emails as spam or not).
- Classification: assign a class to each example
Instances : data we provide our system (examples)
Features of the instance: things we measure
Target value: thing we are trying to learn
The dataset is fed to a learning algorithm, which produces a classifier: a small machine that
attempts to solve the learning problem.
Examples: handwriting recognition, playing chess -> we translate real problems into these
abstract tasks.
Feature space: features plotted against each other
1. Linear classifier: draw a line between the classes. Since a line is defined by
two parameters, we can also plot each model as a point in a model space (the
axes are the two parameters of the line). We define a loss function which tells
us how well a particular model does on our data.
Loss function(model) = performance of the model on the data. It takes the
model as input and treats the data as a constant.
2. Decision Tree classifier: studies one feature in isolation at every node.
3. K-nearest neighbour: for a new point, look at the k points that are closest
and assign the class that is most frequent in that set (a minimal sketch
follows after the variations below).
k is a hyperparameter: you have to choose it yourself.
Variations:
Features are usually numerical or categorical.
Binary classification: a task with two classes
Multiclass classification: more than two classes
Multilabel classification: none, some or all classes may be true (not part of this course)
Class probabilities/scores: the classifier reports a probability for each class
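To make k-nearest neighbour concrete, here is a minimal Python sketch (not from the lectures); the toy dataset, the function name knn_predict and the choice k=3 are illustrative assumptions.

from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    # Distance from the query to every training instance.
    distances = [(math.dist(query, p), label)
                 for p, label in zip(train_points, train_labels)]
    # Take the k closest instances and return the most frequent class among them.
    distances.sort(key=lambda d: d[0])
    nearest = [label for _, label in distances[:k]]
    return Counter(nearest).most_common(1)[0][0]

# Toy data: two features per instance, two classes.
X = [(1.0, 1.0), (1.2, 0.8), (4.0, 4.2), (3.8, 4.0)]
y = ["blue", "blue", "red", "red"]
print(knn_predict(X, y, query=(1.1, 0.9), k=3))  # -> "blue"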
- Regression: assign a number to each example; predict a number instead of a
class. The output space and the feature space are drawn in the same plot as the
model. We want to model the relation between the features and the target.
Loss function for regression: loss(f) = (1/n) Σ_i (f(x_i) - y_i)^2   (MSE)
Regression tree: segment feature space into blocks and assign each a
number.
-> Overfitting (memorising the data instead of generalising from it). Training
loss: the loss of a given model on the training data. Never judge performance
on the training data; leave some data unseen.
The aim of machine learning is not to minimise the loss on the training data,
but to minimise the loss on your test data. You don’t get to see the test data
until you’ve chosen your model! (See the small train/test sketch below.)
The problem of machine learning is to choose a model that fits the pattern
and ignores the noise.
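As a small illustration of the MSE loss and of judging a model on held-out data rather than on the training data, a Python sketch; the 1D data, the split and the example model f(x) = 1.0x + 0.05 are made-up assumptions.

def mse(model, xs, ys):
    # Mean squared error: the average of the squared residuals f(x_i) - y_i.
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical 1D regression data, split into a training and a test part.
x_train, y_train = [1.0, 2.0, 3.0], [1.1, 1.9, 3.2]
x_test, y_test = [4.0, 5.0], [3.9, 5.1]

model = lambda x: 1.0 * x + 0.05  # an assumed linear model f(x) = wx + b

print("training loss:", mse(model, x_train, y_train))
print("test loss:", mse(model, x_test, y_test))  # this is the loss that counts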
2. Unsupervised: only inputs provided, find any pattern that explains something about
the data
- Clustering: ask the learner to split the instances into a number of clusters
(unsupervised classification).
K-means: assign each point to its nearest mean, recompute the means, and
repeat until the algorithm converges to a stable state (a minimal sketch
follows at the end of this section).
- Density Estimation: we want to learn how likely new data is. It thus produces
a number. The task of modelling the probability distribution behind your data.
- Generative Modeling: building a model from which you can sample new
examples is called generative modeling. -> deep learning
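A bare-bones Python sketch of the k-means loop described above; the random initialisation from the data points and the fixed iteration count (instead of a proper convergence check) are simplifying assumptions.

import math
import random

def k_means(points, k, iterations=10):
    # Initialise the means by picking k random data points.
    means = random.sample(points, k)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest mean.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, means[i]))
            clusters[nearest].append(p)
        # Update step: each mean moves to the centre of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                means[i] = tuple(sum(c) / len(cluster) for c in zip(*cluster))
    return means

data = [(0.1, 0.2), (0.0, 0.1), (5.0, 5.1), (5.2, 4.9)]
print(k_means(data, k=2))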
Lecture 2: Linear Models 1
Linear models and search: search methods are extremely versatile.
- Linear Regression: with one feature, f(x) = wx + b. If we have two features, each feature gets
its own weight. With multiple features, we arrange the features of a single instance into a
vector.
-> For an arbitrary number of features: take the dot product of the feature vector and the
weight vector and add a bias, f(x) = w·x + b. The weights and the bias are the parameters of
the model (a minimal sketch follows below).
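A small Python sketch of this linear model: the prediction is the dot product of the weights with the features, plus the bias (the weight values and the instance are made up).

def predict(weights, bias, features):
    # f(x) = w · x + b: dot product of weights and features, plus the bias.
    return sum(w * x for w, x in zip(weights, features)) + bias

w = [0.5, -1.2, 2.0]  # one weight per feature (assumed values)
b = 0.1               # bias
x = [1.0, 0.0, 3.0]   # one instance with three features
print(predict(w, b, x))  # 0.5*1.0 - 1.2*0.0 + 2.0*3.0 + 0.1 = 6.6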
Which model fits our data best?
1. Loss function: which tells us how well a particular choice of model does
2. Search method: search the space of all models for a particular model that
results in a low loss.
● Common Loss function for regression: MSE. Squares ensure that negative and
positive residuals don’t cancel out, but also that big errors affect the loss more
heavily than small errors.
-> the loss function maps every point in the model space to a loss value.
● Optimization: p̂ = arg min_p loss_X,Y(p). We are trying to find the input (the model
parameters) for which a particular function (the loss) is at its minimum, i.e. find the lowest
possible value.
1. Random search: start with a random point p; loop: pick a point p’ close to p,
and if loss(p’) is smaller than loss(p), set p ← p’. Analogy: a hiker in a snowstorm!
It chooses the next point by sampling uniformly among all points within some
pre-chosen distance.
It works well because our problem is convex: a line drawn between any two
points on the loss surface lies entirely above the surface, which implies there is
only one minimum.
-> What if the loss surface has multiple local minima? Random search will
go to one of the local minima and get stuck there.
Trick: simulated annealing: sometimes accept a next point even if it isn’t better
than the current one, but only with a small probability. The search then still has
some probability of escaping a local minimum.
In many situations local minima are fine (a sketch of both follows below).
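A compact Python sketch of random search with an optional simulated-annealing escape; the convex toy loss, the step size and the escape probability are illustrative assumptions, not the course's settings.

import random

def toy_loss(p):
    # An assumed convex toy loss with its minimum at p = 3.
    return (p - 3.0) ** 2

def random_search(loss, p=0.0, steps=1000, step_size=0.1, escape_prob=0.0):
    for _ in range(steps):
        # Pick a candidate p' close to the current point p.
        candidate = p + random.uniform(-step_size, step_size)
        if loss(candidate) < loss(p):
            p = candidate  # plain random search: only accept improvements
        elif random.random() < escape_prob:
            p = candidate  # simulated annealing: occasionally accept a worse point
    return p

print(random_search(toy_loss))                    # pure random search
print(random_search(toy_loss, escape_prob=0.05))  # with a chance of escaping local minima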
● Variations in step size: fixed radius, random uniform or normally distributed.
● The space of linear models is continuous; if the space is discrete, as in tree models,
you need to define which models are close to each other.
● Black Box Optimization: random search, simulated annealing. Very simple. We only
need to compute the loss function for each model. Can require many iterations. Also
works for discrete model spaces.
● Parallel Search: run random search a couple of times independently.
- Introduce some form of communication between the searches to make them
more useful: population methods.
Evolutionary algorithms: rank the population by loss, remove the worst half,
and ‘breed’ a new population (crossover operator); see the toy sketch below.
-> Slow and expensive for complex models, difficult to tune.
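A toy Python sketch of this population loop; the one-dimensional loss, the population size and the averaging ‘crossover’ with a little noise are simplifying assumptions.

import random

def loss(p):
    # Assumed toy loss with its minimum at p = 3.
    return (p - 3.0) ** 2

population = [random.uniform(-10, 10) for _ in range(20)]
for _ in range(50):
    # Rank the population by loss and keep the better half.
    population.sort(key=loss)
    survivors = population[: len(population) // 2]
    # 'Breed' a new half: a crude crossover that averages two random parents.
    children = [(random.choice(survivors) + random.choice(survivors)) / 2
                + random.gauss(0, 0.1)
                for _ in range(len(population) - len(survivors))]
    population = survivors + children

print(min(population, key=loss))  # best individual found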
● Branching Search: the more closely we inspect our local neighbourhood, the faster
we converge.