Machine Learning
Table of Contents
Week 1
Lecture 1 Introduction
Lecture 2 Linear Models and Search – 1
Week 2
Lecture 3 Methodology – 1
Lecture 4 Methodology – 2
Week 3
Lecture 5 Probability – 1
Lecture 6 Linear Models – 2
Week 4
Lecture 7 Deep learning
Lecture 8 Probability – 2
Week 5
Lecture 9 Deep Generative Models
Lecture 10 Probability – 1
Week 6
Lecture 11 Probability – 1
Lecture 12 Probability – 1
Week 7
Lecture 13 Reinforcement Learning
Quizzes
Testquiz
Quiz 1
Quiz 2
Quiz 3
Exam: recall, applied knowledge or active knowledge (calculations).
Open-book exam.
Week 1
Lecture 1 Introduction
What is machine learning?
Machine learning is often used in other software, in analytics, data mining, data science and statistics.
Machine learning: gives systems the ability to automatically learn and improve from experience without
being explicitly programmed.
Reinforcement learning: taking actions in a world based on delayed feedback.
Online learning: predicting and learning at the same time.
Offline learning: separate learning, predicting and action.
- Take a fixed dataset of examples (aka instances).
- Train a model to learn from these examples.
- Test the model to see if it works, by checking its predictions.
- If the model works, put it into production; i.e. use its predictions to take action.
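The four offline-learning steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data values are made up, and the "model" is a hypothetical one-feature threshold rule (predict class 1 when x exceeds a threshold), which is not a method the lecture prescribes.

```python
# Step 1: a fixed dataset of (feature, label) examples. Values are made up.
data = [(1.0, 0), (1.5, 0), (2.0, 0), (2.5, 0),
        (3.5, 1), (4.0, 1), (4.5, 1), (5.0, 1),
        (1.2, 0), (4.8, 1)]
train, test = data[:8], data[8:]


def accuracy(threshold, examples):
    """Fraction of examples for which 'x > threshold' matches the label."""
    correct = sum(1 for x, y in examples if int(x > threshold) == y)
    return correct / len(examples)


def fit_threshold(examples):
    """Step 2 (train): try every midpoint between neighbouring feature
    values and keep the one with the highest training accuracy."""
    xs = sorted(x for x, _ in examples)
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    return max(candidates, key=lambda t: accuracy(t, examples))


threshold = fit_threshold(train)

# Step 3 (test): check the predictions on examples not used for training.
test_accuracy = accuracy(threshold, test)


def predict(x):
    """Step 4 (production): use the trained model on a new input."""
    return int(x > threshold)
```

Learning, testing and predicting happen in separate phases here, which is what distinguishes offline from online learning.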
In machine learning we start with a problem (chess, driving a car), break it down into an abstract task (classification,
regression, clustering, etc.) and then build an algorithm to solve the abstract task (linear model, kNN, etc.).
Supervised tasks: explicit examples of input and output. Learn to predict the output for an unseen input. E.g.
classification and regression. Use linear models, tree models and NN models.
Unsupervised tasks: only inputs are provided. Find any pattern that explains something about the data. E.g.
clustering, density estimation, generative modelling.
Two supervised tasks:
- Classification: assign a class to each example.
- Regression: assign a number to each example.
ML is not the same as AI, data science, data mining, information retrieval, statistics or deep learning:
Data science but not ML: gathering data, harmonising data and interpreting data.
More data mining than ML: finding common clickstreams in web logs; finding fraud in transaction networks.
More ML than data mining: spam classification, predicting stock prices, learning to control a robot.
Statistics but not ML: analysing research results. Experiment design. Courtroom evidence.
More ML than statistics: Spam classification, movie recommendation.
Classification
Two spaces of machine learning: feature space (2D) and model space (3D)
Loss function: $\mathrm{loss}_{data}(model)$ = performance of the model on the data (the lower the better). It maps a choice of
model to a loss for the current data. For classification: e.g. the number of misclassified samples.
Loss function for regression, aka mean squared error (MSE) loss: $\mathrm{loss}(p) = \frac{1}{n} \sum_i (f_p(x_i) - y_i)^2$
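As a minimal sketch of a classification loss, the snippet below counts misclassified samples for two candidate models on the same data. The data points and the helper `make_line_classifier` are hypothetical, chosen only to show that the loss is a function of the model for a fixed dataset.

```python
def misclassification_loss(model, instances, labels):
    """Number of instances for which the model predicts the wrong class."""
    return sum(1 for x, y in zip(instances, labels) if model(x) != y)


def make_line_classifier(w, b):
    """Hypothetical model: predict class 1 if the point lies above
    the line x2 = w * x1 + b, otherwise class 0."""
    return lambda x: int(x[1] > w * x[0] + b)


# Made-up 2D instances with binary labels.
instances = [(1.0, 2.0), (2.0, 1.0), (3.0, 3.5), (4.0, 1.0)]
labels = [1, 0, 1, 0]

model_a = make_line_classifier(1.0, 0.0)  # the line x2 = x1
model_b = make_line_classifier(0.0, 3.0)  # the line x2 = 3

loss_a = misclassification_loss(model_a, instances, labels)
loss_b = misclassification_loss(model_b, instances, labels)
```

On this data `model_a` classifies every point correctly and `model_b` gets one point wrong, so `model_a` has the lower loss.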
Variations of classification:
- Features: usually numerical or categorical (e.g. binary).
- Binary classification (two classes) vs. multiclass classification (more than two classes).
- Multilabel classification: none, some or all classes may be true.
- Class probabilities/scores: the classifier reports a probability or score for each class.
Offline machine learning recipe:
- Abstract (part of) your problem to a standard task (e.g. classification, etc.).
- Choose your instances & their features (for supervised learning, choose a target)
- Choose your model class (linear model, decision tree, kNN).
- Search for a good model.
Regression: features of instance i = $x_i$; true label for $x_i$ = $y_i$; model prediction = $f(x_i)$: $\mathrm{loss}(f) = \frac{1}{n} \sum_i (f(x_i) - y_i)^2$
Unsupervised learning: clustering (e.g. k-means), generative modelling and density estimation.
Semi-supervised learning: e.g. self-training: you have a small set of labelled data (XL) and a large set of
unlabelled data (XU). Train classifier C on XL and then loop over: label XU with C and retrain C on XU + XL.
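The self-training loop can be sketched as follows. The classifier used here is a simple nearest-centroid rule on one feature, which is an assumption made for illustration (the notes do not say which classifier C is); the data values and the fixed number of rounds are also made up.

```python
def train_centroids(xs, ys):
    """Hypothetical classifier C: store the mean feature value per class."""
    centroids = {}
    for label in set(ys):
        members = [x for x, y in zip(xs, ys) if y == label]
        centroids[label] = sum(members) / len(members)
    return centroids


def predict(centroids, x):
    """Assign x to the class whose centroid is closest."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))


# Small labelled set XL and larger unlabelled set XU (made-up values).
x_labelled = [1.0, 9.0]
y_labelled = [0, 1]
x_unlabelled = [2.0, 3.0, 4.0, 7.0, 8.0]

# Train C on XL only.
classifier = train_centroids(x_labelled, y_labelled)

# Loop: label XU with C, then retrain C on XU + XL.
for _ in range(5):
    pseudo_labels = [predict(classifier, x) for x in x_unlabelled]
    classifier = train_centroids(x_labelled + x_unlabelled,
                                 y_labelled + pseudo_labels)
```

After the first round the pseudo-labels stop changing on this data, so further rounds leave the classifier as it is. In practice a stopping criterion would replace the fixed round count.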
Self-supervised learning: a large set of unannotated data is used without much manual annotation, e.g. for
natural language. The model is built on structure in the data itself. These are often deep learning models.
Should sensitive attributes be included in the data or not? To study bias we need these
attributes to be annotated. If we remove them, they may still be inferred from other features (postcode,
shopping habits, profile picture). Directly using a sensitive attribute (SA) may be preferable to using it indirectly. There are valid
use cases (e.g. race and sex affect medicine), but these often require a causal link.
What is input and what is target is not always clearly separated (embeddings, clustering, semi-supervised
learning, link prediction). Showing that sensitive attributes can be inferred, may serve as a warning to those
who are vulnerable.
Use sensitive attributes with extreme care: consider user communication over prediction; check the
distribution. Do not: imply causality or overrepresent what your predictions mean.
Machine learning is shallow: classification is a simplistic abstraction (male/female, race vs. ethnicity,
gay/straight, sex vs. gender). Models pick up on surface features first (even if deeper features are available).
Interpretability and responsibility are hard (we don’t know what models look at or how to make them look
elsewhere). 95% accuracy is not as impressive as it sounds (= 1 mistake in 20 attempts).
Never judge your model’s performance on the training data.
Solution: split your data into a training set and a test set. Choose your model based on the training data. The aim is not to
minimise the loss on the training data, but to minimise the loss on your test data. You don’t get to see the
test data until you’ve chosen your model.
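A small sketch of why training performance is misleading: a 1-nearest-neighbour classifier (chosen here as an assumed example, with made-up data) memorises its training set, so its training error is zero by construction, while its error on held-out test data can be much higher.

```python
def nearest_neighbour_predict(train, x):
    """Return the label of the training example closest to x."""
    nearest = min(train, key=lambda example: abs(example[0] - x))
    return nearest[1]


def error_rate(train, examples):
    """Fraction of examples that the 1-NN classifier gets wrong."""
    mistakes = sum(1 for x, y in examples
                   if nearest_neighbour_predict(train, x) != y)
    return mistakes / len(examples)


# Made-up data; the training point (3.0, 1) plays the role of a noisy label.
train = [(1.0, 0), (2.0, 0), (3.0, 1), (4.0, 0), (5.0, 1), (6.0, 1)]
test = [(2.6, 0), (3.4, 0), (4.6, 1), (5.6, 1)]

train_error = error_rate(train, train)  # every point is its own neighbour
test_error = error_rate(train, test)    # the noisy point misleads nearby inputs
```

Judged on the training data this model looks perfect. The test set shows that it has memorised the noisy example along with the rest.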
Machine learning is an empirical science.
Deductive reasoning: all men are mortal; Socrates is a man; therefore Socrates is mortal (discrete,
unambiguous, provable, known rules).
Inductive reasoning: the sun has risen in the east every day of my life, so it will do so again tomorrow (fuzzy,
ambiguous, experimental, unknown rules).
Simplicity, Occam's razor: all else being equal, prefer the simpler solution.
Lecture 2 Linear Models and Search – 1
Linear Regression
Notation used:
- Lowercase non-bold for scalars: $x, y, z$ (i.e. single numbers).
- Lowercase bold for vectors: $\mathbf{x}, \mathbf{y}, \mathbf{z}$ (columns of numbers).
- Uppercase bold for matrices: $\mathbf{X}, \mathbf{Y}, \mathbf{Z}$ (grids of numbers).
- $x_i$: scalar element of vector $\mathbf{x}$; $X_{ij}$: scalar element of matrix $\mathbf{X}$.
- $\mathbf{x}_i$: instance i in the data; $x_j$: feature j (of some instance).
Features can be represented in a vector, with each element being a feature.
Model for one feature: $f_{w,b}(x) = wx + b$
w is the weight (= coefficient) and b the bias (= intercept).
b determines where the line crosses the vertical axis (when x = 0).
w determines how much the line rises if we move one step to the right.
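Model for two features: $f_{w_1,w_2,b}(\mathbf{x}) = w_1 x_1 + w_2 x_2 + b$ (it spans a plane)
Model for n features: $f_{\mathbf{w},b}(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b$ with $\mathbf{w} = (w_1, \dots, w_n)^T$ and $\mathbf{x} = (x_1, \dots, x_n)^T$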
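Example: predict blood pressure from three features: job stress, healthy diet and age.
$f_{\mathbf{w},b}(\mathbf{x}) = w_1 x_1 + w_2 x_2 + \dots + b = \mathbf{w}^T \mathbf{x} + b$
where $\mathbf{w}^T \mathbf{x}$ is a dot product: $\mathbf{w}^T \mathbf{x} = \mathbf{w} \cdot \mathbf{x} = \sum_i w_i x_i = \|\mathbf{w}\| \|\mathbf{x}\| \cos \alpha$ (with $\alpha$ the angle between $\mathbf{w}$ and $\mathbf{x}$)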
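The two forms of the dot product can be checked numerically. The sketch below uses arbitrary 2D vectors (made-up values) and computes the angle between them independently, from their directions, so that the geometric form is not derived from the algebraic one.

```python
import math


def dot(w, x):
    """Algebraic form: sum of element-wise products."""
    return sum(w_i * x_i for w_i, x_i in zip(w, x))


def linear_model(w, b, x):
    """f_{w,b}(x) = w^T x + b for a feature vector x."""
    return dot(w, x) + b


w = [3.0, 4.0]
x = [4.0, 3.0]

algebraic = dot(w, x)

# Geometric form: product of the vector lengths times the cosine of the angle.
norm_w = math.sqrt(dot(w, w))
norm_x = math.sqrt(dot(x, x))
alpha = math.atan2(w[1], w[0]) - math.atan2(x[1], x[0])
geometric = norm_w * norm_x * math.cos(alpha)

# Adding a bias (here an arbitrary b = 1.0) gives the model output.
prediction = linear_model(w, 1.0, x)
```

Both forms give 24 for these vectors, up to floating-point rounding in the geometric one.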
Which model fits best?
Use two more ingredients: loss function and search method.
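Search for a well-fitting model: try to reduce the mean squared error (also called sum-of-squares
loss). Slight variations on the mean squared error exist.
Mean squared error loss:
$\mathrm{loss}_{X,T}(p) = \frac{1}{n} \sum_i (f_p(\mathbf{x}_i) - t_i)^2$
$\mathrm{loss}_{X,T}(\mathbf{w}, b) = \frac{1}{n} \sum_i (\mathbf{w}^T \mathbf{x}_i + b - t_i)^2$
where $t_i$ is the target value for instance $\mathbf{x}_i$.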
The loss function maps every point in the model space to a loss value. Here, the instance space is just the x
axis.
2. Searching for a good model
Feature space vs. model space: these are the two most important spaces in machine
learning. Feature space: every example in your data is a point in this space.
Model space: every model is a point in this space. A line in feature space
(wx + b) corresponds to a single point in model space (horizontal axis = w, vertical axis = b),
namely the point (w, b).
Loss surface or loss landscape: plot the loss for every point in the model
space.
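A loss surface can be approximated by evaluating the loss on a grid of points in model space. The sketch below does this for a one-feature linear model; the data, the grid range and the step size are arbitrary choices for illustration. Picking the grid point with the lowest loss is a crude form of search and is not presented here as the lecture's method.

```python
def mse_loss(w, b, xs, ts):
    """Mean squared error of the line w*x + b on the data."""
    return sum((w * x + b - t) ** 2 for x, t in zip(xs, ts)) / len(xs)


# Made-up data lying exactly on the line t = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ts = [1.0, 3.0, 5.0, 7.0]

# Grid over model space: w and b each range from -3.0 to 3.0 in steps of 0.5.
grid = [i * 0.5 for i in range(-6, 7)]

# Loss surface: one loss value for every (w, b) point on the grid.
loss_surface = {(w, b): mse_loss(w, b, xs, ts) for w in grid for b in grid}

# The best model on the grid is the lowest point of the surface.
best_w, best_b = min(loss_surface, key=loss_surface.get)
```

Each key of `loss_surface` is a point in model space and each value is the height of the surface at that point. Plotting these values over the (w, b) plane gives the loss landscape.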