Machine Learning
Lecture 1 – Introduction
You can have a collection of rules that tells the program what to do. You can write these rules by hand, apply them, and test them. Then you notice whether they work or not, and change them accordingly. You are automating the task, but the rule-writing itself is done by hand. With Machine Learning you take automation a step further: we want the machine itself to learn the rules. How would that go? You need to collect some information about the distribution of words or sequences and learn from examples. This is the basis of supervised learning (sketched in code after the steps below):
Find examples of SPAM and non-SPAM
Come up with a learning algorithm
A learning algorithm infers rules from examples
These rules can then be applied to new data (emails)
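A minimal sketch of this recipe, assuming scikit-learn and a few made-up toy emails (the data, the bag-of-words features, and the logistic-regression learner are illustrative choices, not part of the lecture):

```python
# Sketch of the SPAM recipe: collect labeled examples, run a learning
# algorithm, apply the inferred rules to new emails.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data (invented): emails paired with SPAM / non-SPAM labels.
emails = [
    "win a free prize now", "cheap pills limited offer",
    "meeting rescheduled to friday", "lecture notes attached",
]
labels = ["spam", "spam", "ham", "ham"]

# The learning algorithm infers rules (here: word weights) from the examples.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(emails, labels)

# The inferred rules can then be applied to new, unseen emails.
print(model.predict(["free prize offer"]))  # likely ['spam']
```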
Types of learning problems
Machine Learning has an input space and an output space. The nature of the output determines which kind of machine learning problem we are talking about.
Regression
Regression involves estimating or predicting a response. The response (the output variable) takes continuous values, i.e. a real number (sketched after the examples below).
Predict person’s age
Predict price of a stock
Predict student’s score on exam
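A hedged sketch of regression, assuming invented study-hours/exam-score pairs and scikit-learn's linear regression (both are illustrative assumptions):

```python
# Regression: the output is a real number, not a class label.
import numpy as np
from sklearn.linear_model import LinearRegression

hours = np.array([[2.0], [4.0], [6.0], [8.0]])  # input: hours studied (toy data)
scores = np.array([55.0, 65.0, 75.0, 85.0])     # output: continuous exam scores

model = LinearRegression().fit(hours, scores)
print(model.predict([[5.0]]))  # a real-valued prediction, here ~70.0
```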
Binary classification
The output variable takes class labels, and the input is classified into one of two groups: a yes/no answer, e.g. true/false or 1/0.
Detect SPAM
Predict polarity of product review: positive or negative
Predict gender: male or female
Multiclass classification
The output is one of a finite set of options; there can be many (even thousands of) labels / classes / categories. Each training point belongs to one of n different classes. The goal is to construct a function which, given a new data point, correctly predicts the class to which the new point belongs (see the sketch after the examples below).
Classify subject newspaper articles: politics, sports, science, technology, health, etc.
Detect species based on photo: passer domesticus, calidris alba, etc.
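A toy sketch of picking one class out of a finite set; the two-dimensional features, the species labels, and the nearest-neighbour learner are all assumptions made for illustration:

```python
# Multiclass classification: the model predicts exactly one of n classes.
from sklearn.neighbors import KNeighborsClassifier

X = [[1.0, 0.2], [0.9, 0.3], [0.2, 1.1], [0.1, 0.9], [0.5, 0.5], [0.6, 0.4]]
y = ["passer domesticus", "passer domesticus",
     "calidris alba", "calidris alba",
     "other", "other"]

clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(clf.predict([[0.95, 0.25]]))  # one label out of the finite set
```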
Multilabel classification
Multilabel classification is a classification problem where multiple target labels can be assigned to each observation instead of only one. A multilabel classifier has to produce a vector of output values, each a yes/no answer. You can think of it as a set of binary classifications, one per label (see the sketch after the genre list below).
Assign songs to one or more genres:
o {rock, pop, metal}
o {hip-hop, rap}
o {jazz, blues}
o {rock, punk}
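A small sketch of the output representation, assuming scikit-learn's MultiLabelBinarizer; it turns the genre sets above into the yes/no vectors a multilabel classifier has to produce:

```python
# Multilabel output: one binary (yes/no) decision per label, per observation.
from sklearn.preprocessing import MultiLabelBinarizer

song_genres = [
    {"rock", "pop", "metal"},
    {"hip-hop", "rap"},
    {"jazz", "blues"},
    {"rock", "punk"},
]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(song_genres)
print(mlb.classes_)  # the label columns (alphabetical)
print(Y)             # one 0/1 vector per song
```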
Ranking
Order objects according to relevance, e.g. ranking models for information retrieval systems. The training data consists of lists of items with some partial order specified between the items in each list.
Rank web pages in response to user query
Predict student’s preference for courses in a program
Sequence labelling
A type of pattern recognition task that involves the algorithmic assignment of a categorical label to each member of a sequence of observed values (e.g. part-of-speech tagging). The input is a sequence of elements (words) and the response is a corresponding sequence of labels (a toy sketch follows the examples below).
Label words in a sentence with their syntactic category
Label frames in a speech signal with the corresponding phonemes (w, ð, ɛ, ɚ)
o Sequence labelling: N inputs, N outputs
o Sequence-to-sequence: N inputs, M outputs (N not necessarily equal to M)
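A toy sketch of the N-inputs/N-outputs shape of sequence labelling; the dictionary tagger is a deliberate simplification (real taggers learn the mapping from data):

```python
# Sequence labelling: one categorical label per element of the input sequence.
TAGS = {"the": "DET", "cat": "NOUN", "sat": "VERB", "on": "ADP", "mat": "NOUN"}

def tag(words):
    # N input words produce N output tags (unknown words get a fallback tag).
    return [TAGS.get(w, "X") for w in words]

sentence = ["the", "cat", "sat", "on", "the", "mat"]
print(list(zip(sentence, tag(sentence))))
```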
Autonomous behaviour
The inputs are measurements from sensors – camera, microphone, radar, accelerometer, etc. – and the responses are instructions for actuators – steering, accelerator, brake, etc.
Supervised learning is very often improved with reinforcement learning: learning from a sequence of actions using positive and negative feedback. Supervised learning is not the end of the story; sometimes it is not really applicable. Unsupervised learning has also become a very important approach.
In what situation do you use F1 score instead of accuracy?
When the class distribution is uneven, or when false positives and false negatives have different costs; accuracy can then be misleading (see the F-score section below).
Evaluation
How well is the algorithm learning? You can evaluate the performance by using different evaluation metrics.
Mean Absolute Error
The average absolute difference between the true value and the predicted value.
Mean Squared Error
The average square of the difference between true value and predicted value.
The aforementioned metrics can be used for predicting age (regression, numerical output), often with a preference for MSE. The MSE exaggerates outliers (squaring magnifies big errors), while the MAE does not.
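A small sketch of both metrics on invented true/predicted ages, to show how squaring exaggerates the one large error:

```python
# MAE vs MSE on toy age predictions.
import numpy as np

y_true = np.array([23.0, 41.0, 35.0, 60.0])
y_pred = np.array([25.0, 39.0, 30.0, 75.0])  # last prediction is 15 off

mae = np.mean(np.abs(y_true - y_pred))   # (2 + 2 + 5 + 15) / 4 = 6.0
mse = np.mean((y_true - y_pred) ** 2)    # (4 + 4 + 25 + 225) / 4 = 64.5

print(mae, mse)  # the 15-point outlier dominates the MSE, not the MAE
```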
Accuracy
Accuracy is calculated as the number of all correct predictions divided by the total size of the dataset. The best accuracy is 1.0, whereas the worst is 0.0. It can also be calculated as 1 − error rate.
(TP + TN) / (P + N)
Error rate
It is the proportion of mistakes: the error rate is calculated as the number of all incorrect predictions divided by the total size of the dataset. The best error rate is 0.0, whereas the worst is 1.0.
(FP + FN) / (P + N)
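Both formulas in code, on invented confusion-matrix counts:

```python
# Accuracy and error rate from toy TP/TN/FP/FN counts.
TP, TN, FP, FN = 50, 40, 5, 5
total = TP + TN + FP + FN        # P + N, the size of the dataset

accuracy = (TP + TN) / total     # 0.9
error_rate = (FP + FN) / total   # 0.1, which is also 1 - accuracy

print(accuracy, error_rate)
```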
Predicting gender could use accuracy or the error rate as its evaluation metric. However, for flagging SPAM the error rate is preferred: if accuracy is 99 percent, you would probably report the 1 percent error rate instead. Is there any disadvantage? The error rate does not take into account whether a false negative is worse than a false positive.
Precision and recall
These metrics are a useful measure of prediction success when the classes are very imbalanced. In information retrieval, precision is a measure of result relevancy, while recall is a measure of how many truly relevant results are returned. Each metric focuses on one kind of mistake, and both are defined in terms of the sizes of certain sets (TP, FP, FN; a code sketch follows the definitions below).
Precision
The ratio of correctly predicted positive observations to the total predicted positive observations: TP / (TP + FP). (Of all passengers labeled as survived, how many actually survived? What fraction of flagged emails were real SPAM?)
Recall
The ratio of correctly predicted positive observations to all observations in the actual positive class: TP / (TP + FN). (Of all the passengers that truly survived, how many did we label as survived? What fraction of real SPAM was flagged as SPAM?)
True Positives (TP) = the correctly predicted positive values
True Negatives (TN) = the correctly predicted negative values
False Positives (FP) = when actual class is no and predicted class is yes
False Negatives (FN) = when actual class is yes but predicted class is no
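A sketch with scikit-learn on invented SPAM labels (1 = SPAM, 0 = non-SPAM); the labels are chosen so that the two metrics differ:

```python
# Precision and recall on toy labels.
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0]  # TP=2, FN=1 (missed SPAM), FP=2 (false flags)

print(precision_score(y_true, y_pred))  # TP/(TP+FP) = 2/4 = 0.5
print(recall_score(y_true, y_pred))     # TP/(TP+FN) = 2/3 ~= 0.67
```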
F-score
The harmonic mean of precision and recall, a kind of average also known as the F-measure: F1 = 2 * precision * recall / (precision + recall). This score takes both false positives and false negatives into account.
Fbeta
The parameter β quantifies how much more we care about recall than precision, giving the two different importance: Fβ = (1 + β²) * precision * recall / (β² * precision + recall). F0.5 would mean that we care half as much about recall as about precision. The β parameter determines the weight of recall in the combined score: β < 1 lends more weight to precision, while β > 1 favors recall (see the sketch below).
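A sketch of how β shifts the balance, reusing the toy labels from the precision/recall example (precision 0.5, recall 0.67):

```python
# F1 vs Fbeta on the same toy labels as above.
from sklearn.metrics import f1_score, fbeta_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0]

print(f1_score(y_true, y_pred))               # ~0.57, harmonic mean of P and R
print(fbeta_score(y_true, y_pred, beta=0.5))  # ~0.53, pulled toward precision (0.5)
print(fbeta_score(y_true, y_pred, beta=2.0))  # ~0.63, pulled toward recall (0.67)
```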
What is the difference between precision/recall, F-score and Fbeta?
F1 is usually more useful than accuracy, especially if you have an uneven class distribution. Accuracy works best
if false positives and false negatives have similar cost.
Macro-average (multi-class classification)
It computes the F-score per class and then averages: calculate the metric for each class independently and take the unweighted mean. This does not take label imbalance into account; rare classes have the same impact as frequent classes. This can be a good thing or a bad thing, depending on what you want.
Micro-average (multi-class classification)
This calculates metrics globally by counting the total number of times each class was correctly and incorrectly predicted. You do it on a case-by-case basis (see the sketch after this list):
Treat each correct prediction as a TP
Treat each missed classification as an FN
Treat each incorrect prediction as an FP
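A sketch contrasting the two averages on an invented three-class newspaper example; the rare class is deliberately predicted perfectly so the difference shows:

```python
# Macro vs micro averaging on toy multiclass predictions.
from sklearn.metrics import f1_score

y_true = ["politics", "politics", "politics", "sports", "sports", "science"]
y_pred = ["politics", "politics", "sports", "sports", "sports", "science"]

# Macro: per-class F-scores (politics 0.8, sports 0.8, science 1.0), then the
# unweighted mean -- the rare but perfect 'science' class pulls it up to ~0.87.
print(f1_score(y_true, y_pred, average="macro"))

# Micro: pool all predictions first (5 of 6 correct), then compute once: ~0.83.
print(f1_score(y_true, y_pred, average="micro"))
```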