Lecture 1. Introduction to Machine Learning
What is machine learning (ML) about?
ML is about automation of problem solving.
It is the study of computer algorithms that improve automatically through experience:
it involves becoming better at a task (T), based on some experience (E), with respect to
some performance measure (P).
Examples:
- Spam detection
- Movie recommendation
- Speech recognition
- Credit risk analysis
- Autonomous driving
- Medical diagnosis
What does it involve?
- ML involves a notion of generalization: we assume that current observations generalize
to future observations, so that the learned model works on unseen data (i.e., the data
points are assumed to be representative of real-world data).
- Annotated data, an objective, an optimization algorithm, features/representations, and
assumptions are some of the critical components.
Different types of learning
A good starting point:
- Supervised learning: annotated/labelled dataset/ground truth
Classification: discrete variable
Regression: continuous variable
- Unsupervised learning: unlabeled dataset
Clustering
SPAM versus non-SPAM
Binary classification problem
Learning process
- Find examples of two classes: SPAM and non-SPAM
- Come up with a learning algorithm
- A learning algorithm infers rules from examples, e.g. "If (A or B or C) and not D, then
SPAM" (decision trees work this way); see the sketch after this list
- These rules can then be applied to new data (emails)
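A minimal Python sketch of applying such an inferred rule; the keyword checks standing in for A, B, C and D are hypothetical, purely for illustration:

# Sketch of a rule of the form "if (A or B or C) and not D, then SPAM".
# The keyword features below are hypothetical, not a real spam filter.
def is_spam(email_text: str) -> bool:
    text = email_text.lower()
    a = "free money" in text        # feature A
    b = "click here" in text        # feature B
    c = "you are a winner" in text  # feature C
    d = "meeting agenda" in text    # feature D (typical of legitimate mail)
    return (a or b or c) and not d

print(is_spam("You are a winner, click here!"))  # True  -> SPAM
print(is_spam("Meeting agenda for Monday"))      # False -> non-SPAM

In a real system such rules are not written by hand but inferred from the labelled examples, e.g. by a decision-tree learner.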
Machine learning examples
- Recognize handwritten numbers and letters
- Recognize faces in photos
- Determine whether text expresses positive, negative or no opinion
- Guess person’s age based on a sample of writing
- Flag suspicious credit-card transactions (binary classification task)
- Recommend books and movies to users based on their own and others’ purchase
history
- Recognize and label mentions of person or organization names in text
Types of learning problems: Regression
- Response: a (real) number
- Predict person’s age
- Predict price of stock
- Predict student’s score on exam
Types of learning problems: Binary Classification
- Response: a YES/NO answer
- Detect SPAM
- Predict polarity of product review: positive vs negative expressions
Types of learning problems: Multiclass classification
More than two labels/classes; one way to solve this is by extending logistic regression to
multiple classes. A related learning problem is multi-label classification, where an
outcome can be linked with several labels at once (a data point need not have exactly one
correct label).
Response: one of a finite set of options
- Classify newspaper article as
o Politics, sports, science, technology, health, finance
- Detect species based on photo
o Passer domesticus, Calidris alba, Streptopelia decaocto, Corvus corax
- Assign songs to one or more genres:
o e.g. pop, R&B; a song may belong to several genres at once
Types of learning problems: Autonomous behavior
- Input: measurements from sensors – camera, microphone, radar, accelerometer
- Response: instructions for actuators (making the right decisions about steering,
accelerating, braking… we don't want to kill anyone on the road)
How well is the algorithm learning?
Evaluation: choose a baseline, choose a metric, and compare your learner against the baseline!
Different tasks, different metrics:
- Predicting age
- Flagging spam (imbalanced data)
Evaluation of Regression Problems (metrics)
- Mean Absolute Error (MAE) – the average absolute difference between the true value
y_n (ground truth) and the predicted value ŷ_n; it measures the average magnitude of
the errors in a set of predictions, without considering their direction.
- Mean Squared Error (MSE) – the average squared difference between the true value and
the predicted value; it is more sensitive to outliers and measures how close a fitted
line is to the data points.
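A minimal Python sketch of both metrics, assuming y_true and y_pred are equal-length lists of numbers (the age values below are hypothetical):

# Plain-Python MAE and MSE; y_true holds ground-truth values, y_pred the predictions.
def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [21, 34, 19, 45]   # hypothetical true ages
y_pred = [24, 30, 19, 60]   # hypothetical predicted ages
print(mean_absolute_error(y_true, y_pred))  # 5.5
print(mean_squared_error(y_true, y_pred))   # 62.5

Note how the single large error (15 years) dominates the MSE but not the MAE; this is exactly the outlier sensitivity mentioned above.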
Evaluation for Classification: Predicting SPAM
- Accuracy: measures how close the predictions are to the true values – the fraction of
data points that are classified correctly.
Accuracy = (TP + TN) / (TP + FP + TN + FN)
Error rate (misclassification rate) = (number of incorrect classifications) / (total number of data points)
Not informative if the data is imbalanced: e.g., if 99% of emails are not SPAM, a classifier
that always predicts not-SPAM already reaches 99% accuracy.
Incorrect classification
- False Positive (FP) – flagged as SPAM, but is not SPAM (the bigger issue for this
problem)
- False Negative (FN) – not flagged, but is SPAM
What about medical diagnosis? There a False Negative – a missed diagnosis – is usually the more serious mistake.
Correct classification
- True Positive (TP): SPAM classified as SPAM
- True Negative (TN): Not-SPAM classified as Not-SPAM
Precision and Recall
Metrics which focus on one kind of mistake:
- Precision: the fraction of positive-class predictions that actually belong to the
positive class (what fraction of flagged emails were real SPAM?)
Precision = True Positives / (True Positives + False Positives)
- Recall: the fraction of all positive examples in the dataset that are predicted as
positive (what fraction of real SPAMs were flagged?)
Recall = True Positives / (True Positives + False Negatives)
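A minimal Python sketch computing both metrics for the positive class SPAM; the gold and predicted labels below are hypothetical:

# Count TP, FP, FN for the positive class "SPAM" and derive precision and recall.
gold = ["SPAM", "SPAM", "OK", "OK", "SPAM", "OK"]    # hypothetical ground truth
pred = ["SPAM", "OK",   "OK", "SPAM", "SPAM", "OK"]  # hypothetical predictions

tp = sum(1 for g, p in zip(gold, pred) if g == "SPAM" and p == "SPAM")
fp = sum(1 for g, p in zip(gold, pred) if g != "SPAM" and p == "SPAM")
fn = sum(1 for g, p in zip(gold, pred) if g == "SPAM" and p != "SPAM")

precision = tp / (tp + fp)  # fraction of flagged emails that were real SPAM
recall = tp / (tp + fn)     # fraction of real SPAMs that were flagged
print(precision, recall)    # 2/3 and 2/3 for this toy data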
F-score / F-measure
The harmonic mean of precision and recall – a kind of average:
F1 = (2 · Precision · Recall) / (Precision + Recall)
Fβ
The parameter β quantifies how much more we care about recall than precision: when it is
greater than 1, recall is weighted more; when it is smaller than 1, precision is weighted
more.
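The general formula (the standard definition of Fβ, not written out above) is:
Fβ = (1 + β²) · Precision · Recall / (β² · Precision + Recall)
With β = 1 this reduces to the F1 score; β = 2 weights recall more heavily, β = 0.5 weights precision more heavily.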
Example 2. Multiclass classification
Data point (2) is FN for SPAM, FP for OK
Data point (4) is FN for PHISH, FP for SPAM
Precision: true positives over predicted positives
Recall: true positives over actual positives
- Macro-average: compute precision and recall per class, and average them:
P_S = 1/2, P_O = 1/2, P_P = 1/1
Macro-averaged precision = (P_S + P_O + P_P) / 3 = (1/2 + 1/2 + 1) / 3 = 2/3
- Rare classes have the same impact as frequent classes
Micro-average
- Micro-average looks at individual predictions rather than at whole classes.
- Weights each sample equally
- Aggregate the contributions of all classes to compute the average metric
- Micro-averaged precision is the sum of all true positives divided by the sum of all true
positives plus the sum of all false positives. So basically, you divide the number of
correctly identified predictions by the total number of predictions.
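A minimal Python sketch contrasting the two averages on hypothetical labels for the three classes SPAM, OK and PHISH:

# Macro- vs micro-averaged precision over three classes (toy, hypothetical data).
gold = ["SPAM", "SPAM", "OK", "OK", "PHISH", "OK"]    # ground truth
pred = ["SPAM", "OK",   "OK", "SPAM", "PHISH", "OK"]  # predictions
classes = ["SPAM", "OK", "PHISH"]

per_class_precision = []
total_tp = total_fp = 0
for c in classes:
    tp = sum(1 for g, p in zip(gold, pred) if p == c and g == c)
    fp = sum(1 for g, p in zip(gold, pred) if p == c and g != c)
    per_class_precision.append(tp / (tp + fp) if (tp + fp) else 0.0)
    total_tp += tp
    total_fp += fp

macro = sum(per_class_precision) / len(classes)  # each class counts equally
micro = total_tp / (total_tp + total_fp)         # each prediction counts equally
print(macro, micro)  # ≈ 0.72 (macro) vs ≈ 0.67 (micro) on this toy data

Because the macro-average gives the rare class PHISH the same weight as the frequent class OK, the two numbers differ; libraries such as scikit-learn expose the same choice via an average= parameter on precision_score.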