INTRODUCTION TO MACHINE LEARNING
CONTENT
Introduction 2
Terminology 2
Pre-processing 3
Data scaling 3
Missing value imputation 3
Categorical variables 3
Tuning parameters 3
K-Nearest Neighbours (KNN) 4
KNN classifiers 4
KNN regressions 4
KNN strengths, weaknesses, parameters 4
Nearest centroid 4
Linear regression 5
OLS regression 5
Solving OLS minimization 5
Reducing the magnitude of coefficients 5
Logistic regression 6
Logistic regression 6
Multiclass classifications 7
Neural networks / Multi-layer perceptrons 8
The neural network (NN) approach 8
Popular activation functions 8
Support vector machines 10
Linear support vector machines (SVM) 10
Kernel SVMs 11
Similarity (kernel) functions 11
SVM regression 11
Naïve Bayes and decision trees 12
Naïve Bayes 12
Decision trees 13
Model complexity 13
Decision tree regressions 13
Ensemble learning, boosting, random forests 14
Ensemble learners 14
Voting (heterogeneous) 14
Bagging (homogeneous) 14
Boosting (homogeneous) 14
AdaBoost 15
Gradient boosting 15
Stacking (heterogeneous) 15
Forest models 16
Random forest 16
Model evaluation and learning with imbalanced data 17
Bias + variance + irreducible error 17
Evaluate using training and test sets 17
Evaluate using cross-validation 17
Evaluation of binary classification 18
Imbalanced data 18
Feature engineering 19
Working with images 19
Transforming text data 19
Feature selection 19
Dimensionality reduction 20
PCA (principal component analysis) 20
Computing PCA 21
PCA in higher dimensions 21
Non-negative matrix factorization 21
Manifold learning: for data visualization 22
t-distributed Stochastic Neighbor Embedding 22
Clustering 23
K-means clustering 23
MiniBatchKMeans 23
Feature extraction using K-means 23
Hierarchical clustering 24
Agglomerative hierarchical clustering techniques 24
Pros and cons 24
Density-based clustering methods 24
Density-based spatial clustering of applications with noise (DBSCAN) 24
Mixture models 25
Gaussian mixture models 25
Evaluating clustering results 25
Silhouette coefficient 25
Comparing different models 26
Choosing models 26
What the models look like 27
Examples 27
INTRODUCTION
Machine learning Extracting knowledge from data, at the intersection of statistics, AI, and computer science.
It is used when we need to make sense of unstructured data, to predict values, or to learn something previously unknown.
SUPERVISED LEARNING
Algorithms that learn from input/output pairs. This is used to automate manual labor.
Given 𝑫 = {𝑿𝒊, 𝒀𝒊}, the model will learn 𝑭: 𝑿𝒌 → 𝒀𝒌

UNSUPERVISED LEARNING
Only the input data is known, no output data (labels) is provided. It can be useful for outlier detection.
Given 𝑫 = {𝑿𝒊}, group/cluster the data into 𝑭: 𝑿𝒊 → 𝒀𝒋

REINFORCEMENT LEARNING
Reasoning under uncertainty for optimal decisions. How agents should take actions in an environment to maximize a reward.
Given 𝑫 = {𝒆𝒏𝒗𝒊𝒓𝒐𝒏𝒎𝒆𝒏𝒕 (𝒆), 𝒂𝒄𝒕𝒊𝒐𝒏 (𝒂), 𝒓𝒆𝒘𝒂𝒓𝒅 (𝒓)}, learn policy and utility functions:
policy 𝑭𝟏: {𝒆, 𝒓} → 𝒂 and utility 𝑭𝟐: {𝒂, 𝒆} → 𝒓
SEMI-SUPERVISED
Combine supervised and unsupervised models. This is useful when only a part of the data is labelled.
e.g. Based on past information on spam emails, you can filter new incoming mails into Inbox and Spam.
ACTIVE
Combine supervised and reinforcement models. You get feedback from the model.
e.g. Speech-automated systems are trained on your voice and then start working based on this training.
TERMINOLOGY
Label/class The target variable (𝒚) of an instance (datapoint)
Features/attributes The input data (𝑿). The attribute values are feature values, summarized in a feature vector
Model An equation that links the values of features to the predicted value of the target variable
Generalization When a model can make accurate predictions on unseen data, it can generalize from the
training set to the test set.
Score functions Also fit statistics or score metrics. Measures how well the model fits the data
Feature selection Reduce the number of predictors by selecting the important ones (dimensionality reduction)
Feature extraction Reduce the number of predictors by means of mathematical operations (PCA)
Structured data Highly organized data, made up of mostly tables with rows and columns
Unstructured data Unorganized data, for example texts, images etc.
Classification Discrete output, it predicts a class label. Train to find decision boundaries to separate classes
Regression Continuous output, it predicts a value. Train to fit the data and describe relations
One vs rest An approach to use binary classification algorithms on multiclass datasets.
A separate model is learned for each class. For a prediction, all classifiers are run on the test
point and the one with the highest score wins (see the sketch after this table).
Pipelines Create a workflow that can execute a sequence of tasks at once
Parameters Variables that are learned during the training of the model
Hyperparameters Variables of which the value is set prior to training the model
Overfitting A model that is too complex for the available data. It fits too closely to the training set,
including its noise, and cannot generalize to new data.
It can be detected by evaluating on separate test data.
Underfitting A model that is too simple for the available data. It will underperform on both the training and
testing sets. Not all aspects and variability in the data are captured by the model.
Dataset size This is intimately tied to model complexity: more data can support more complex and
accurate models with a lower risk of overfitting.
Intuition derived from datasets with few features (low-dimensional datasets) might not hold up
in high-dimensional datasets.
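To make the one-vs-rest approach concrete, here is a minimal sketch with scikit-learn (assumed available); the synthetic three-class dataset and the choice of logistic regression as base classifier are illustrative only, not taken from these notes.

# One-vs-rest: one binary classifier per class; the highest-scoring class wins.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import train_test_split

# Synthetic 3-class dataset (illustrative only)
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One binary logistic regression is fitted per class
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X_train, y_train)
print(len(ovr.estimators_))        # 3 binary classifiers, one per class
print(ovr.score(X_test, y_test))   # accuracy on unseen data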
Manually crafting decision rules has two major disadvantages:
1. The required logic is specific to a single domain and task.
Slight changes in the task can require a rewrite of the whole system.
2. Designing rules requires a deep understanding of how a decision should be made by a human expert.
Computers and humans resolve problems differently; this can cause issues when making up the rules.
Presenting a lot of data to the computer, after which it can determine the rules by itself, can resolve this issue.
PRE-PROCESSING
DATA SCALING
Machine learning algorithms don’t perform well when the input numerical attributes vary widely in scale.
STANDARD SCALER
This is good for non-skewed data.
𝒙 = 𝒛 = (𝒙 − 𝝁) / 𝝈

ROBUST SCALER
Less sensitive to skewed data and outliers. The median value is now indicated by 0.
𝒙 = (𝒙 − 𝒎𝒆𝒅𝒊𝒂𝒏) / 𝑰𝑸𝑹

MIN-MAX SCALER
Shift the data to an interval set by 𝒙𝒎𝒊𝒏 and 𝒙𝒎𝒂𝒙, usually [0, 1].
𝒙 = (𝒙 − 𝒙𝒎𝒊𝒏) / (𝒙𝒎𝒂𝒙 − 𝒙𝒎𝒊𝒏)
NORMALIZER
Rows are rescaled such that their norm is 1. This is useful when just the direction of the data matters.
Compute the norm √(∑ 𝒆𝒍𝒆𝒎𝒆𝒏𝒕𝒔²) and divide each element by this norm.
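As a rough illustration of the four scalers above, a minimal sketch with scikit-learn (assumed available); the random toy data is made up.

# The four scalers above via scikit-learn (assumed installed).
import numpy as np
from sklearn.preprocessing import (StandardScaler, RobustScaler,
                                   MinMaxScaler, Normalizer)

rng = np.random.RandomState(0)
X = rng.lognormal(size=(100, 3))               # skewed, positive toy data

X_std    = StandardScaler().fit_transform(X)   # (x - mean) / std per column
X_robust = RobustScaler().fit_transform(X)     # (x - median) / IQR per column
X_minmax = MinMaxScaler().fit_transform(X)     # maps each column to [0, 1]
X_norm   = Normalizer().fit_transform(X)       # rescales each ROW to unit norm

print(X_minmax.min(axis=0), X_minmax.max(axis=0))   # ~0 and ~1
print(np.linalg.norm(X_norm, axis=1)[:5])            # all ~1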
Univariate transformations Most models perform best with Gaussian distributed data.
Methods to transform data to Gaussian include Box-Cox and Yeo-Johnson.
Both estimate the best power transformation to get a Gaussian distribution.
Yeo-Johnson can work with negative numbers but is less interpretable.
Binning Separate the values into 𝒏 categories. All values within one category are replaced
by e.g. the mean. This is effective for models with few parameters (e.g. regression),
but not for models with many parameters (e.g. decision trees)
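A minimal sketch of both ideas, assuming scikit-learn's PowerTransformer and KBinsDiscretizer are available; the toy data and the choice of four equal-width bins are illustrative only.

# Power transformations (Box-Cox / Yeo-Johnson) and binning with scikit-learn.
import numpy as np
from sklearn.preprocessing import PowerTransformer, KBinsDiscretizer

rng = np.random.RandomState(0)
X = rng.exponential(size=(200, 1))             # skewed, strictly positive toy feature

# Box-Cox requires strictly positive values; Yeo-Johnson also accepts zero and negatives.
X_boxcox = PowerTransformer(method="box-cox").fit_transform(X)
X_yeojohnson = PowerTransformer(method="yeo-johnson").fit_transform(X - 1.0)

# Binning: split the feature into n categories (here 4 equal-width bins),
# encoded as ordinal bin indices.
binner = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="uniform")
X_binned = binner.fit_transform(X)
print(np.unique(X_binned))                     # bin labels that occur (at most 0..3)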
MISSING VALUE IMPUTATION
There are different methods for missing value imputation, the best fit depends on the situation and the data.
Missing value imputation Pre-processing focussed on missing values. Missing data is common in the real world.
Imputation replaces the missing value with an estimate for that value.
Common ways are: Mean/median, KNN, model-driven, or iterative
Mean imputation The mean value of the column is taken. This is not very precise.
KNN imputation The mean of the K nearest neighbours (found using the remaining columns) is taken. This is more flexible.
Model-driven imputation A regression model predicts the expected value given the values that are known.
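A rough sketch of mean, KNN, and model-driven (iterative) imputation with scikit-learn (assumed available); the tiny array with missing entries is made up.

# Three imputation strategies on the same toy data.
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer  # required to unlock IterativeImputer
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [5.0, 6.0],
              [np.nan, 8.0]])

X_mean = SimpleImputer(strategy="mean").fit_transform(X)    # column mean
X_knn  = KNNImputer(n_neighbors=2).fit_transform(X)         # mean of 2 nearest rows
X_iter = IterativeImputer(random_state=0).fit_transform(X)  # regression per column

print(X_mean, X_knn, X_iter, sep="\n\n")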
CATEGORICAL VARIABLES
Data regularly has categorical/discrete features. It is often necessary to represent these as numbers.
There are two main methods:
1. One Hot encoding
Each category of the initial feature becomes its own dummy feature.
This is popular when there are few categories in the feature. This does not imply order within the feature
2. Count-based encoding
For categorical features with high cardinality (many distinct categories). Each category label is replaced by an aggregated value computed for that category, e.g. how often it occurs.
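A minimal sketch of both encodings using pandas (assumed available); the example column is made up.

# One-hot and count-based encoding with pandas.
import pandas as pd

df = pd.DataFrame({"city": ["Paris", "Berlin", "Paris", "Rome", "Paris"]})

# 1. One-hot encoding: one dummy column per category, no implied order.
one_hot = pd.get_dummies(df["city"], prefix="city")

# 2. Count-based encoding: replace each category by how often it occurs.
counts = df["city"].value_counts()
df["city_count"] = df["city"].map(counts)

print(one_hot)
print(df)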
TUNING PARAMETERS
This is never done on the test set.
Use the training set to estimate the coefficients (parameters) for different values of the hyperparameters.
Use the validation set to choose the best hyperparameter values (e.g. the degree of a polynomial) by evaluating each fitted model on this second set.
Use the test set to check how well the final model generalizes to unseen data.
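A minimal sketch of this three-way split, assuming scikit-learn; it mirrors the polynomial-degree example from the text on synthetic data.

# Train / validation / test split for tuning one hyperparameter (polynomial degree).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

# 60% train, 20% validation, 20% test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

best_degree, best_score = None, -np.inf
for degree in range(1, 10):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)            # coefficients estimated on the training set
    score = model.score(X_val, y_val)      # hyperparameter chosen on the validation set
    if score > best_score:
        best_degree, best_score = degree, score

final = make_pipeline(PolynomialFeatures(best_degree), LinearRegression())
final.fit(X_train, y_train)
print(best_degree, final.score(X_test, y_test))  # test set used only once, at the end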