Lecture 14 - Supervised learning: model evaluation
Example:
Even though only a limited number of data points were available, the more complex models still fit unintended patterns in the data, resulting in overfitting and potentially poor predictions on new data points.
Bias: In the context of model evaluation, bias refers to the error stemming from erroneous assumptions in the learning algorithm that restrict it from accurately capturing the
underlying patterns in the data. A model with high bias pays little attention to the training
data and oversimplifies the underlying patterns, leading to underfitting. This results in
consistently inaccurate predictions, even when trained on different sets of data. A common
symptom of bias is poor performance on both the training set and the testing set.
Variance: Variance, on the other hand, refers to the error due to the model's sensitivity to
small fluctuations or noise in the training dataset. A high variance model, often seen in
overfitting scenarios, performs exceptionally well on the training data but poorly on the
testing data. This indicates that the model is capturing noise and random fluctuations rather
than the underlying true patterns. The model fails to generalize to new data points, causing
large fluctuations in the predicted values.
An unbiased model gives the correct prediction, on average over samples from the target
population. High bias models typically perform poorly on both training and testing data. They
are unable to capture the underlying patterns, resulting in systematic errors.
High variance models tend to perform well on training data but poorly on testing data,
indicating an overfitted model that fails to generalize.
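The symptoms described above (high bias: poor on both sets; high variance: good on train, poor on test) can be seen in a small simulation. This is a minimal sketch, not from the lecture: the data (noisy sine curve), sample sizes, and polynomial degrees are all assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumed for illustration): y = sin(x) + noise
x_train = rng.uniform(0, 3, 30)
y_train = np.sin(x_train) + rng.normal(0, 0.2, 30)
x_test = rng.uniform(0, 3, 30)
y_test = np.sin(x_test) + rng.normal(0, 0.2, 30)

def poly_mse(degree):
    """Fit a polynomial of the given degree on the training set; return (train MSE, test MSE)."""
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    return train_mse, test_mse

train_lo, test_lo = poly_mse(1)    # high bias: poor on BOTH sets
train_hi, test_hi = poly_mse(12)   # high variance: great on train, worse on test

print(f"degree 1:  train MSE={train_lo:.3f}, test MSE={test_lo:.3f}")
print(f"degree 12: train MSE={train_hi:.3f}, test MSE={test_hi:.3f}")
```

The degree-1 model shows the bias signature (both errors high and similar), while the degree-12 model shows the variance signature (a large gap between train and test error).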
Bias-Variance Tradeoff:
The bias-variance tradeoff occurs because as model complexity* increases, the model tends
to capture more detailed patterns in the data, leading to a reduction in bias. However, this
often results in an increase in variance as the model starts fitting to noise or irrelevant
patterns present in the training data. Finding the right balance between bias and variance is
an important aspect of building models in machine learning.
*What is model complexity? --> In this context, complexity refers to how much information
in the data is absorbed into the model or how much compression is performed on the data
by the model. It also refers to the number of effective parameters relative to the effective
degrees of freedom in the data.
1. Does the bias-variance tradeoff occur with n = 5? With such a tiny dataset there is
little room to trade bias for variance: the model simply does not have enough data to
capture the underlying patterns, so even modest complexity produces very high
variance, while a model simple enough to be stable carries high bias. Both error
components can be large at once.
2. Does the bias-variance tradeoff occur with n = 5,000,000,000? With a very large
dataset the tradeoff is much less pronounced: there is enough information to estimate
even a flexible model's parameters stably, so variance shrinks toward zero. What
remains is the bias of the chosen model class and the irreducible noise, so a
sufficiently flexible model can drive the overall error down.
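The effect of sample size on variance can be checked directly: fit the same model many times on fresh samples and watch how much the estimate fluctuates. A minimal sketch, with the data-generating process (y = 2x + noise), trial count, and sample sizes all assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def slope_estimates(n, trials=200):
    """Repeatedly fit y = 2x + noise with n points; collect the estimated slope."""
    slopes = np.empty(trials)
    for t in range(trials):
        x = rng.uniform(0, 1, n)
        y = 2 * x + rng.normal(0, 1, n)
        slopes[t] = np.polyfit(x, y, 1)[0]  # fitted slope for this sample
    return slopes

var_small = slope_estimates(5).var()      # tiny sample: the fit swings wildly
var_large = slope_estimates(5000).var()   # huge sample: the fit is stable

print(f"slope variance with n=5:    {var_small:.3f}")
print(f"slope variance with n=5000: {var_large:.6f}")
```

The variance of the fitted slope drops by roughly a factor of n, which is why variance stops being the binding constraint at very large sample sizes.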
Population mean squared error = squared bias + model variance + irreducible variance, i.e.

E(MSE) = Bias² + Variance + σ²
> The bias is squared in the expected mean squared error (MSE) because the model
variance and the irreducible variance are already on a squared scale; squaring the
bias puts all three terms in the same units, so they can be compared and added
directly when evaluating the overall error of the model's predictions.
> The E means "on average over samples from the target population."
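The decomposition can be verified by Monte Carlo: draw many training sets, fit a deliberately simple model on each, and compare bias² + variance + σ² against the directly simulated population MSE at one point. Everything here (the sine target, noise level, evaluation point x0 = 1.5) is an assumed toy setup, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 0.5                      # noise sd; sigma**2 is the irreducible variance
x0, f_x0 = 1.5, np.sin(1.5)      # evaluation point and its true (noise-free) value
trials = 5000

preds = np.empty(trials)
for t in range(trials):
    # A fresh training sample from the target population on every trial
    x = rng.uniform(0, 3, 20)
    y = np.sin(x) + rng.normal(0, sigma, 20)
    coefs = np.polyfit(x, y, 1)              # a deliberately simple (biased) model
    preds[t] = np.polyval(coefs, x0)

bias_sq = (preds.mean() - f_x0) ** 2          # squared bias
variance = preds.var()                        # model variance
decomposed = bias_sq + variance + sigma ** 2  # bias² + variance + irreducible

# Direct Monte Carlo estimate of the population MSE at x0
y_new = f_x0 + rng.normal(0, sigma, trials)
direct = np.mean((preds - y_new) ** 2)

print(f"bias^2={bias_sq:.3f}  variance={variance:.3f}  irreducible={sigma**2:.3f}")
print(f"decomposed={decomposed:.3f}  direct={direct:.3f}")
```

Up to Monte Carlo error, the two numbers agree: the population MSE really is the sum of the three terms.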
The train-val-test paradigm (the train-validation-test split)
1. Training data: refers to the set of observations that are used to train, fit, or estimate
the parameters of a machine learning model, denoted as 𝑓′(𝑥). These data points are
fed into the model during the learning phase, allowing the model to learn the
underlying patterns and relationships present in the data.
2. Validation Data (or "Dev" Data): also known as development data, is a separate
dataset that is used during the model development phase to fine-tune the model's
hyperparameters and assess its performance. This dataset consists of new
observations from the same source as the training data, but it is not used during the
model training process. Instead, it is employed multiple times to select the optimal
model complexity, hyperparameters, or other settings that lead to improved model
performance.
3. Test Data: The test data is an independent dataset that the model has never
encountered during the training or validation phase. It serves as a final checkpoint to
evaluate the model's performance and generalization ability on completely unseen
data.
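The three-way split described above can be sketched in a few lines. This is a minimal illustration with assumed shapes and an assumed 60/20/20 split ratio; the key point is that the indices are disjoint and the test portion is set aside until the very end.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
X = rng.normal(size=(n, 4))   # toy feature matrix (shapes assumed for illustration)
y = rng.normal(size=n)

# Shuffle once, then carve out 60% train / 20% validation / 20% test
idx = rng.permutation(n)
train_idx, val_idx, test_idx = idx[:600], idx[600:800], idx[800:]

X_train, y_train = X[train_idx], y[train_idx]   # used to fit model parameters
X_val, y_val = X[val_idx], y[val_idx]           # used repeatedly for tuning
X_test, y_test = X[test_idx], y[test_idx]       # touched only once, at the end

print(len(X_train), len(X_val), len(X_test))
```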
The average squared error in the test set, denoted MSE_test, is a good estimate of the
expected (population) error E(MSE) of the final model. This is distinct from the "Bayes
error," the lowest error any model could possibly achieve on the problem; E(MSE)
approaches the Bayes error only if the model perfectly captures the underlying data
distribution.
Drawbacks of train/dev/test:
• the validation estimate of the test error can be highly variable, depending on
precisely which observations are included in the training set and which observations
are included in the validation set.
• In the validation approach, only a subset of the observations — those that are
included in the training set rather than in the validation set — are used to fit the
model.
• This suggests that the validation set error may tend to overestimate the test error for
the model fit on the entire data set.
> This is why we use… cross-validation!
Cross-validation serves as an alternative to the single development set (dev set) approach.
Instead of having a single fixed validation set, the cross-validation method performs the
train/dev split multiple times, allowing for a more comprehensive assessment of the model's
performance across different subsets of the data.
With K-fold cross-validation, the dataset is divided into K subsets, with each subset used
once as the validation set while the remaining K-1 subsets are used as the training set. This
process is repeated K times, with each of the K subsets used exactly once as the validation
data. The results from each iteration are then averaged to provide an overall performance
estimate.
• When K = n, “leave-one-out”;
• Usually K = 5 or K = 10
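The K-fold procedure above can be written out by hand in a few lines. A minimal sketch using toy data (noisy sine curve) and polynomial models, all assumed for illustration; in practice one would typically use a library helper such as scikit-learn's KFold.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 3, 100)
y = np.sin(x) + rng.normal(0, 0.2, 100)   # toy data, assumed for illustration

def kfold_mse(degree, K=5):
    """Average validation MSE over K folds for a polynomial of the given degree."""
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, K)        # K roughly equal, disjoint subsets
    errors = []
    for k in range(K):
        val = folds[k]                                     # fold k is the dev set
        train = np.concatenate(folds[:k] + folds[k + 1:])  # remaining K-1 folds
        coefs = np.polyfit(x[train], y[train], degree)
        errors.append(np.mean((np.polyval(coefs, x[val]) - y[val]) ** 2))
    return np.mean(errors)

for d in (1, 3, 10):
    print(f"degree {d}: CV MSE = {kfold_mse(d):.3f}")
```

Averaging over folds gives a more stable estimate of test error than a single train/dev split, at the cost of fitting the model K times.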
Common task framework (CTF)
a.k.a. “benchmarking”
(a) A publicly available training dataset
(b) A set of enrolled competitors whose common task is to infer a class prediction rule from
the training data.
(c) A scoring referee, to which competitors can submit their prediction rule.
• The referee runs the prediction rule against a testing dataset, which is sequestered
behind a Chinese wall.
• The referee objectively and automatically reports the score achieved by the
submitted rule.
In short, a benchmark is a shared task and dataset used to compare the performance of different models.
Advantages
1. Error rates decline by a fixed percentage each year, to an asymptote depending on task
and data quality.
2. Progress usually comes from many small improvements; a change of 1% can be a reason to
break out the champagne.
3. Shared data plays a crucial role—and is reused in unexpected ways.
Kaggle.com is a great example of the CTF, because its entire business model is hosting competitions to see who can build the best predictive model by a set deadline.