Very Complete Lecture Notes: Data Wrangling and Data Analysis (INFOMWDR)

This document contains a full elaboration of the lectures from the second part of the course Data Wrangling & Data Analysis, part of the Master Applied Data Science (UU). I have worked out the lectures in a way that makes them easier to understand, and added useful information to help you make sense of th...

  • 23 November 2023
  • 60 pages
  • 2023/2024
  • Lecture notes
  • Hakim Qahtan, Daniel Oberski
  • Lectures 14 through 25 (second half)
Author: jitskelanser1
Lecture 14 - Supervised learning: model evaluation

Example: (figure omitted)

Despite only a limited number of data points being available, the more complex models still
fit unintended patterns in the data, resulting in overfitting and potentially poor
predictions for new data points.

Bias: In the context of model evaluation, bias refers to the error stemming from erroneous
assumptions in the learning algorithm that restrict it from accurately capturing the
underlying patterns in the data. A model with high bias pays little attention to the training
data and oversimplifies the underlying patterns, leading to underfitting. This results in
consistently inaccurate predictions, even when trained on different sets of data. A common
symptom of bias is poor performance on both the training set and the testing set.

Variance: Variance, on the other hand, refers to the error due to the model's sensitivity to
small fluctuations or noise in the training dataset. A high variance model, often seen in
overfitting scenarios, performs exceptionally well on the training data but poorly on the
testing data. This indicates that the model is capturing noise and random fluctuations rather
than the underlying true patterns. The model fails to generalize to new data points, causing
large fluctuations in the predicted values.

An unbiased model gives the correct prediction, on average over samples from the target
population. High bias models typically perform poorly on both training and testing data. They
are unable to capture the underlying patterns, resulting in systematic errors.

High variance models tend to perform well on training data but poorly on testing data,
indicating an overfitted model that fails to generalize.
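The contrast can be shown in a small simulation. This is only a sketch: the sine function, the noise level, and the polynomial degrees are illustrative choices, not taken from the lectures. A degree-1 polynomial underfits (high bias: poor on both sets), while a degree-12 polynomial overfits (high variance: excellent on training, poor on test).

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

# A small noisy sample from a smooth underlying function.
x_train = np.linspace(0, 1, 15)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.3, x_train.size)
x_test = np.linspace(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.3, x_test.size)

def mse(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    fit = Polynomial.fit(x_train, y_train, degree)  # rescales x internally
    train_err = np.mean((y_train - fit(x_train)) ** 2)
    test_err = np.mean((y_test - fit(x_test)) ** 2)
    return train_err, test_err

for degree in (1, 3, 12):  # underfit, reasonable, overfit
    tr, te = mse(degree)
    print(f"degree {degree:2d}: train MSE {tr:.3f}, test MSE {te:.3f}")
```

Training error always falls as complexity grows (the simpler models are nested inside the more complex ones), but test error is U-shaped: high at both extremes, lowest in between.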

Bias-Variance Tradeoff:
The bias-variance tradeoff occurs because as model complexity* increases, the model tends
to capture more detailed patterns in the data, leading to a reduction in bias. However, this
often results in an increase in variance as the model starts fitting to noise or irrelevant
patterns present in the training data. Finding the right balance between bias and variance is
an important aspect of building models in machine learning.

*What is model complexity? --> In this context, complexity refers to how much information
in the data is absorbed into the model or how much compression is performed on the data
by the model. It also refers to the number of effective parameters relative to the effective
degrees of freedom in the data.

1. Does the bias-variance tradeoff occur with n = 5? With a small dataset size, the bias-
variance tradeoff may not be as pronounced. In such cases, the model might not have
enough data to capture the underlying patterns accurately, leading to both high bias
and high variance.

2. Does the bias-variance tradeoff occur with n = 5,000,000,000? With a significantly
large dataset size, the bias-variance tradeoff might be less pronounced. The large
dataset size provides the model with sufficient information to learn the underlying
patterns accurately, reducing both bias and variance.




Population mean squared error = squared bias + model variance + irreducible
variance.

> The bias is squared in the bias-variance tradeoff and in the calculation of the
expected mean squared error (MSE) because the model variance and the
irreducible variance are both squared quantities; squaring the bias allows it to be
compared and combined directly with the variance terms when evaluating the
overall error in the model's predictions.
> The E means “on average over samples from the target population.”
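The decomposition can be checked numerically. The sketch below uses an illustrative setup (a sine population, a deliberately too-simple linear model; none of these specifics come from the lectures): it repeatedly draws fresh training samples from the population, fits the model, and compares the average squared error at one point x0 against bias² + variance + irreducible variance.

```python
import numpy as np

rng = np.random.default_rng(42)

def true_f(x):
    return np.sin(2 * np.pi * x)

noise_sd = 0.3       # irreducible noise level
x0 = 0.25            # the point at which we decompose the error
n, reps = 30, 2000   # training-set size, number of repeated samples

preds = np.empty(reps)
sq_errors = np.empty(reps)
for r in range(reps):
    # Draw a fresh training sample from the target population.
    x = rng.uniform(0, 1, n)
    y = true_f(x) + rng.normal(0, noise_sd, n)
    coefs = np.polyfit(x, y, 1)          # a high-bias (linear) model
    preds[r] = np.polyval(coefs, x0)
    # A fresh test observation at x0.
    y0 = true_f(x0) + rng.normal(0, noise_sd)
    sq_errors[r] = (y0 - preds[r]) ** 2

bias_sq = (preds.mean() - true_f(x0)) ** 2   # squared bias at x0
variance = preds.var()                        # model variance at x0
lhs = sq_errors.mean()                        # Monte-Carlo estimate of E(MSE)
rhs = bias_sq + variance + noise_sd ** 2
print(f"E(MSE) ~ {lhs:.3f};  bias^2 + variance + noise^2 ~ {rhs:.3f}")
```

The two estimates agree up to Monte-Carlo error; for this deliberately rigid model the squared-bias term dominates the variance term, as the underfitting discussion above predicts.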


The train-val-test paradigm (the train-validation-test split)

1. Training data: refers to the set of observations that are used to train, fit, or estimate
the parameters of a machine learning model, denoted as 𝑓′(𝑥). These data points are
fed into the model during the learning phase, allowing the model to learn the
underlying patterns and relationships present in the data.

2. Validation Data (or "Dev" Data): also known as development data, is a separate
dataset that is used during the model development phase to fine-tune the model's
hyperparameters and assess its performance. This dataset consists of new
observations from the same source as the training data, but it is not used during the

model training process. Instead, it is employed multiple times to select the optimal
model complexity, hyperparameters, or other settings that lead to improved model
performance.


3. Test Data: The test data is an independent dataset that the model has never
encountered during the training or validation phase. It serves as a final checkpoint to
evaluate the model's performance and generalization ability on completely unseen
data.
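A minimal sketch of the three-way split in plain NumPy. The 60/20/20 proportions and the synthetic data are a common convention used here for illustration, not something prescribed by the lectures:

```python
import numpy as np

rng = np.random.default_rng(7)

# A hypothetical dataset of 1,000 observations with 4 features.
X = rng.normal(size=(1000, 4))
y = rng.normal(size=1000)

# Shuffle once, then carve off 60% train / 20% validation / 20% test.
idx = rng.permutation(len(X))
n_train, n_val = int(0.6 * len(X)), int(0.2 * len(X))
train_idx = idx[:n_train]
val_idx = idx[n_train:n_train + n_val]
test_idx = idx[n_train + n_val:]

X_train, y_train = X[train_idx], y[train_idx]   # fit the model here
X_val, y_val = X[val_idx], y[val_idx]           # tune hyperparameters here
X_test, y_test = X[test_idx], y[test_idx]       # touch only once, at the end

print(len(X_train), len(X_val), len(X_test))
```

Shuffling before slicing matters: it guarantees the three sets are disjoint random draws from the same source, which is exactly the assumption behind using validation error as a stand-in for test error.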




The average squared error in the test set, denoted as MSEtest, is a good estimate of the
expected prediction error on new data, denoted as E(MSE). Note that this is distinct from
the Bayes error, which is the lowest possible error that could be achieved for a given
problem, attained only by a model that perfectly captures the underlying data distribution.

Drawbacks of train/dev/test:
• the validation estimate of the test error can be highly variable, depending on
precisely which observations are included in the training set and which observations
are included in the validation set.
• In the validation approach, only a subset of the observations — those that are
included in the training set rather than in the validation set — are used to fit the
model.
• This suggests that the validation set error may tend to overestimate the test error for
the model fit on the entire data set.

> This is why we use… cross-validation!
Cross-validation serves as an alternative to the single development set (dev set) approach.
Instead of having a single fixed validation set, the cross-validation method performs the
train/dev split multiple times, allowing for a more comprehensive assessment of the model's
performance across different subsets of the data.

With K-fold cross-validation, the dataset is divided into K subsets, with each subset used
once as the validation set while the remaining K-1 subsets are used as the training set. This

process is repeated K times, with each of the K subsets used exactly once as the validation
data. The results from each iteration are then averaged to provide an overall performance
estimate.

• When K = n, this is called “leave-one-out” cross-validation;
• Usually K = 5 or K = 10 is used.
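The procedure above can be sketched in a few lines of NumPy (the toy regression data is illustrative; `np.array_split` handles the case where n is not exactly divisible by K):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy regression data: y is a noisy linear function of x.
x = rng.uniform(0, 1, 100)
y = 2 * x + rng.normal(0, 0.1, 100)

def kfold_mse(x, y, K, degree):
    """Average validation MSE of a degree-`degree` polynomial over K folds."""
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, K)        # K roughly equal-sized subsets
    scores = []
    for k in range(K):
        val = folds[k]                    # fold k is the validation set...
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        coefs = np.polyfit(x[train], y[train], degree)
        pred = np.polyval(coefs, x[val])
        scores.append(np.mean((y[val] - pred) ** 2))
    return np.mean(scores)               # ...and the K scores are averaged

print(f"5-fold CV MSE (linear fit): {kfold_mse(x, y, K=5, degree=1):.4f}")
```

Because every observation serves as validation data exactly once, the averaged score uses all the data, which addresses the drawbacks of the single train/dev split listed above.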

Common task framework (CTF)
a.k.a. “benchmarking”

(a) A publicly available training dataset
(b) A set of enrolled competitors whose common task is to infer a class prediction rule from
the training data.
(c) A scoring referee, to which competitors can submit their prediction rule.
• The referee runs the prediction rule against a testing dataset, which is sequestered
behind a Chinese wall.
• The referee objectively and automatically reports the score achieved by the
submitted rule.

In short, a benchmark is a shared task and dataset used to compare the performance of different models.
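A toy version of the scoring referee can make the setup concrete. All names here are hypothetical, and the real CTF keeps the test labels physically sequestered, which a single script can only imitate by convention:

```python
import numpy as np

rng = np.random.default_rng(3)

# Sequestered test set: competitors never see these labels.
_X_test = rng.normal(size=(200, 2))
_y_test = (_X_test[:, 0] + _X_test[:, 1] > 0).astype(int)

def referee(prediction_rule):
    """Run a submitted rule against the hidden test set; report only the score."""
    preds = prediction_rule(_X_test)
    return float(np.mean(preds == _y_test))

# A competitor submits a class prediction rule inferred from (public) training data.
def my_rule(X):
    return (X.sum(axis=1) > 0).astype(int)

print(f"accuracy: {referee(my_rule):.3f}")
```

The key property is that competitors interact only with `referee`: they see an objective, automatically reported score, never the test labels themselves.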

Advantages
1. Error rates decline by a fixed percentage each year, to an asymptote depending on task
and data quality.
2. Progress usually comes from many small improvements; a change of 1% can be a reason to
break out the champagne.
3. Shared data plays a crucial role—and is reused in unexpected ways.

Kaggle.com is a great example of the CTF, because its entire business model is hosting
competitions to see who can build the best predictive model by a set deadline.
