Summary

Top summary: data mining

Course
Statistisch modelleren en datamining

Institution
Universiteit Gent (UGent)

Studying this summary is guaranteed to get you a pass. I scored 15/20 myself.

Preview: 4 of the 57 pages

Uploaded on 10 October 2024
Number of pages: 57
Written in 2022/2023
Type: Summary

Programme
Handelsingenieur

Book:
ISLRv2_website.pdf (su.domains)
Chapter 1: Introduction
Supervised learning = building a statistical model for predicting an output based on one or more
inputs
Regression = predicting a continuous or quantitative output (price, ...)
Classification = predicting a qualitative output (gender, up/down, ...)
Unsupervised learning = there are inputs, but no output that supervises the analysis
- No outcome variable, just a set of predictors/features measured on a set of samples.
- The objective is fuzzier: find groups of samples that behave similarly, find features that
behave similarly, find linear combinations of features with the most variation, ...
- It is difficult to know how well you are doing.
- Different from supervised learning, but can be useful as a pre-processing step for supervised
learning.
- We lack a response variable that can supervise our analysis.

Clustering = grouping individuals according to observed characteristics; here we are not trying to
predict an output variable
Association = determining rules that describe large portions of a dataset

ISL (= Introduction to Statistical Learning) is based on 4 premises
- Many statistical learning methods are relevant and useful in a wide range of academic and
non-academic disciplines, beyond just the statistical sciences
- Statistical learning should not be viewed as a series of black boxes : no single approach will
perform well in all possible applications
- While it is important to know what job is performed by each cog, it is not necessary to have
the skills to construct the machine inside the box
- We presume that the reader is interested in applying statistical learning methods to real-
world problems

Chapter 2: Statistical learning
= set of tools for making sense of complex datasets
X = input/predictor/independent variable
Y = output/response/dependent variable

Model: Y = f(X) + ε
f represents the systematic information that X provides about Y -> statistical learning refers to a set
of approaches for estimating f
- ε = random error term: it captures measurement errors, is independent of X and has
mean zero
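
To make the roles of f and ε concrete, here is a minimal simulation sketch. The linear form of `f`, the noise level and the range of X are assumptions chosen purely for illustration; in practice f is unknown.

```python
import random

def f(x):
    # Hypothetical "true" systematic part; in a real problem f is unknown.
    return 2.0 + 3.0 * x

def simulate(n, noise_sd=1.0, seed=0):
    """Draw n observations from Y = f(X) + eps, with eps ~ N(0, noise_sd^2) independent of X."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    eps = [rng.gauss(0.0, noise_sd) for _ in range(n)]
    ys = [f(x) + e for x, e in zip(xs, eps)]
    return xs, ys, eps

xs, ys, eps = simulate(10_000)
mean_eps = sum(eps) / len(eps)
print(f"average of the error term: {mean_eps:.3f}")
```

The printed average should be close to zero, which is the "mean zero" property of the error term.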

Why estimate f?
- prediction
- inference

1. Prediction
$\hat{Y} = \hat{f}(X)$ -> the error term averages to zero, so it drops out of the prediction
- $\hat{f}$ = estimate of f
- $\hat{Y}$ = resulting prediction for Y -> $\hat{f}$ is often treated as a black box: one is not typically concerned
with the exact form of $\hat{f}$, provided that it yields accurate predictions for Y.

Ideal predictor of Y with regard to mean-squared prediction error: $f(x) = E(Y \mid X = x)$ is the function that
minimizes $E[(Y - g(X))^2 \mid X = x]$ over all functions g(.) at all points X = x
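
A small numerical check of this claim, as a sketch under assumed values: we fix one point X = x, simulate responses there (the conditional mean 5 and noise level 2 are made up), and compare the average squared error of several constant predictions g.

```python
import random

rng = random.Random(1)
# Hypothetical setup: responses observed at one fixed point X = x, with E(Y | X = x) = 5.
ys_at_x = [5.0 + rng.gauss(0.0, 2.0) for _ in range(5_000)]

def avg_sq_error(g, ys):
    """Average squared error when every y is predicted by the constant g."""
    return sum((y - g) ** 2 for y in ys) / len(ys)

cond_mean = sum(ys_at_x) / len(ys_at_x)  # sample estimate of E(Y | X = x)
candidates = [cond_mean - 1.0, cond_mean - 0.1, cond_mean, cond_mean + 0.1, cond_mean + 1.0]
errors = [avg_sq_error(g, ys_at_x) for g in candidates]
best = candidates[errors.index(min(errors))]
print("best candidate is the conditional mean:", best == cond_mean)
```

Any prediction that moves away from the (sample) conditional mean pays an extra squared penalty equal to the squared distance it moved.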

The accuracy of $\hat{Y}$ as a prediction for Y depends on 2 quantities
- reducible error = we can potentially improve the accuracy of $\hat{f}$ by using the most appropriate
statistical learning technique to estimate f
- irreducible error = no matter how well we estimate f, we cannot reduce the error introduced
by ε (because Y is also a function of ε, which cannot be predicted using X)
o The quantity ε may contain unmeasured variables that are useful in predicting Y: if
they are not measured or are unmeasurable, they cannot be used in the prediction
o Expected value: $E(Y - \hat{Y})^2 = [f(X) - \hat{f}(X)]^2 + \mathrm{Var}(\varepsilon)$ (reducible + irreducible)
o Goal: minimize the reducible error
! The irreducible error will always provide an upper bound on the accuracy of our
prediction for Y

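Proof: decompose the expected squared error, treating $\hat{f}$ and X as fixed
$E(Y - \hat{Y})^2 = E[f(X) + \varepsilon - \hat{f}(X)]^2$
$= [f(X) - \hat{f}(X)]^2 + 2[f(X) - \hat{f}(X)] E(\varepsilon) + E(\varepsilon^2)$
$= [f(X) - \hat{f}(X)]^2 + \mathrm{Var}(\varepsilon)$
The 2nd term vanishes because the expected value of ε is 0, and E(ε²) = Var(ε) for the same reason.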
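
The split of the expected squared error into a reducible part, [f(X) - f̂(X)]², and an irreducible part, Var(ε), can be checked by simulation. This is only a sketch: the true value, the estimate and the noise level below are hypothetical numbers.

```python
import random

rng = random.Random(2)
f_x0 = 10.0      # hypothetical true value f(x0)
f_hat_x0 = 9.0   # hypothetical fixed estimate of f at x0
noise_sd = 2.0   # standard deviation of eps, so Var(eps) = 4

n = 200_000
ys = [f_x0 + rng.gauss(0.0, noise_sd) for _ in range(n)]
expected_sq_error = sum((y - f_hat_x0) ** 2 for y in ys) / n

reducible = (f_x0 - f_hat_x0) ** 2   # shrinks as the estimate of f improves
irreducible = noise_sd ** 2          # stays, however good the estimate is

print(f"simulated E(Y - Y_hat)^2: {expected_sq_error:.2f}")
print(f"reducible + irreducible:  {reducible + irreducible:.2f}")
```

Even a perfect estimate (f_hat_x0 equal to f_x0) would leave the irreducible part in place.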

2. Inference
Understand the relationship between X and Y: in this situation we wish to estimate f, but our goal is
not necessarily to make predictions for Y -> $\hat{f}$ cannot be treated as a black box: we need to know its
exact form
- which predictors are associated with the response variable?
o identifying the important predictors
- what is the relationship between each predictor and the response?
o positive or negative relationship
- what type of model best explains the relationship?

How do we estimate f?
Methods for estimating f
- parametric
- non-parametric

training data = n different data points/observations that we use to fit our model
=> goal = apply a statistical learning method to the training data in order to estimate the
unknown function f

Parametric
reduces the problem of estimating f down to one of estimating a set of parameters because it
assumes a form for f => it simplifies the problem
1. Make an assumption about the functional form of f (e.g. linear: p+1 parameters)
2. After selecting a model, use training data to fit or train the model (e.g. least squares)

Parametric and structured models: the linear model is an important example:
- specified in terms of p+1 parameters: {β0, β1, β2, ..., βp}
- estimate the parameters by fitting the model to training data
- almost never correct, but serves as a good and interpretable approximation to the unknown true
function -> good for inference
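
The two-step parametric recipe (assume a form, then fit it) can be sketched for the simplest case p = 1, where only β0 and β1 have to be estimated. The function name `fit_simple_linear` and the toy data are made up for illustration; the formulas are the standard closed-form least squares estimates for one predictor.

```python
def fit_simple_linear(xs, ys):
    """Least squares estimates (beta0, beta1) for the assumed form y = beta0 + beta1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta1 = sxy / sxx               # slope
    beta0 = y_bar - beta1 * x_bar   # intercept
    return beta0, beta1

# Toy training data (hypothetical), roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
beta0, beta1 = fit_simple_linear(xs, ys)
print(f"beta0 = {beta0:.2f}, beta1 = {beta1:.2f}")
```

The whole function f is summarised by two numbers, which is what makes the fit easy to interpret: beta1 is the estimated change in Y per unit change in X.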

Disadvantages: the model we choose will usually not match the true unknown form of f
=> potential to estimate f inaccurately if the assumed form of f is wrong
=> remedy: choose more flexible models, which requires estimating a greater number of parameters
=> more complex model -> overfitting: it follows the errors, or noise, too closely
Advantages: more interpretable (easier to explain the results)

Non-parametric
does not make an explicit assumption about the functional form of f -> attempts to get as close to the
data points as possible, without being too rough or wiggly
advantage: has the potential to fit a wider range of possible shapes of f
disadvantage: does not reduce the problem to a small number of parameters, so a larger number of observations is needed for an
accurate estimate of f

non-parametric model:
thin-plate spline: technique that does not impose any pre-specified model on f. It instead attempts
to produce an estimate for f that is as close as possible to the observed data
- the level of smoothness must be chosen carefully: too little smoothing overfits the training data
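
A thin-plate spline is too long to sketch here, so the example below swaps in a different non-parametric method, K-nearest-neighbours regression, to show the same idea: no form is assumed for f, and one tuning value (here `k`) controls how rough or smooth the estimate is. The data are made up.

```python
def knn_predict(x0, xs, ys, k):
    """Predict at x0 as the average response of the k training points nearest to x0."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    return sum(ys[i] for i in nearest) / k

# Toy training data (hypothetical).
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.0, 0.8, 0.9, 0.1, -0.8, -1.0]

rough = knn_predict(2.0, xs, ys, k=1)   # reproduces the observed value at x = 2
smooth = knn_predict(2.0, xs, ys, k=3)  # averages over the neighbours at x = 1, 2, 3
flat = knn_predict(2.0, xs, ys, k=6)    # averages over all points: no local shape left
print(rough, round(smooth, 2), round(flat, 2))
```

Small `k` follows the data (and its noise) closely, large `k` smooths everything out; the smoothness level of a spline plays the same role.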

Trade-offs
Restrictive > flexible
- for inference: more interpretable
Flexible > restrictive
- for prediction: interpretability is not of interest
- wider range of possible shapes
Prediction accuracy vs interpretability
- linear models are easy to interpret
- thin-plate splines are not
Good fit vs over-fit or under-fit
Parsimony vs black-box
- prefer a simpler model involving fewer variables over a black-box predictor
involving them all if they give the same result
The better a method performs, the less interpretable it tends to become

Supervised vs unsupervised learning
Without a response, we can still seek to understand the relationships between the variables or between the observations
- using cluster analysis or clustering: check whether observations fall into distinct groups
- this is sometimes difficult: observations cannot easily be assigned to groups when the groups overlap
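
As an illustration of clustering, the sketch below uses a bare-bones one-dimensional k-means. The notes do not name a specific clustering algorithm, so k-means, the starting centres and the data are all assumptions made for the example.

```python
def kmeans_1d(values, centers, n_iter=10):
    """Alternate between assigning values to the nearest centre and recomputing the centres."""
    groups = [[] for _ in centers]
    for _ in range(n_iter):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda j: abs(v - centers[j]))
            groups[nearest].append(v)
        # Keep the old centre if a group happens to be empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

# Hypothetical observations with two well-separated groups; no response variable is used.
values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
centers, groups = kmeans_1d(values, centers=[0.0, 6.0])
print(centers, groups)
```

With well-separated groups the assignment is clear-cut; when the groups overlap, points near the boundary could plausibly belong to either cluster.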

Regression vs classification problems
regression problems: quantitative response
- use of least squares
- use of K-nearest-neighbors
classification problems: qualitative response
- use of logistic regression: binary response
- use of K-nearest-neighbors
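
K-nearest-neighbors is listed for both problem types. For classification, the prediction is the most common class among the K closest training points. A minimal sketch, with made-up two-dimensional points and labels:

```python
from collections import Counter

def knn_classify(x0, points, labels, k):
    """Return the majority label among the k training points closest to x0 (Euclidean distance)."""
    order = sorted(
        range(len(points)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(points[i], x0)),
    )
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training data with a qualitative response (up/down).
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
labels = ["down", "down", "down", "up", "up", "up"]

print(knn_classify((1.1, 1.0), points, labels, k=3))
print(knn_classify((4.0, 4.0), points, labels, k=3))
```

For regression the same neighbours would be used, but their responses would be averaged instead of counted.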

Assessing model accuracy
No single method is best for every data set -> selecting the best approach is therefore very important

Measuring the quality of fit
Mean squared error (MSE) = measures how close the predicted value for a given observation is to the true response
value for that observation -> does the model match the observed data?
$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{f}(x_i))^2$

MSE = small if the predicted responses are very close to the true responses
MSE = large if, for some observations, the predicted and true responses differ substantially
- we are interested in the accuracy of the predictions that we obtain when we apply our
method to previously unseen test data -> not to the training data

In other words, if we had a large number of test observations (x0, y0), we could compute the average squared
prediction error for these observations, $\mathrm{Ave}(y_0 - \hat{f}(x_0))^2$ (the test MSE).
- Select the model for which this is as small as possible
- Fundamental problem: there is no guarantee that the method with the lowest training MSE
will also have the lowest test MSE
o Test MSE is often much larger than training MSE
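
The gap between training and test MSE can be seen in a small simulation. This is a sketch under assumptions: the true function sin(x), the noise level, the sample sizes and the use of K-nearest-neighbours regression (K = 1 very flexible, K = 10 smoother) are all choices made for the example.

```python
import math
import random

def knn_predict(x0, xs, ys, k):
    """Average response of the k training points nearest to x0."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    return sum(ys[i] for i in nearest) / k

def mse(xs_eval, ys_eval, xs_train, ys_train, k):
    """Average squared prediction error on the evaluation data for a K-NN fit to the training data."""
    errors = [(y - knn_predict(x, xs_train, ys_train, k)) ** 2
              for x, y in zip(xs_eval, ys_eval)]
    return sum(errors) / len(errors)

def simulate(n, rng):
    # Hypothetical truth: f(x) = sin(x), with noise of standard deviation 0.5.
    xs = [rng.uniform(0.0, 6.0) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0.0, 0.5) for x in xs]
    return xs, ys

rng = random.Random(3)
x_train, y_train = simulate(500, rng)
x_test, y_test = simulate(500, rng)

train_mse = {k: mse(x_train, y_train, x_train, y_train, k) for k in (1, 10)}
test_mse = {k: mse(x_test, y_test, x_train, y_train, k) for k in (1, 10)}

for k in (1, 10):
    print(f"K={k:2d}  training MSE={train_mse[k]:.3f}  test MSE={test_mse[k]:.3f}")
```

With K = 1 every training point is its own nearest neighbour, so the training MSE is zero, yet the test MSE is not: the lowest training MSE does not identify the method that predicts unseen data best.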
