Summary: An Introduction to Statistical Learning - Machine Learning (F000942)


Summary of the Machine Learning course at UGent, taught by Dries Benoit. For the Data Science for Business and Business Engineering tracks, first Master's year.

MACHINE LEARNING
INTRODUCTION: RECAP OF DATA MINING
Machine Learning is the study of computer algorithms that can improve automatically through
experience and by the use of data. It is seen as a part of artificial intelligence (AI).

Data Mining is an interdisciplinary subfield of computer science and statistics whose overall goal is to extract information from a data set and transform it into a comprehensible structure for further use.

Regression Models

Linear Regression

A Simple Linear Regression (SLR) is a statistical method that allows us to summarize and study the relationship between two continuous (quantitative) variables. There is one dependent variable Y (output, response, outcome, …) and one independent variable X (input, feature, predictor, …).

Y = β0 + β1X

β1 is the slope of the line, representing the change in Y for a one-unit change in X. β0 is the Y-intercept, representing the value of Y when X is 0. These β parameters are estimated using the least squares method.

Residuals in SLR are the differences between the observed (actual) values of the dependent variable
and the values predicted by the regression model:

ei = yi − ŷi

Visually, the residuals represent the vertical distance between observation and regression line.

The Residual Sum of Squares (RSS) is a measure of the total error or total deviation of the observed values from the values predicted by the regression line:

RSS = ∑ ei², summed over i = 1, …, n

Residuals are used to evaluate how well the regression model fits the data. A good model will have residuals close to zero. The regression model is typically fitted by minimizing the RSS: the β parameters are chosen in such a way that the sum of squared residuals (RSS) is minimized. This method is known as the least squares method. A lower RSS indicates a better fit.

Assumptions:

- SLR assumes a linear relationship between the variables
- SLR assumes that the residuals are normally distributed and have constant variance.
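
To make the least squares idea concrete, here is a minimal Python sketch (the data values are invented purely for illustration) that estimates β0 and β1 in closed form and computes the RSS:

```python
import numpy as np

# Made-up example data: one predictor x and one response y.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Closed-form least squares estimates for simple linear regression:
# beta1 = sum((x - x̄)(y - ȳ)) / sum((x - x̄)^2), beta0 = ȳ - beta1 * x̄
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

# Residuals ei = yi - ŷi and the residual sum of squares (RSS).
y_hat = beta0 + beta1 * x
residuals = y - y_hat
rss = np.sum(residuals ** 2)

print(f"beta0 = {beta0:.3f}, beta1 = {beta1:.3f}, RSS = {rss:.3f}")
```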

Multiple Linear Regression

MLR is an extension of SLR that involves two or more independent variables to predict a single response variable. ε is the error term, representing the unobserved factors that affect Y but are not included in the model.

Y = β0 + β1X1 + β2X2 + … + βpXp + ε

The goal is to estimate the β coefficients that minimize the sum of squared differences between the observed and predicted values of Y (= the RSS). The assumptions that held for SLR still apply here. Instead of fitting a regression line, as in SLR, MLR fits a hyperplane.




Overfitting might be a problem here. This happens when you include too many predictors without sufficient data: the model then fits the training data closely but fails to predict well on new data.
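
As a sketch of how MLR can be fitted in practice (again with made-up numbers), the design matrix gets a column of ones for the intercept and np.linalg.lstsq returns the β estimates that minimize the RSS:

```python
import numpy as np

# Made-up data set: 6 observations, 2 predictors.
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 6.0],
              [6.0, 5.0]])
y = np.array([5.0, 4.5, 9.0, 8.5, 13.0, 12.5])

# Prepend a column of ones so the intercept beta0 is estimated as well.
X_design = np.column_stack([np.ones(len(y)), X])

# Least squares solution: the beta that minimizes RSS = sum((y - X beta)^2).
beta, rss, rank, _ = np.linalg.lstsq(X_design, y, rcond=None)

print("Estimated coefficients (beta0, beta1, beta2):", beta)
print("Residual sum of squares:", rss)
```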




Difference between confidence and prediction interval?

Confidence and prediction intervals are both statistical concepts used in regression analysis to
provide a range within which a parameter or a future observation is expected to fall.

A confidence interval is used to estimate the range in which we expect the population parameter
(regression coefficient) to fall with a certain level of confidence.

A prediction interval is used to estimate the range within which a future observation (new data
point) is expected to fall.

A confidence interval is usually narrower than a prediction interval: the confidence interval only reflects the uncertainty about the estimated mean response (the population parameter), whereas the prediction interval also includes the random variation of an individual observation around that mean. The uncertainty is therefore bigger in the prediction interval.
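
To make the difference concrete, here is a small sketch using statsmodels (my choice of library, with simulated data): get_prediction() returns both the confidence interval for the mean response and the wider prediction interval for a single new observation.

```python
import numpy as np
import statsmodels.api as sm

# Simulated data: one predictor and a noisy linear response.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 1.0 + 2.0 * x + rng.normal(scale=2.0, size=x.size)

# Fit an OLS model with an intercept.
model = sm.OLS(y, sm.add_constant(x)).fit()

# Intervals at a new point x = 5; the design row is [1, 5].
x_new = np.array([[1.0, 5.0]])
frame = model.get_prediction(x_new).summary_frame(alpha=0.05)

# mean_ci_* : confidence interval for the mean response (narrower)
# obs_ci_*  : prediction interval for a single new observation (wider)
print(frame[["mean", "mean_ci_lower", "mean_ci_upper",
             "obs_ci_lower", "obs_ci_upper"]])
```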

Logistic Regression

A statistical method used for modelling the probability of a binary outcome. Despite its name, it is commonly used for classification problems where the dependent variable has two outcomes (= dichotomous). It transforms values between −infinity and +infinity to values between 0 and 1.

The logistic function is used to model the probability that a given input belongs to a particular category:

p(X) = e^(β0 + β1X) / (1 + e^(β0 + β1X))

The outcome of this function is a probability between 0 and 1. To make a binary decision, a threshold is chosen (commonly 0.5): if the predicted probability is above this threshold, the observation is classified as the positive class (1); otherwise it is classified as the negative class (0).

The log of odds (= logit) function is often used to interpret the results. This equation linearly combines the input features, and the parameters represent the change in the log-odds for a one-unit change in the corresponding feature.

logit(p) = ln(p / (1 − p)) = β0 + β1X1 + … + βpXp

The logit is the natural logarithm of the odds: the probability of the event happening divided by the probability of the event not happening.

Assumptions:

- Assumes the relationship between the independent variables and the logit of the dependent
variable is linear.

The β parameters are found by maximizing the likelihood function. It measures the likelihood of
observing the given set of outcomes given a specific set of parameter values (β). The goal is to find
the set of parameter values that maximize this likelihood. (SLIDE 25)
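
A minimal sketch with scikit-learn and made-up data (note that scikit-learn's LogisticRegression maximizes a mildly regularized likelihood by default):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up binary data: one feature, class 1 becomes more likely as x grows.
X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Fitting maximizes the (penalized) likelihood of the observed outcomes.
clf = LogisticRegression()
clf.fit(X, y)

# Predicted probabilities come from the logistic function
# p(x) = 1 / (1 + exp(-(beta0 + beta1 * x))).
probs = clf.predict_proba(X)[:, 1]

# Classify with a 0.5 threshold: above -> class 1, otherwise class 0.
labels = (probs >= 0.5).astype(int)

print("beta0:", clf.intercept_[0], "beta1:", clf.coef_[0, 0])
print("probabilities:", np.round(probs, 3))
print("predicted classes:", labels)
```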




Linear Discriminant Analysis (LDA)

LDA is a dimensionality reduction and classification technique commonly used in the context of supervised learning. Its primary objective is to find a linear combination of features that characterizes or separates two or more classes in the data.




The figure (not reproduced here) shows two classes with one predictor: house-owners (pink) and non-house-owners (green). The predictor (x-axis) is the amount of money each individual has saved in their bank account. When a new data point has to be classified, its class is determined by the nearest class mean.

Assumptions:

- LDA assumes the features are normally distributed within each class
- LDA assumes that the classes share the same covariance matrix (the same spread)

The main difference with logistic regression is that logistic regression does not make assumptions about the distribution of the X's. Also, logistic regression focuses on predicting probabilities, while LDA focuses on maximizing class separability.
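
A small sketch mirroring the house-owner example above, using scikit-learn's LinearDiscriminantAnalysis (the savings figures are invented):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Invented one-predictor example: savings (in thousands of euros) for
# non-house-owners (class 0) and house-owners (class 1).
X = np.array([[10.0], [15.0], [20.0], [25.0], [60.0], [70.0], [80.0], [90.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)

# With one predictor, equal priors and a shared variance, the decision
# boundary lies halfway between the two class means.
print("class means:", lda.means_.ravel())
print("predicted class for savings = 40:", lda.predict([[40.0]])[0])
print("posterior probabilities:", lda.predict_proba([[40.0]]).round(3))
```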

Confusion Matrix

A table used in classification to evaluate the performance of a machine learning model. Provides a
summary of the predicted and actual classifications for a given set of data.

                      Actual Positive   Actual Negative
Predicted Positive    TP                FP
Predicted Negative    FN                TN



- TP; True Positive
- FP; False Positive
- FN; False Negative
- TN; True Negative

Metrics derived from the confusion matrix:

- Accuracy: (TP + TN) / N
- Error: (FP + FN) / N
- Specificity (=TNR): TN / (TN + FP)
- Precision: TP / (TP + FP)
- Sensitivity (=TPR or Recall): TP / (TP + FN)
- FPR: FP / (FP + TN)
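
A quick sketch computing these metrics from hypothetical confusion-matrix counts:

```python
# Hypothetical counts from a confusion matrix.
TP, FP, FN, TN = 40, 10, 5, 45
N = TP + FP + FN + TN  # total number of observations

accuracy    = (TP + TN) / N
error       = (FP + FN) / N
specificity = TN / (TN + FP)   # True Negative Rate
precision   = TP / (TP + FP)
sensitivity = TP / (TP + FN)   # True Positive Rate / Recall
fpr         = FP / (FP + TN)   # False Positive Rate

print(f"accuracy={accuracy:.2f}, error={error:.2f}, "
      f"specificity={specificity:.2f}, precision={precision:.2f}, "
      f"sensitivity={sensitivity:.2f}, FPR={fpr:.2f}")
```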

Receiver Operating Characteristics (ROC) Curve

The ROC curve is a graphical representation that illustrates the performance of a binary classification model at various classification thresholds. It is a tool for evaluating the trade-off between the True Positive Rate (TPR) and the False Positive Rate (FPR): the threshold is varied from 0 to 1, the TPR and FPR are calculated at each threshold, and these pairs are plotted to form the ROC curve.




This plot is summarized using the AUC-ROC metric. It is a single value that summarizes how well a binary classification model distinguishes between the two classes. A higher value indicates better performance: 1 means perfect discrimination, while 0.5 suggests random performance.
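
A minimal sketch with scikit-learn (the labels and scores are invented): roc_curve sweeps the threshold and returns the FPR/TPR pairs, and roc_auc_score gives the single-number AUC summary.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Invented true labels and predicted probabilities from some classifier.
y_true  = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.35, 0.4, 0.55, 0.6, 0.7, 0.8, 0.9])

# roc_curve varies the classification threshold and returns FPR/TPR pairs.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# AUC summarizes the whole curve (1 = perfect discrimination, 0.5 = random).
auc = roc_auc_score(y_true, y_score)

for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")
print(f"AUC = {auc:.3f}")
```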
