Summary of the Machine Learning course at UGent, taught by Dries Benoit. For the Data Science for Business and Handelsingenieur tracks, first master.
MACHINE LEARNING
INTRODUCTION: RECAP OF DATA MINING
Machine Learning is the study of computer algorithms that can improve automatically through
experience and by the use of data. It is seen as a part of artificial intelligence (AI).
Data Mining is an interdisciplinary subfield of computer science and statistics whose overall goal is to extract information from a data set and transform it into a comprehensible structure for further use.
Regression Models
Linear Regression
A Simple Linear Regression (SLR) is a statistical method that allows us to summarize and study the relationship between two continuous, quantitative variables. There is one dependent variable Y (output, response, outcome…) and one independent variable X (input, feature, predictor…).
Y = β₀ + β₁X
β₁ is the slope of the line, representing the change in Y for a one-unit change in X. β₀ is the Y-intercept, representing the value of Y when X is 0. These β parameters are estimated using the least squares method.
Residuals in SLR are the differences between the observed (actual) values of the dependent variable
and the values predicted by the regression model:
eᵢ = yᵢ − ŷᵢ
Visually, the residuals represent the vertical distance between observation and regression line.
The Residual Sum of Squares (RSS) is a measure of the total error, or total deviation, of the observed values from the values predicted by the regression line.
RSS = ∑ᵢ₌₁ⁿ eᵢ²
Residuals are used to evaluate how well the regression model fits the data. A good model will have residuals close to zero. The regression model is typically fitted by minimizing the RSS. In other words, the β parameters are chosen in such a way that the sum of squared residuals (RSS) is minimized. This method is known as the least squares method. A lower RSS indicates a better fit.
Assumptions:
- SLR assumes a linear relationship between the variables
- SLR assumes that the residuals are normally distributed and have constant variance.
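As a minimal sketch (not from the course material), the least-squares estimates for SLR can be computed directly with NumPy; the data and variable names below are made up for illustration:

```python
# Minimal sketch: simple linear regression via closed-form least squares.
# Toy data; in the course's notation, beta0 is the intercept, beta1 the slope.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # predictor X
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # response Y

# Closed-form least-squares estimates:
# beta1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

y_hat = beta0 + beta1 * x        # fitted values
residuals = y - y_hat            # e_i = y_i - y_hat_i
rss = np.sum(residuals ** 2)     # residual sum of squares, minimized by the fit
```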
Multiple Linear Regression
Multiple Linear Regression (MLR) is an extension of SLR that uses two or more independent variables to predict a single response variable. ε is the error term, representing the unobserved factors that affect Y but are not included in the model.
Y = β₀ + β₁X₁ + β₂X₂ + … + βₚXₚ + ε
The goal is to estimate the β coefficients that minimize the sum of squared differences between the observed and predicted values of Y (= RSS). The assumptions that held in SLR still apply here. Instead of fitting a regression line, as in SLR, MLR fits a hyperplane. Overfitting might be a problem here: it occurs when you include too many predictors without sufficient data, so that the model fits the training data closely but generalizes poorly to new data.
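A minimal MLR sketch using scikit-learn's LinearRegression; the library choice and the toy data are assumptions for illustration, not from the slides:

```python
# Minimal sketch: multiple linear regression on simulated data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                       # 100 observations, p = 3 predictors
true_beta = np.array([2.0, -1.0, 0.5])              # illustrative coefficients
y = 1.0 + X @ true_beta + rng.normal(scale=0.3, size=100)

model = LinearRegression().fit(X, y)                # fits by minimizing the RSS
print(model.intercept_, model.coef_)                # estimated beta_0 and beta_1..beta_p
```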
Difference between confidence and prediction interval?
Confidence and prediction intervals are both statistical concepts used in regression analysis to
provide a range within which a parameter or a future observation is expected to fall.
A confidence interval is used to estimate the range in which we expect the population parameter
(regression coefficient) to fall with a certain level of confidence.
A prediction interval is used to estimate the range within which a future observation (new data
point) is expected to fall.
A confidence interval is usually narrower than a prediction interval: the confidence interval only reflects uncertainty about the estimated population parameter (the mean response), whereas the prediction interval additionally includes the irreducible variability of an individual observation. The uncertainty is therefore bigger in the prediction interval.
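To see both intervals side by side, a hedged sketch with statsmodels OLS (toy data; the library choice is an assumption). Its get_prediction(...).summary_frame() reports both the confidence interval for the mean response and the prediction interval for a new observation:

```python
# Sketch: confidence vs. prediction intervals on simulated SLR data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 1.0 + 2.0 * x + rng.normal(scale=1.5, size=50)

X = sm.add_constant(x)                       # adds the intercept column
res = sm.OLS(y, X).fit()

new_X = sm.add_constant(np.array([2.0, 5.0, 8.0]))
pred = res.get_prediction(new_X).summary_frame(alpha=0.05)
# mean_ci_*: confidence interval for the mean response (narrower)
# obs_ci_*:  prediction interval for a new observation (wider)
print(pred[["mean", "mean_ci_lower", "mean_ci_upper",
            "obs_ci_lower", "obs_ci_upper"]])
```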
Logistic Regression
Logistic regression is a statistical method used for modelling the probability of a binary outcome. Despite its name, it is commonly used for classification problems where the dependent variable has two possible outcomes (= dichotomous). The logistic function maps values between −infinity and +infinity to values between 0 and 1.
The logistic function is used to model the probability that a given input belongs to a particular
category.
The outcome of this function is a probability between 0 and 1. To make a binary decision, a threshold is chosen (commonly 0.5): if the predicted probability is above this threshold, the observation is classified as the positive (1) class; otherwise it is classified as the negative (0) class.
The log of odds (= logit) function is often used to interpret the results. This equation linearly combines the input features, and the parameters represent the change in the log-odds for a one-unit change in the corresponding feature. The logit is the natural logarithm of the odds, i.e. the probability of the event happening divided by the probability of the event not happening:
logit(p) = ln(p / (1 − p)) = β₀ + β₁X₁ + … + βₚXₚ
Assumptions:
- Assumes the relationship between the independent variables and the logit of the dependent
variable is linear.
The β parameters are found by maximizing the likelihood function, which measures how likely the observed set of outcomes is under a specific set of parameter values (β). The goal is to find the set of parameter values that maximizes this likelihood. (SLIDE 25)
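A minimal logistic-regression sketch with scikit-learn, applying the 0.5 threshold mentioned above (toy data; the library choice is an assumption):

```python
# Minimal sketch: logistic regression and thresholded classification.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
# Toy labels: positive class more likely when 0.5 + 1.5*x1 + 1.0*x2 is large
p = 1 / (1 + np.exp(-(0.5 + 1.5 * X[:, 0] + 1.0 * X[:, 1])))
y = rng.binomial(1, p)

# Fits beta via maximum likelihood (scikit-learn adds L2 regularization by default)
clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X)[:, 1]        # P(y = 1 | x), between 0 and 1
y_pred = (proba >= 0.5).astype(int)       # apply the classification threshold
```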
Linear Discriminant Analysis (LDA)
LDA is a dimensionality reduction and classification technique commonly used in the context of supervised learning. Its primary objective is to find a linear combination of features that characterizes or separates two or more classes in the data.
The figure shows two classes with one predictor, the classes being house-owners (pink) and non-house-owners (green). The predictor (x-axis) is the amount of money members of either class have saved in their bank account. When a new data point has to be classified, its class is determined based on the nearest class mean.
Assumptions:
- LDA assumes the features are normally distributed within each class
- LDA assumes that the classes share the same covariance matrix (the same spread in every class)
The main difference with logistic regression is that logistic regression does not make assumptions about the distribution of the x's. Also, logistic regression focuses on predicting probabilities while LDA focuses on maximizing the class separability.
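A minimal LDA sketch with scikit-learn, generating two classes with equal covariance (matching the assumptions above) and classifying a new point by nearest class mean; all names and data are illustrative:

```python
# Minimal sketch: LDA on two simulated classes with shared covariance.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(3)
cov = [[1.0, 0.3], [0.3, 1.0]]                     # same covariance for both classes
X0 = rng.multivariate_normal([0, 0], cov, size=100)
X1 = rng.multivariate_normal([2, 2], cov, size=100)
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.predict([[1.0, 1.0]]))                   # classify a new data point
```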
Confusion Matrix
A confusion matrix is a table used in classification to evaluate the performance of a machine learning model. It provides a summary of the predicted and actual classifications for a given set of data.
The ROC curve is a graphical representation that illustrates the performance of a binary classification model at various classification thresholds. It is a tool for evaluating the trade-off between the True Positive Rate (TPR) and the False Positive Rate (FPR): as the threshold varies from 0 to 1, the TPR and FPR are calculated at each threshold and plotted as points on the ROC curve.
This plot is summarized using the AUC-ROC metric, a single value that summarizes how well a binary classification model distinguishes between the two classes. A higher value indicates better performance: 1 is perfect discrimination, 0.5 suggests random guessing.
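A hedged sketch computing the confusion matrix, ROC curve, and AUC with scikit-learn; the data and model are simulated for illustration, not from the course:

```python
# Sketch: evaluating a binary classifier with confusion matrix and ROC/AUC.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.8, size=300) > 0).astype(int)

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X)[:, 1]

cm = confusion_matrix(y, (proba >= 0.5).astype(int))   # rows: actual, cols: predicted
fpr, tpr, thresholds = roc_curve(y, proba)             # TPR/FPR at every threshold
auc = roc_auc_score(y, proba)                          # 1 = perfect, 0.5 = random
print(cm, auc)
```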