Grasple lessons ARMS 2020-2021
Grasple lesson 1 – Introduction
Week 1
Simple linear regression: there is only one independent (predictor) variable
in the model
Correlation coefficient: standardized number that assesses the strength of
a linear relationship
o An absolute value of 1 indicates a perfect linear relationship between
two variables
o A value of 0 indicates no linear relationship between the two
variables
o It is a standardized measure
o The correlation does not mean that the movement in one variable
causes the other variable to move as well
o A high positive correlation means that when one variable increases,
the other one also increases
o A high negative correlation means that when one variable increases,
the other one decreases
Pearson’s r: allows you to compare correlations, because it is always
between -1 and 1
Pearson’s r only captures linear relations; for a non-linear relation it is not a suitable measure of strength
A variable has to be measured at interval/ratio level to calculate
correlations
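A quick sketch of the same idea outside SPSS (Python with numpy and made-up scores; not part of the lesson):

import numpy as np

# made-up scores on two interval-level variables
x = np.array([2, 4, 5, 7, 9])
y = np.array([1, 3, 5, 6, 9])

r = np.corrcoef(x, y)[0, 1]   # Pearson's r, always between -1 and 1
print(round(r, 3))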
First you draw a scatterplot, which provides valuable information about the
strength and the direction of the relationship
An experiment is required to establish that there is a cause-effect
relationship, so that other explanations can be ruled out
In essence, linear regression boils down to summarizing a bunch of data by
drawing a straight line through them
o We use linear regression to make predictions about linear relations
o The straight line is used to predict the value of one variable based
on the value of the other variable
Slope: if X increases by one unit, how much does Y increase?
Intercept: the point where the regression line crosses the y-axis
Predicted Y value: Ŷ = B0 + B1 × X (intercept + slope × X-value)
o The hat on the y (Ŷ) is used to denote that this is not the observed y-
score but the predicted y-score
o B0 = intercept
o B1 = slope
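As a made-up illustration (numbers not from the lesson): with intercept B0 = 2 and slope B1 = 0.5, a respondent who scores X = 10 gets a predicted score of Ŷ = 2 + 0.5 × 10 = 7.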
In many cases, the intercept by itself is fairly meaningless and only
serves (mathematically) to make the prediction come out right
Linear regression is an analysis in which you attempt to summarize a
bunch of data points by drawing a straight line through them
The distance between the true value y and the predicted value ŷ is called
the error/residual
Positive and negative errors cancel each other out: the sum of all errors is
always zero
o When we square the errors, they will always be positive, and they do
not cancel each other out —> this way we can look for a line that will
result in the smallest possible sum of squared errors —> the least
squares method
With the least squares method we can find a linear regression
model which fits the data best
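A minimal sketch in Python (numpy, made-up data; the lesson itself only uses SPSS) showing that the least-squares line makes the errors cancel out while minimizing the sum of squared errors:

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 2, 4, 5, 7])        # made-up data points

b1, b0 = np.polyfit(x, y, deg=1)     # least-squares slope and intercept
y_hat = b0 + b1 * x                  # predicted y-scores
errors = y - y_hat                   # residuals

print(errors.sum())                  # (approximately) zero: positive and negative errors cancel out
print((errors ** 2).sum())           # sum of squared errors, which this line makes as small as possible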
The following formula determines the slope of the line with the smallest
sum of squared errors (see it written out below):
The slope equals the correlation coefficient (Pearson’s r) times the
standard deviation of Y divided by the standard deviation of X —> you do
not need to be able to compute the best fitting linear regression model (B0
& B1) yourself, SPSS does this for you
o In the output, the slope is the regression coefficient of the variable
o The intercept is what SPSS calls the constant
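Written out as formulas (the slope is exactly the description above; the intercept formula is the standard least-squares result, not spelled out in the lesson):

B_1 = r \cdot \frac{s_Y}{s_X}, \qquad B_0 = \bar{Y} - B_1 \cdot \bar{X}

where s_Y and s_X are the standard deviations of Y and X, and \bar{Y} and \bar{X} are their means.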
Goodness of fit: assesses how well the model’s predictions fit the observed
data —> an example is the R-squared number
R-squared: determines the proportion of the variance of the response
variable that is explained by the predictor variable(s)
o It is a proportion between 0 and 1
o If the R-squared is very small, this does NOT mean that there is NO
meaningful relationship between the two variables —> it could still
be practically relevant even though it does not explain a large
amount of variance
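A sketch of how R-squared can be computed by hand (Python, same made-up data as above); with a single predictor, R-squared is simply Pearson’s r squared:

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 2, 4, 5, 7])

b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x

ss_res = ((y - y_hat) ** 2).sum()        # unexplained (residual) variation
ss_tot = ((y - y.mean()) ** 2).sum()     # total variation in the response variable
r_squared = 1 - ss_res / ss_tot          # proportion of explained variance, between 0 and 1

print(r_squared, np.corrcoef(x, y)[0, 1] ** 2)   # with one predictor, R-squared equals r squared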
Grasple lesson 2 – Multiple linear regression
Week 1
Assumptions (Initial)
Assumption 1: variables have to be continuous or dichotomous
Assumption 2: relations have to be linear
Assumption 3: there has to be an absence of outliers
The influence of a violated model assumption on the results can be severe,
therefore it is important to visualize your data
Assumptions (statistical)
Absence of outliers: click on Save in SPSS and check: standardized
residuals, Mahalanobis distance and Cook’s distance
Absence of multicollinearity: click on statistics and check: collinearity
diagnostics
Homoscedasticity: click on plots, place the variable *ZPRED (the
standardized predicted values) on the X-axis and the variable *ZRESID (the
standardized residuals) on the Y-axis
Normally distributed residuals: click on plots and check histogram
Absence of outliers: look at the residual statistics table and view the
minimum & maximum values of the standardized residuals/Mahalanobis
distance/Cook’s distance
o Standardized residuals: checks for outliers in the Y-space: values
must be between -3.3 and +3.3, otherwise they indicate outliers
o Mahalanobis distance: checks whether there are outliers in the X-
space —> extreme score on a predictor or combination of
predictors. Must be lower than 10 + 2 × (number of independent
variables)
o Cook’s distance: checks whether there are outliers in the XY-space:
extreme combination of X and Y scores —> indicates the overall
influence of a respondent on the model. Must be lower than 1
Higher values indicate influential respondents (influential
cases)
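A rough sketch of the same three checks outside SPSS (Python with statsmodels, randomly generated data standing in for a real data set; not part of the lesson):

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import OLSInfluence

# made-up data with predictors x1, x2 and outcome y
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["x1", "x2", "y"])

X = sm.add_constant(df[["x1", "x2"]])
fit = sm.OLS(df["y"], X).fit()

std_resid = fit.resid / np.sqrt(fit.mse_resid)    # standardized residuals (Y-space): flag values outside -3.3 / +3.3
cooks_d = OLSInfluence(fit).cooks_distance[0]     # Cook's distance (XY-space): flag values above 1

# Mahalanobis distance (X-space): flag values above 10 + 2 x (number of predictors)
Z = df[["x1", "x2"]].to_numpy()
diff = Z - Z.mean(axis=0)
md = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(np.cov(Z, rowvar=False)), diff)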
When you have to make a choice about whether or not to remove an
outlier, a number of things are important:
o Does this participant belong to the group about which you want to
make inferences? If not, do not include the participant in the
analyses
o Is the extreme value of the participant theoretically possible? If not,
do not include the participant in the analysis. If so, run the analysis
with and without the participant, report the results of both analyses
and discuss any differences.
The coefficients table contains information on multicollinearity in the last
columns: this indicates whether the relation between two or more
independent variables is too strong (r >0.8).
o These two variables are most likely interrelated
If you include overly related variables in your model, this has 3
consequences:
o The regression coefficients (B) are unreliable
o It limits the magnitude of R (correlation between Y and Y-hat)
o The importance of individual independent variables can hardly be
determined, if at all
SO: you DON’T want multicollinearity: perfect multicollinearity means that
your independent variables are perfectly correlated
Rule of thumb: values for the Tolerance smaller than 0.2 indicate a
potential problem, smaller than 0.1 indicate a problem
o The variance inflation factor (VIF) is equal to 1/Tolerance. So for the
VIF, values greater than 10 indicate a problem.
You can find VIF and Tolerance in the last two columns in the coefficients
table
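A sketch of the Tolerance/VIF check outside SPSS (Python with statsmodels, made-up data and invented variable names; not part of the lesson):

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# made-up data with three predictors
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["x1", "x2", "x3"])

X = sm.add_constant(df)
for i, name in enumerate(X.columns[1:], start=1):      # skip the constant
    vif = variance_inflation_factor(X.values, i)       # VIF = 1 / Tolerance
    print(name, "VIF =", round(vif, 2), "Tolerance =", round(1 / vif, 2))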
Homoscedasticity: means that the spread of the residuals must be
approximately the same for every value of X. We assess this by plotting the
standardized residuals against the standardized predicted values
o If for every predicted value (X-axis) there is approximately the same
amount of spread along the Y-axis, then the condition is met
Normally distributed residuals: although the histogram (in this example)
does not exactly follow the normal curve, the deviations are not great
enough to conclude that the condition of normally distributed residuals
has been violated
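A sketch of both residual plots outside SPSS (Python with statsmodels and matplotlib, made-up data; not part of the lesson):

import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

# made-up data with predictors x1, x2 and outcome y
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["x1", "x2", "y"])
fit = sm.OLS(df["y"], sm.add_constant(df[["x1", "x2"]])).fit()

zpred = (fit.fittedvalues - fit.fittedvalues.mean()) / fit.fittedvalues.std()   # *ZPRED
zresid = fit.resid / np.sqrt(fit.mse_resid)                                     # *ZRESID

plt.scatter(zpred, zresid)     # homoscedasticity: vertical spread should be roughly equal everywhere
plt.axhline(0, color="grey")
plt.figure()
plt.hist(zresid, bins=20)      # normally distributed residuals: roughly bell-shaped histogram
plt.show()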
Performing and interpreting MLR
If all the assumptions are met, the regression model can be interpreted
Multiple correlation coefficient R: this value indicates the correlation
between the observed satisfaction scores (Y) and the predicted satisfaction
scores (Y-hat)
o It is used to say something about how good the model is at
predicting satisfaction (in this case!!)
R-squared: assesses how much variance of the dependent variable is
explained by the model
o Refers to the proportion of explained variance in the sample
Adjusted R-squared: is an estimate of the proportion of explained variance
in the population. It adjusts the value of R-squared on the basis of the
sample size n and the number of predictors in the model k
o The estimated proportion of explained variance in the population is
always somewhat lower than the proportion of explained variance in
the sample
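For reference, the usual adjustment (a standard formula, not written out in the lesson), with sample size n and number of predictors k:

R^2_{adj} = 1 - (1 - R^2) \cdot \frac{n - 1}{n - k - 1}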
F-test: considers whether the model as a whole is significant
o Here we look at whether the three independent variables together
can explain a significant part of the variance
In the ANOVA table, we only look at whether the models as a whole are
significant.
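For reference, the overall F-test in the ANOVA table can be written as (standard formula, not shown in the lesson):

F = \frac{R^2 / k}{(1 - R^2) / (n - k - 1)}, \qquad df_1 = k, \; df_2 = n - k - 1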