Introduction:
A correlation is a standardized measure of the strength of the linear relationship between
two variables. A correlation is scaled to always be between -1 and 1.
- A high positive correlation means that when one variable increases, the other one
also tends to increase.
- A high negative correlation means that when one variable increases, the other one
tends to decrease.
- A correlation of 0 means that there is no linear association: when one variable
increases, the other shows no linear tendency to increase or decrease.
A correlation of 0 does not mean that there is no relationship between the two variables:
the relationship could be non-linear.
A correlation does not say anything about the causal effects of the variables.
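The point about non-linear relationships can be illustrated with a small sketch. The data below are made up for illustration, and pearson_r is a hypothetical helper name: it computes the correlation by hand for a perfectly linear relation and for a perfectly quadratic one.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: the covariation of x and y, scaled to lie between -1 and 1."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [-3, -2, -1, 0, 1, 2, 3]
y_linear = [2 * v + 1 for v in x]   # perfect positive linear relation
y_curved = [v ** 2 for v in x]      # y is fully determined by x, but not linearly

r_linear = pearson_r(x, y_linear)
r_curved = pearson_r(x, y_curved)
print(round(r_linear, 3), round(r_curved, 3))
```

In this constructed example the curved relation gives a correlation of 0 even though y depends completely on x.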
Linear regression is an analysis in which you attempt to summarize a bunch of data points by
drawing a straight line through them. Linear regression requires variables at interval/ratio
level. Linear regression should only be performed on linear relations.
The regression equation can be written as: y = b0 + b1x
- b0 refers to the intercept, the point where the line crosses the y-axis, and is
interpreted as: if x is 0, the predicted y is b0.
- b1 refers to the slope of the line and is interpreted as: if x increases by 1 unit, the
predicted y increases by b1 units (if b1 is negative, read "decreases").
A regression line rarely fits all the data points perfectly. There will be residual error. This
residual error is the difference between the observed score y and the predicted score ŷ: y − ŷ.
The estimated regression model is based on minimizing the sum of the squared errors,
Σ(y − ŷ)².
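As a minimal sketch of what "minimizing the sum of the squared errors" means, the code below fits a line to five made-up data points using the standard closed-form least-squares formulas for one predictor. It then compares the resulting sum of squared errors with that of an arbitrary alternative line. The names fit_line and sse_for are hypothetical helpers.

```python
def fit_line(x, y):
    """Least-squares estimates of intercept b0 and slope b1 for y = b0 + b1*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

def sse_for(b0, b1, x, y):
    """Sum of squared residuals (observed y minus predicted y) for a given line."""
    return sum((obs - (b0 + b1 * a)) ** 2 for a, obs in zip(x, y))

x = [1, 2, 3, 4, 5]          # made-up predictor scores
y = [2, 4, 5, 4, 5]          # made-up outcome scores

b0, b1 = fit_line(x, y)
sse_best = sse_for(b0, b1, x, y)
sse_other = sse_for(2.0, 0.7, x, y)   # an arbitrary other line, for comparison
print(round(b0, 2), round(b1, 2), round(sse_best, 2), round(sse_other, 2))
```

Any line other than the least-squares line gives a larger sum of squared errors on the same data.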
R-squared gives the proportion of the variance of the response variable that is
explained by the predictor variable. R-squared is always between 0 and 1.
- A very large R-squared does not mean that the model is a good predictor for new
observations.
- A very small R-squared does not mean that there is a meaningless relationship
between the variables.
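A small sketch of the computation, assuming the usual definition R-squared = 1 − SSE / SST (residual sum of squares over total sum of squares). The data are made up and r_squared is a hypothetical helper. The second data set shows one way a small R-squared can coexist with a strong relationship: the relation is real but not linear.

```python
def r_squared(x, y):
    """R-squared of a simple linear regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))   # unexplained variation
    sst = sum((b - mean_y) ** 2 for b in y)                     # total variation
    return 1 - sse / sst

x = [1, 2, 3, 4, 5]
y_roughly_linear = [2, 4, 5, 4, 5]
y_curved = [(a - 3) ** 2 for a in x]    # fully determined by x, but U-shaped

r2_linear = r_squared(x, y_roughly_linear)
r2_curved = r_squared(x, y_curved)
print(round(r2_linear, 2), round(r2_curved, 2))
```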
Week 1:
The Bayesian framework is based on the posterior distribution of one or more parameters.
Example: estimating a mean μ representing a grade (scale 0-10).
Our dataset provides information about what reasonable values
for μ could be (through what is called the likelihood function).
The prior distribution also provides information, namely the knowledge or belief
about μ before we examine our data.
The posterior is a compromise (combination) of the prior and likelihood.
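To make the "compromise" concrete, here is a minimal sketch under assumptions that are not part of these notes: a normal prior for μ, normally distributed grades, and a known standard deviation of the grades. In that textbook case the posterior mean is a precision-weighted average of the prior mean and the sample mean. All numbers and the function name normal_posterior are hypothetical.

```python
import math

def normal_posterior(prior_mean, prior_sd, sample_mean, sigma, n):
    """Posterior mean and SD of mu for a normal prior and normal data with known sigma."""
    prior_precision = 1 / prior_sd ** 2      # how much weight the prior carries
    data_precision = n / sigma ** 2          # how much weight the likelihood carries
    post_precision = prior_precision + data_precision
    post_mean = (prior_precision * prior_mean
                 + data_precision * sample_mean) / post_precision
    post_sd = math.sqrt(1 / post_precision)
    return post_mean, post_sd

# Hypothetical numbers: prior belief that the mean grade is around 6 (SD 1),
# and 25 observed grades with a sample mean of 7 (assumed known SD of 1.5).
post_mean, post_sd = normal_posterior(prior_mean=6.0, prior_sd=1.0,
                                      sample_mean=7.0, sigma=1.5, n=25)
print(round(post_mean, 2), round(post_sd, 2))
```

The posterior mean lies between the prior mean and the sample mean, closer to whichever source is more precise, and the posterior SD is smaller than the prior SD.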
Bayesian statistics assumes that we know more than just the frequency of an event in a data
set. We have some prior (=existing) knowledge (or beliefs) before we look at our own data. In
a Bayesian analysis, we add this prior knowledge or belief to the analysis. Opinions differ on
whether this is a good thing or not.
In classical / frequentist statistics there is one simple underlying definition: the probability
of an event is assumed to be the long-run relative frequency with which it occurs.
In Bayesian statistics, we use a different way of looking at probabilities. The foundation of
Bayesian statistics is Bayes’ theorem:
P(A given B) = P(B given A) × P(A) / P(B)
Central to Bayes’ theorem are conditional probabilities. For example, P(A given B) asks: what is
the probability that A will happen or is true, given that we know B has happened or is true?
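A worked numeric sketch of the theorem, with entirely hypothetical numbers: let A be "the student studied" and B be "the student passed". P(B) is obtained from the law of total probability. The function name bayes_posterior is an illustrative choice.

```python
def bayes_posterior(p_a, p_b_given_a, p_b_given_not_a):
    """P(A given B) from Bayes' theorem."""
    # Law of total probability: B can happen with A or without A.
    p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
    return p_b_given_a * p_a / p_b

# Hypothetical inputs: 70% of students study; 90% of those who study pass,
# and 40% of those who do not study pass.
p_studied_given_passed = bayes_posterior(p_a=0.7,
                                         p_b_given_a=0.9,
                                         p_b_given_not_a=0.4)
print(round(p_studied_given_passed, 2))
```

With these inputs, P(B) = 0.63 + 0.12 = 0.75, so P(A given B) = 0.63 / 0.75 = 0.84.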
The Bayesian use of conditional probabilities means that we approach an analysis in a
different way. We integrate previous knowledge and beliefs about the thing we are
interested in and then update our knowledge and beliefs based on the evidence we find in
our data.
The different definition of probability used in the Bayesian framework also implies that the
interpretation of results is somewhat different. According to Bayesians, the Bayesian
interpretation is more intuitive.
Like the frequentist approach, the Bayesian approach can be used for estimation (estimating
the true value of a parameter) and for hypothesis testing.
A frequentist interval is called a confidence interval. A Bayesian interval is called a credible
interval.
The definition of a (frequentist) p-value is the probability of observing the same or more
extreme data given that the null hypothesis is true: P(data | null hypothesis). But this does
not provide information about how likely it is that the null hypothesis is true given the data,
which would be P(null hypothesis | data).
A Bayesian probability can provide information about this: how likely is the null, or any other
hypothesis, given the data we observed.
Bayesians measure the relative support for hypotheses. Two hypotheses are compared, or
tested against one another, using the Bayes factor (BF).
A BF12 of 10 means that the support for H1 is 10 times stronger than the support for H2.
This does not imply that H1 is an excellent or perfectly true hypothesis: it is possible that there
is another hypothesis H3 that would receive much more support than H1.
A BF is not a probability, but BFs can be transformed into (relative) probabilities. First we have
to define prior model probabilities: how likely is each hypothesis before seeing the data?
The most common choice is that before seeing data, each hypothesis is considered equally
likely. This gives:
When we are interested in 2 hypotheses H1 and H2: P(H1) = P(H2) = 0.5
When we are interested in 3 hypotheses H1, H2 and H3: P(H1) = P(H2) = P(H3) = 0.333
The prior probabilities add up to 1 because they are relative probabilities divided over the
hypotheses of interest. This also holds for unequal prior probabilities, which can be
defined just as well: if we are interested in H1 and H2 and we think that H1 is more likely a
priori, we could assign P(H1) = 0.6 and P(H2) = 0.4.
The posterior model probabilities (PMPs) also add up to 1 (they, too, are relative
probabilities).
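The transformation from Bayes factors to posterior model probabilities can be sketched as follows. The assumption here is the standard relation that each PMP is proportional to the hypothesis's Bayes factor (against a common reference hypothesis) times its prior model probability. The function name posterior_model_probs is hypothetical, and the example reuses BF12 = 10 from above.

```python
def posterior_model_probs(bayes_factors, priors):
    """Convert Bayes factors (each hypothesis vs. one reference) into PMPs."""
    weighted = [bf * p for bf, p in zip(bayes_factors, priors)]
    total = sum(weighted)
    return [w / total for w in weighted]   # normalizing makes the PMPs sum to 1

# BF12 = 10: relative to the reference H2 (value 1), H1 gets the value 10.
bf_vs_h2 = [10.0, 1.0]

pmp_equal = posterior_model_probs(bf_vs_h2, [0.5, 0.5])     # equal prior probabilities
pmp_unequal = posterior_model_probs(bf_vs_h2, [0.6, 0.4])   # H1 more likely a priori
print([round(p, 3) for p in pmp_equal], [round(p, 4) for p in pmp_unequal])
```

With equal priors the PMPs are 10/11 and 1/11. With priors of 0.6 and 0.4 they become 0.9375 and 0.0625.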
Assumption: the dependent variable is a continuous measure (interval or ratio level).
Assumption: the independent variables are continuous or dichotomous (two options of
answers, such as male/female, pass/fail)
Another assumption in multiple linear regression (MLR) is linearity of relations (the L in MLR).
Assumption: there are linear relationships between the dependent variable and each of the
continuous independent variables.
This can be checked using scatterplots. A scatterplot has the (continuous) predictor on the
x-axis and the outcome on the y-axis and uses dots to represent the combination of x-y scores
for each case in the data.
A linear relation means that the scores in the scatterplot form a cloud with an oval shape
that can be described reasonably well by a straight line (i.e., not a curved or an s-shaped
relationship).
When a relation between a continuous predictor (x) and the outcome (y) is not linear, you
can add additional terms to the regression model to accommodate the non-linearity.
Assume the relation has one curve. Then a quadratic relation may represent the
observed relation between x and y better than the linear relation.
Linear: y = B0 + B1X + e
Quadratic: y = B0 + B1X + B2X² + e
This is achieved by computing a new variable, the squared version of the original X, and
running the regression with both variables X and X² as predictors. You then get two
parameter estimates, B1 and B2, where:
B1 informs you about the steepness of the overall slope (the linear trend in the
curved relation); strictly speaking, it is the slope of the curve at X = 0. The p-value when
testing B1 informs you whether the linear trend differs significantly from zero (horizontal)
or not (it does when p < .05).
B2 informs you about how curved the relation is, or stated differently, it measures the
change in slope with increasing X. The p-value when testing B2 informs you whether
the change in slope is significantly non-zero. This basically tells you whether the
quadratic relation is a better model for your data than the linear relation.
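The procedure (create X², then regress on both X and X²) can be sketched in plain Python. This is an illustration under assumptions: the data are a made-up, noise-free curve, the helper names are hypothetical, and the estimates come from solving the least-squares normal equations directly. Statistical software would also report the p-values for B1 and B2, which this sketch does not compute.

```python
def solve_linear_system(a, b):
    """Solve a*z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]        # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):                        # back substitution
        tail = sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (m[r][n] - tail) / m[r][r]
    return z

def fit_quadratic(x, y):
    """Least-squares estimates (B0, B1, B2) for y = B0 + B1*X + B2*X^2."""
    design = [[1.0, v, v ** 2] for v in x]                # intercept, X, new variable X^2
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(3)]
           for i in range(3)]
    xty = [sum(row[i] * t for row, t in zip(design, y)) for i in range(3)]
    return solve_linear_system(xtx, xty)

x = [0, 1, 2, 3, 4, 5, 6]
y = [1 + 2 * v - 0.5 * v ** 2 for v in x]    # hypothetical curve with one bend
b0, b1, b2 = fit_quadratic(x, y)
print(round(b0, 3), round(b1, 3), round(b2, 3))
```

Here the negative B2 reflects that the slope decreases as X increases.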
Assumption: there are no outliers. An outlier is a case that deviates strongly from the other
cases in the data set. This can be on one variable (e.g. everybody in the data has values
between 20-25 on variable x, but one person scored 35 on x), on 2 variables (e.g., one dot in
the scatterplot is far outside the oval cloud that contains the other dots), or on a
combination of even more variables (then numerical inspection should be used instead of
visual inspection, because the outlier will lie in a multi-dimensional space).
Outliers can be a problem because they may indicate that a data point is due to an error (e.g.
the age “8” in a sample of adults or an impossibly short reaction time) and because they can
have an outsized impact on the results.
It is not always easy to decide how to deal with outliers. There are three general options:
1. Do nothing (include the outlier in the analysis)
2. Exclude the data point (or the entire participant)
3. Change the data point:
a. To the ‘correct’ value (only if the outlier is known to be an error and when the
correct value is known), or
b. To a less extreme value, for example the mean + 2 SD (‘winsorizing’). This way
this case still has a large score but not so extreme that it will completely
dominate the results of the analysis.
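Option 3b can be sketched as follows, using the example from above in which everybody scores between 20 and 25 except one person who scores 35. One assumption is made explicit in the code: the mean and SD are computed with the outlier still included, which is a choice rather than a fixed rule. The function name winsorize_mean_sd is hypothetical.

```python
import statistics

def winsorize_mean_sd(values, k=2.0):
    """Pull values outside mean +/- k*SD back to those limits."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)      # sample SD, computed with the outlier included
    lower, upper = mean - k * sd, mean + k * sd
    clipped = [min(max(v, lower), upper) for v in values]
    return clipped, (lower, upper)

x = [20, 21, 22, 23, 24, 25, 22, 23, 21, 35]   # made-up scores; 35 is the outlier
x_winsorized, (lower, upper) = winsorize_mean_sd(x)
print(round(upper, 2), round(x_winsorized[-1], 2))
```

Only the extreme score changes. It is replaced by the upper limit, so the case keeps a large score without dominating the analysis.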
When an outlier appears to be due to an error, it may seem plausible to exclude or change it.
But still be careful with this: first, one often cannot be completely sure that it is truly an error.
Second, other data points that are not outliers might also be erroneous but would not be
excluded or changed, meaning that this could introduce a bias.
Changing values in a data set should be an absolute exception and only happen for clear,
principled reasons (which must always be documented). Excluding outliers should be
preferred to changing them. In both cases, it is very important to be transparent about any
alterations to the data (and the motivation for doing so).
When an outlier has an outsized impact on the results (regardless of whether it is an error or
not), this is important to know and document.
Whether outliers are present can be checked with a scatterplot, histogram, or boxplot.
These show outliers in the data for two variables (scatterplot) or one variable (histogram,
boxplot) at a time.
Multivariate outliers (for all variables in the model) can be assessed whilst performing the
analysis.
Standardized residuals: check whether there are outliers in the Y-space. As a rule of thumb,
the values should lie between -3.3 and +3.3. Values smaller than -3.3 or greater than +3.3
indicate potential outliers.
Note: a rule of thumb is not a strict rule.
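A minimal sketch of this check for a regression with one predictor. It assumes the simple definition of a standardized residual (the residual divided by the residual standard error); statistical packages may use slightly different variants. The data are made up, with one case deliberately given a far too high Y score, and standardized_residuals is a hypothetical helper.

```python
import math

def standardized_residuals(x, y):
    """Residuals of a simple linear regression, divided by the residual standard error."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    residuals = [b - (b0 + b1 * a) for a, b in zip(x, y)]
    s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))   # residual standard error
    return [e / s for e in residuals]

# Made-up data: 30 cases close to the line y = 2 + 0.5x ...
x = list(range(1, 31))
y = [2 + 0.5 * a + (0.2 if i % 2 == 0 else -0.2) for i, a in enumerate(x)]
y[14] += 5.0   # ... except case 14, whose Y score is far too high

z = standardized_residuals(x, y)
flagged = [i for i, v in enumerate(z) if abs(v) > 3.3]   # rule of thumb from the text
print(flagged)
```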
With Cook’s distance, it is possible to check whether there are outliers within the XY-space.
An outlier in the XY-space is an extreme combination of X and Y scores. Cook’s distance
indicates the overall influence of a respondent on the model. As a rule of thumb, values for
Cook’s distance should be lower than 1. Values higher than 1 indicate
influential respondents (influential cases).
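A sketch of Cook's distance for a regression with one predictor, using the standard formula D = e² · h / (p · s² · (1 − h)²), where e is the residual, h the leverage, s² the residual variance and p the number of estimated parameters (here 2). The data are made up: ten cases lie exactly on a line and one case combines an extreme X score with a Y score that does not follow that line. The function name cooks_distances is hypothetical.

```python
def cooks_distances(x, y):
    """Cook's distance for every case in a simple linear regression of y on x."""
    n, p = len(x), 2                      # p = number of parameters (b0 and b1)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    residuals = [b - (b0 + b1 * a) for a, b in zip(x, y)]
    s2 = sum(e ** 2 for e in residuals) / (n - p)             # residual variance
    leverages = [1 / n + (a - mean_x) ** 2 / sxx for a in x]  # extremeness on X
    return [e ** 2 * h / (p * s2 * (1 - h) ** 2)
            for e, h in zip(residuals, leverages)]

x = list(range(1, 11)) + [30]                     # last case: extreme X score
y = [float(a) for a in range(1, 11)] + [5.0]      # ... with a Y score far off the trend

d = cooks_distances(x, y)
influential = [i for i, v in enumerate(d) if v > 1]   # rule of thumb from the text
print(influential)
```

In this constructed example the influential case pulls the regression line toward itself, so its own residual is not extreme. This is why the XY-space is checked in addition to the Y-space.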
When you have to make a choice about whether or not to remove an outlier, it can be
helpful to ask if the extreme value of the participant is theoretically possible:
If not, this can be a reason to exclude the value
If so, run the analysis with and without the participant and compare the results.
Changing your data is a sensitive topic. There can be different arguments for different
choices. For example, some argue that even cases with impossible values (which are clearly
errors) should be included in the analysis because they reflect random measurement error
that the analysis should be able to deal with.