Introduction to Correlational Research Methods
This summary is based on lecture slides, knowledge clips, additional explanations from tutorials and
recommended literature
Laura C. Correlational Research Methods/ Tilburg University/’23-‘24 ~1~
Introduction to correlational research
Aspects of empirical research
Sampling designs
- simple random sampling: every member of the population has an equal chance of being sampled
- stratified sampling: the population is divided based on certain criteria (= strata); then, from each stratum, a random sample is selected
- convenience sampling: the sample is made up of people who are readily available (family/friends of the researchers, students at X university, etc.)
Descriptives
- descriptives = the methods we use to describe the data we have
e.g., to summarize the data we can look at:
measures of central tendency:
mean
median (the score that separates the higher half of the data from the lower half)
mode (the score that is observed most frequently)
measures of dispersion (highlight the differences):
variance: s² = ∑(X − M)² / (n − 1)
standard deviation: s = √s² = √( ∑(X − M)² / (n − 1) )
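These descriptives can be computed with Python's standard library; the exam grades below are hypothetical, and `statistics.variance`/`statistics.stdev` use the same n − 1 denominator as the formulas above:

```python
from statistics import mean, median, mode, variance, stdev

scores = [4, 5, 5, 6, 7, 8, 9]  # hypothetical exam grades

# Measures of central tendency
m = mean(scores)      # arithmetic mean
med = median(scores)  # the middle score
mo = mode(scores)     # the most frequent score

# Measures of dispersion (n - 1 denominator)
s2 = variance(scores)  # s^2 = sum((X - M)^2) / (n - 1)
s = stdev(scores)      # s = sqrt(s^2)
```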
Inferential statistics
- inferential statistics is used to draw conclusion about a population based on the
information from a sample
- two procedures are popular when it comes to inferential statistics:
null hypothesis testing (H0 testing)
confidence interval estimation
null hypothesis testing (H0 testing) step by step:
1. formulate the null and alternative hypothesis (H0 and H1)
2. set a decision rule
3. obtain the t-value and the p-value from the output
(p-value= the probability under the assumption of no effect or no difference (null hypothesis), of
obtaining a result equal to or more extreme than what was actually observed)
4. make the decision: reject or keep the null hypothesis
example: does the average exam grade in the population (μ) equal 6.0?
1. H0: μ = 6.0 and H1: μ ≠ 6.0
2. the decision rule:
if the p-value < α we reject H0
e.g., if p<.05, we reject H0
output:
3. we have the t-value and the p-value from the output
t(29)=1.815, p=0.074
4. decision: because p > 0.05 we keep the null hypothesis => the average exam score is not statistically different from 6.0
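The four steps can be sketched in Python with a hypothetical sample of 30 grades; instead of reading the p-value from SPSS output, this sketch compares |t| to the critical value 2.045 (df = 29, α = .05, two-tailed), which is equivalent to the p < α rule:

```python
import random
from statistics import mean, stdev
from math import sqrt

# Hypothetical sample of 30 exam grades (not the course's actual data)
random.seed(42)
grades = [round(random.uniform(4.0, 9.0), 1) for _ in range(30)]

mu0 = 6.0  # step 1: H0 states the population mean equals 6.0
n = len(grades)
m, s = mean(grades), stdev(grades)

# step 3: test statistic for a one-sample t-test, df = n - 1
t = (m - mu0) / (s / sqrt(n))

# steps 2 + 4: reject H0 when |t| exceeds the critical value for
# df = 29 and alpha = .05 (two-tailed), which is about 2.045
reject_h0 = abs(t) > 2.045
```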
Measurement levels
classical measurement levels:
o nominal scale
o ordinal scale
o interval scale
o ratio scale
• nominal scale- consists of a set of categories that have different names; measurements on
a nominal scale label and categorise observations but do not make any quantitative
distinctions between observations
examples of nominal scales include classifying people by race, gender, or occupation
the measurements from a nominal scale allow us to determine whether two
individuals are different, but they do not identify either the direction or the
size of the difference
• ordinal scale- consists of a set of categories that are organised in an ordered sequence;
measurements on an ordinal scale rank observations in terms of size or magnitude
e.g., an ordinal scale consists of a series of ranks (first, second, third, and so on)
like the order of finish in a car race
with measurements from an ordinal scale, you can determine whether two
individuals are different, and you can determine the direction of difference;
however, ordinal measurements do not allow you to determine the size of the
difference between two individuals
e.g.,: in a NASCAR race, the first-place car finished faster than the second-
place car, but the ranks don’t tell you how much faster
• interval scale- a scale that consists of ordered categories that are all intervals of precisely
the same size; equal differences between numbers on the scale reflect equal differences in
magnitude; the zero point on a scale is arbitrary and does not indicate a zero amount of
the variable being measured
e.g., temperature and IQ scores (a temperature of 0º Fahrenheit does not mean that
there is no temperature, and it does not prohibit the temperature from going even
lower; an IQ score of 0 does not mean one has no intelligence)
• ratio scale- it is an interval scale with the additional feature of an absolute zero point; the
existence of an absolute, non-arbitrary zero point means that we can measure the absolute
amount of the variable; that is, we can measure the distance from 0
e.g., weight, income
- in the context of CRM, we distinguish between categorical (discrete) and quantitative (continuous) variables:
a discrete variable consists of separate, indivisible categories, often whole numbers that vary
in countable steps; no values can exist between two neighbouring categories
o e.g., the number of children a family has, how many students attend a class
each day, classifying people by gender or occupation, etc.
for a continuous variable, there are an infinite number of possible values that fall between
any two observed values; a continuous variable can be divisible into an infinite number of
fractional parts
Research designs
- experimental
- quasi-experimental
- correlational (non-experimental)
Correlational research- investigating the relationship between variables
e.g., height + shoe size, number of hours studied + exam grade
Pearson’s correlation coefficient
- Pearson’s correlation coefficient describes a linear association
- ρ (rho)= correlation in the population
r= correlation in the sample
- -1 ≤ r ≤ 1 (r can take any value between -1 and 1)
- if r = 0 => there is no linear association (but a non-linear association cannot be excluded!)
r_XY = ∑(Z_x,i · Z_y,i) / (N − 1)
where Z_x (the z-score of x) = (X − X̄) / s_x and Z_y (the z-score of y) = (Y − Ȳ) / s_y
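The z-score formula for r can be checked directly in Python; the height and shoe-size data below are hypothetical:

```python
from statistics import mean, stdev

x = [1.60, 1.70, 1.75, 1.80, 1.90]  # hypothetical heights (m)
y = [36, 40, 41, 43, 46]            # hypothetical shoe sizes

mx, my = mean(x), mean(y)
sx, sy = stdev(x), stdev(y)  # sample SDs (n - 1 denominator)

# z-scores of x and y
zx = [(xi - mx) / sx for xi in x]
zy = [(yi - my) / sy for yi in y]

# r = sum of the products of z-scores, divided by N - 1
r = sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)
```

With these (nearly linear) data, r comes out close to 1, as the scatterplot would suggest.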
example of non-linear relationships
! always check the scatterplot before interpreting Pearson's r
interpretation of correlational strength:
- the strength of a correlation is highly influenced by outliers (=an observation that
lies an abnormal distance from other values in a random sample from a
population)
statistical tests for the correlation coefficient:
H0: ρ=0; H1: ρ≠0
t-test: t = r · √((N − 2) / (1 − r²)); df = N − 2
! in SPSS we obtain the two-sided p-value for this test
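A quick sketch of this test statistic, with hypothetical values for r and N; instead of the SPSS p-value, |t| is compared to 2.048, the two-tailed critical value for df = 28 at α = .05:

```python
from math import sqrt

# Hypothetical values: r = .45 observed in a sample of N = 30
r, n = 0.45, 30
df = n - 2

# t = r * sqrt((N - 2) / (1 - r^2))
t = r * sqrt(df / (1 - r ** 2))

# Two-tailed decision at alpha = .05 (critical t for df = 28 is about 2.048)
reject_h0 = abs(t) > 2.048
```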
P-value
- the p-value is the probability of the data in the sample (r) or more extreme (further away from 0), given H0: ρ = 0
- after we decide which significance level to use (usually 5% => α = 0.05), if p < α we reject H0
Confidence intervals
- when we talk about a 95% CI we mean that if the experiment were carried out many times with different samples, the 95% confidence interval would contain the real value of the parameter of interest (e.g., ρ) in 95% of cases
- in 5% of the samples, the 95% CI will not include ρ (the true value)
- most commonly a 95% CI is used because it corresponds to α = 5% (= 0.05)
confidence interval for the mean:
CI(1−α)·100% = M ± critical value(α, two-tailed) · SE_M
to calculate the CI for the mean we need:
o the mean (M)
o the standard deviation (SD)
o the sample size (N)
o the standard error of the mean: SE_M = SD / √N
o the critical value (it depends on the desired confidence level; the critical value for a 95% CI is 1.96)
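Putting these ingredients together in Python, with hypothetical summary statistics (M = 6.4, SD = 1.2, N = 30):

```python
from math import sqrt

# Hypothetical summary statistics
m, sd, n = 6.4, 1.2, 30

se = sd / sqrt(n)  # standard error of the mean: SD / sqrt(N)
crit = 1.96        # critical value for a 95% CI

# 95% CI = M +/- 1.96 * SE
ci = (m - crit * se, m + crit * se)
```

Here the interval contains 6.0, so (consistent with the earlier t-test example) H0: μ = 6.0 could not be rejected at α = .05.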
confidence interval for r:
CI(1−α)·100% = r ± critical value(α, two-tailed) · SE_r
SE_r is the standard error (it describes the variability in the values of the sample statistic r if you draw a large number of samples from the population)
! CIs for correlation coefficients are not symmetrical (r is usually not in the middle of the CI)
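One common way to obtain such an asymmetric interval (a technique not named in these notes) is the Fisher z transformation: build a symmetric CI on the transformed scale, then transform back. The r and N below are hypothetical:

```python
from math import atanh, tanh, sqrt

# Hypothetical values
r, n = 0.45, 30

# Fisher z transformation: z = atanh(r), with SE_z = 1 / sqrt(n - 3)
z = atanh(r)
se_z = 1 / sqrt(n - 3)

# Symmetric 95% interval on the z scale
lo, hi = z - 1.96 * se_z, z + 1.96 * se_z

# Transform back with tanh: the interval for r is not symmetric around r
ci = (tanh(lo), tanh(hi))
```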
good to know:
- small sample => wider CI => less precision
- big sample => narrower CI => more precision
- if we keep all things constant and increase the % of the CI (e.g., from 90% to 95%)
the confidence interval becomes wider => we are more confident that the true value
will be within the interval (which is only possible with wider intervals)
- if the CI includes 0 we cannot reject the null hypothesis
Correlation coefficient and simple linear regression analysis
- assumptions for the correlation coefficient:
o there is independence among observations (condition satisfied when a random sample has been used)
o X and Y are linearly related (their relationship can best be described by drawing a straight line through the scatter plot, and there are no non-linear relationships between X and Y)
o there are no extreme bivariate outliers (observations with extreme scores on both variables)
Squared correlation R²_XY
R²_XY = the proportion of the variance in X you can linearly predict / explain from Y (and vice versa) = explained variance
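A small numeric check with hypothetical paired data: squaring Pearson's r gives the proportion of variance explained:

```python
from statistics import mean

# Hypothetical paired data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]

mx, my = mean(x), mean(y)

# Sums of (co)deviations
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)

# Pearson's r and the squared correlation (explained variance)
r = sxy / (sxx * syy) ** 0.5
r_squared = r ** 2  # proportion of variance in Y linearly explained by X
```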
! correlation ≠ causation
- correlations do not allow for causal interpretations, unless they are retrieved from
an experimental study
- using regression models, we can compare different theoretical models
e.g., we cannot conclude from this graph that eating more chocolate at a national level will lead to more Nobel prize winners
- there are 3 ways in which we can explain the
relationship between x and y:
o direct
o indirect
o spurious
direct relationship: x determines y
indirect relationship: a mediator Z can be found between x and y (e.g., introversion and
insomnia are positively correlated but it can also be that introverts worry more and therefore
suffer from insomnia; here worrying is a mediator)
spurious relationship: a third variable can cause both x and y (e.g., reading and health are
positively correlated but the educational level can influence both how much a person reads
and what their health is like)
- the correlation coefficient is a measure that describes the linear relationship
between variables (=> the arrow points both ways)
Simple linear regression analysis
- for this we need an independent and a dependent variable, X and Y; the arrow points towards the dependent variable (because in our theoretical model we assume that X influences Y and not the other way around)
- linear regression means we predict Y from X using a linear function
estimated regression model: Y′ = b0 + b1·X
where:
Y’ is the predicted value of Y, given X
b0 is the intercept (= the predicted value of Y′ when someone scores 0 on X; the estimated mean score of Y in a population with X = 0)
b1 is the regression coefficient (= the change in Y′ when X increases by one unit, i.e., the slope of the line); interpretation: when X increases by one unit, Y′ increases by … units
b0 and b1 are called the parameters of the model
b0 = Ȳ − b1 · X̄ and b1 = r · (s_Y / s_X)
Simple regression analysis step-by-step
1. find the best fitting straight line (find values for the coefficients- b0 and b1- for which
we can best predict Y from X)
2. decide how well you can predict Y: inspect the individual prediction errors:
e_i = Y_i − Y′_i, with e_i being the prediction error for person i (i = 1, 2, 3, …, N)
3. check if you can generalize the results to the population level (e.g., using significance
tests or confidence intervals)
find the best fitting straight line (find values for the coefficients- b0 and b1- for which we can
best predict Y from X):
the best fitting straight line is the line for which the prediction errors (e_i) are smallest:
choose b0 and b1 such that the individual prediction errors (e_i) are as small as possible (= least squares estimation)
method of least squares = we pick values for the regression coefficients (b0 and b1) in such a way that the sum of the squared prediction errors (= differences between the observed scores and the predicted scores) is as small as possible
the least squares estimators for b0 and b1 can be calculated from the correlation
coefficient (rXY) and the standard deviations (sX and sY)
b1 = r · (s_Y / s_X)
r = ∑(Z_x · Z_y) / N, where Z_x (the z-score of x) = (X − X̄) / s_x and Z_y (the z-score of y) = (Y − Ȳ) / s_y
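The whole procedure (estimate b0 and b1, predict, inspect the errors) can be sketched with hypothetical hours-studied/grade data; `stdev` uses the n − 1 convention, matching the earlier z-score formula for r:

```python
from statistics import mean, stdev

# Hypothetical data: hours studied (X) and exam grade (Y)
x = [2, 4, 5, 7, 9, 10]
y = [4.0, 5.5, 5.0, 7.0, 7.5, 9.0]

mx, my = mean(x), mean(y)
sx, sy = stdev(x), stdev(y)

# Pearson's r from z-scores (N - 1 denominator, matching stdev)
r = sum(((a - mx) / sx) * ((b - my) / sy) for a, b in zip(x, y)) / (len(x) - 1)

# Step 1: least-squares estimates b1 = r * sY / sX and b0 = Ybar - b1 * Xbar
b1 = r * sy / sx
b0 = my - b1 * mx

# Step 2: individual prediction errors e_i = Y_i - Y'_i
pred = [b0 + b1 * a for a in x]
errors = [b - p for b, p in zip(y, pred)]
sse = sum(e ** 2 for e in errors)  # sum of squared prediction errors
```

A useful sanity check on "least squares": the errors from these estimates sum to zero, and any other slope/intercept pair produces a larger sum of squared errors.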