Lecture 9: Association: interval and ordinal variables
When changes in one variable correspond to similar changes in another variable = positive
correlation → represented by the correlation coefficient (r), which has a positive value up to a max of 1.
→ correlation doesn't imply causation (it does not show that one variable causes change in the other) →
it just measures changes in variables that co-occur.
Correlation of zero when changes in one variable bear no relation to changes in another.
When changes in one variable correspond with opposite changes in another = negative
correlation → represented by the correlation coefficient (r), which has a negative value down to a minimum
of -1 → e.g. fast vs slow, heavy vs light and reflected movements.
→ the size of the correlation coefficient indicates the strength of the
relationship between the two variables.
Association measures for interval and ordinal variables: see picture.
→ smallest correlation is -1 and the largest correlation is +1.
Covariance: example → 5 friends give a movie a score →
second variable is their age → what do we observe when we look at these 2
variables? (see pictures left).
- If one variable goes up (score) the other goes down (age) → in the
first graph you see that for the first friend the score is above average, but the
age is below average → this holds for all friends → can also put them in the
same graph (see picture right), with score on y-axis and age on x-axis →
age and score covary → says something about the direction of the
association → it is a negative association → covariation tells something
about direction, but not yet about the strength of an association.
Example 2: 3 lecturers (A, B and C) that all graded the same assignments.
- Lecturer A: 2, 7, 8, 8, 10 → average grade = 7, standard deviation = 3.
- Lecturer B: 1, 6, 7, 7, 9 → average grade = 6, standard deviation = 3.
- Lecturer C: 3, 7, 5, 7, 8 → average grade = 6, standard deviation = 2.
→ can compare lecturer A and B → B grades every assignment one point lower than A → their scores
vary identically (SDs are equal) → means they covary fully and the grades correlate maximally (r = 1).
→ can compare C and A → C grades assignments with less variance than A → not
identical positions (sometimes C grades higher, sometimes A) → means they do covary, but less
than B covaries with A → grades from C and A correlate, but not maximally.
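The comparison of the three lecturers can be checked numerically. A minimal sketch in plain Python, assuming the sample standard deviation (dividing by n-1) as used in these notes; the helper names mean, sd and pearson_r are illustrative and not part of the lecture:

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def sd(values):
    # Sample standard deviation: squared deviations divided by n-1.
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def pearson_r(x, y):
    # Covariance divided by the product of the two standard deviations.
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (sd(x) * sd(y))

grades_a = [2, 7, 8, 8, 10]
grades_b = [1, 6, 7, 7, 9]
grades_c = [3, 7, 5, 7, 8]

for name, grades in (("A", grades_a), ("B", grades_b), ("C", grades_c)):
    print(name, mean(grades), sd(grades))   # means 7, 6, 6 and SDs 3, 3, 2

print(round(pearson_r(grades_a, grades_b), 3))  # 1.0: B is always A minus 1
print(round(pearson_r(grades_a, grades_c), 3))  # 0.875: correlated, but not maximally
```

Because B's grades are A's grades shifted by a constant, all deviations from the mean are identical, which is why the correlation reaches its maximum.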
Covariance for A and C: lecturer A on x-axis and C on y-axis → coordinate
system through centre of gravity: (x̄, ȳ) = (7, 6) → can calculate x-deviations
compared to average of x (7) → gives x-dev: dx = -5, 0, 1, 1, 3 → sx² = Σ(dx·dx)/(n-1)
= (25+0+1+1+9)/4 = 9 (variance of x).
→ covariance is similar, but with x- and y-deviations: Σ(dx·dy)/(n-1) → so we also
need y-deviations: dy = -3, 1, -1, 1, 2 → Σ(dx·dy) = (-5 × -3) + (0 × 1) + (1 × -1) + (1 × 1) + (3 × 2)
= 21 → cov = 21/4 = 5.25
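The hand calculation for lecturers A and C can be retraced step by step. A small sketch in plain Python; the variable names (dx, dy, products, var_x, cov_xy) are chosen here for illustration only:

```python
grades_a = [2, 7, 8, 8, 10]   # x: lecturer A
grades_c = [3, 7, 5, 7, 8]    # y: lecturer C
n = len(grades_a)

mean_x = sum(grades_a) / n    # 7.0
mean_y = sum(grades_c) / n    # 6.0

# Deviations from the centre of gravity (mean_x, mean_y).
dx = [x - mean_x for x in grades_a]   # [-5.0, 0.0, 1.0, 1.0, 3.0]
dy = [y - mean_y for y in grades_c]   # [-3.0, 1.0, -1.0, 1.0, 2.0]

# Variance of x: squared x-deviations, summed and divided by n-1.
var_x = sum(d * d for d in dx) / (n - 1)          # 36 / 4 = 9.0

# Covariance: products of x- and y-deviations, summed and divided by n-1.
products = [a * b for a, b in zip(dx, dy)]        # [15.0, 0.0, -1.0, 1.0, 6.0]
cov_xy = sum(products) / (n - 1)                  # 21 / 4 = 5.25

print(var_x, cov_xy)
```

Only the third assignment contributes a negative product, because there A graded above average and C below average.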
Covariance is combined variance:
→ can be positive or
negative and is an indication of the correlation → covariance =
left graph = +5.25, right graph = -2.0 → scale-sensitive →
the value of the covariance depends on the scale of the variables →
covariance gives direction (negative or positive) → r = cov/(sx·sy) = 5.25/(3 × 2) = 0.875 → r² = 0.77
(77% linearly explained) → r is not scale-sensitive → this means you can compare the correlation
coefficient r across different studies → r also indicates whether a correlation is large or small →
summary r:
- r = coefficient of linear association (standardized covariance) → -1 ≤ r ≤ +1 → sign in
front of r shows whether it is a negative/positive correlation.
- r = standardized regression coefficient b in case of simple regression (when you have
only 1 independent variable).
- r² = proportion of variation in Y linearly explained by X (covariance is not an association
measure, because it is scale-sensitive).
- Example: if r = -0.5 → clearly negative correlation → a 1.0 sx increase in x is associated with a
0.5 sy decrease in y → r² = 0.25, so 25% of the Y-variation is linearly explained by X.
- r² < 0.09: weak linear association
- 0.09 ≤ r² < 0.25: medium linear association
- r² ≥ 0.25: strong linear association
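To illustrate why the covariance is scale-sensitive and r is not, the sketch below rescales lecturer A's grades from a 1-10 to a 10-100 scale. The rescaling factor and the helper names (covariance, correlation, strength) are assumptions made for this illustration; the strength labels follow the r² thresholds listed above.

```python
from math import sqrt

def covariance(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def correlation(x, y):
    # r = cov / (sx * sy); the variance of x is the covariance of x with itself.
    return covariance(x, y) / sqrt(covariance(x, x) * covariance(y, y))

def strength(r):
    # Labels based on the r-squared thresholds from the summary.
    r2 = r * r
    if r2 < 0.09:
        return "weak"
    if r2 < 0.25:
        return "medium"
    return "strong"

grades_a = [2, 7, 8, 8, 10]
grades_c = [3, 7, 5, 7, 8]
grades_a_x10 = [10 * g for g in grades_a]   # same grades on a 10-100 scale

cov_original = covariance(grades_a, grades_c)       # 5.25
cov_rescaled = covariance(grades_a_x10, grades_c)   # 52.5: ten times larger
r_original = correlation(grades_a, grades_c)        # 0.875
r_rescaled = correlation(grades_a_x10, grades_c)    # still 0.875

print(cov_original, cov_rescaled)
print(round(r_original, 3), round(r_rescaled, 3), strength(r_original))
```

The covariance changes with the unit of measurement, while r stays the same, which is what makes r comparable across studies.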
Eta vs r: eta = more general measure for the dependency of Y on X → eta² = proportion of variation in Y
explained by X (see lecture 3) → eta² ≥ r² (because r² is only the proportion linearly explained) →
advantage eta: (1) variable X can have any measurement level and (2) it is a more general
association measure → disadvantage eta: (1) it is less specific than r (because it has no direction) and (2)
eta of Y on X ≠ eta of X on Y, so eta is not a symmetrical measure (r = symmetrical measure).
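Both properties (eta² ≥ r² and the asymmetry of eta) can be seen with the lecturer grades. The sketch below assumes the usual definition of eta² as between-group sum of squares divided by total sum of squares, with each distinct X value treated as a category; this definition is taken to match lecture 3 but is not spelled out in these notes, and the function name eta_squared is illustrative.

```python
from collections import defaultdict

def eta_squared(x, y):
    """Proportion of variation in y explained by the categories of x."""
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)
    grand_mean = sum(y) / len(y)
    ss_total = sum((yi - grand_mean) ** 2 for yi in y)
    # Between-group variation: group size times squared distance of the
    # group mean from the grand mean.
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    return ss_between / ss_total

grades_a = [2, 7, 8, 8, 10]
grades_c = [3, 7, 5, 7, 8]

eta2_c_on_a = eta_squared(grades_a, grades_c)   # 14 / 16 = 0.875
eta2_a_on_c = eta_squared(grades_c, grades_a)   # 35.5 / 36, about 0.986
r_squared = 0.875 ** 2                          # about 0.766, from r = 0.875

print(eta2_c_on_a, round(eta2_a_on_c, 3), round(r_squared, 3))
```

With so few observations per category eta² is necessarily high, so the numbers only illustrate the two properties and say little about the grades themselves.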
Picture left shows how 2 variables covary
(consumption of cheese vs number of people that died
by becoming tangled up in their bedsheets) → high
correlation: r = 0.95 → however, this correlation
is not meaningful.
→ you can find a high correlation between
variables that have no meaningful relation → so you also have to base
the selection of variables on existing research and theories.
Rank correlation: use a rank correlation measure if (a) one or both variables are of
ordinal measurement level or (b) both are scale variables, but the trend is not
linearly increasing or decreasing, but curved (see picture right).
→ advantage rank correlation: more generally applicable → disadvantage rank
correlation: it is less specific → there are 2 rank correlation measures: (1)
Spearman's rs and (2) Kendall's tau.
- Rank correlation coefficient
Spearman's rho (rs): rs is similar to r, but now
we apply it to rank scores → have scores of
lecturer A and lecturer C (see graph left) with r
= 0.875 → instead of the scores we use their ranks
→ ranking positions 1, 2, 3, 4 and 5 → the scores at
positions 3 and 4 are equal, so both get the
average rank (3.5 twice) → have to rank
both on y-axis and on x-axis → these rank
scores are used in the calculation.
→ covariance: deviation of x (rank for lecturer A) is multiplied by the deviation of y (rank for
lecturer C) → needs to be divided by the 2 standard deviations → rs turns out a little lower than the
Pearson correlation that was calculated before (0.763 vs 0.875) → calculation is similar, but instead
of using original scores, we use the rank scores.
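The ranking step and the resulting rs can be sketched as follows. This is a plain-Python illustration, not the SPSS procedure; the helper names average_ranks and pearson_r are made up for this example.

```python
from math import sqrt

def average_ranks(values):
    """Rank from low to high; tied values share the average of their positions."""
    ordered = sorted(values)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1      # first position taken by this value
        count = ordered.count(v)          # number of tied values
        ranks.append(first + (count - 1) / 2)
    return ranks

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    sx = sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))
    sy = sqrt(sum((b - my) ** 2 for b in y) / (n - 1))
    return cov / (sx * sy)

grades_a = [2, 7, 8, 8, 10]
grades_c = [3, 7, 5, 7, 8]

ranks_a = average_ranks(grades_a)   # [1.0, 2.0, 3.5, 3.5, 5.0]
ranks_c = average_ranks(grades_c)   # [1.0, 3.5, 2.0, 3.5, 5.0]

# Spearman's rs is Pearson's r applied to the rank scores.
r_s = pearson_r(ranks_a, ranks_c)
print(round(r_s, 3))                # 0.763
```

Ranking pulls the outlying first assignment (grades 2 and 3) closer to the rest, which is one way to see why rs ends up lower than r here.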
Kendall's tau (τ): considers all pairs of points → a pair of points is called concordant if 1 point
in the pair has both a higher x- and a higher y-value than the other → concordant pairs are shown
by arrows in upward direction (see picture right).
- Number of concordant pairs k+ = 7 (number of arrows in upward direction).
- Number of discordant pairs k- = 1 (number of arrows in downward
direction) → x-value is larger for point to the right, but y-value is
larger for point to the left (for one point x-value is higher and for one
point the y-value is higher).
- Number of neutral pairs = 2 (one pair with same y-value and one with
same x-value).
→ when both the x- and the y-value are equal, the two points lie on exactly the same spot.
tau-a = (concordant - discordant pairs) / all pairs = (k+ - k-)/10 = (7 - 1)/10 = 0.6.
→ if we also take the neutral pairs into account you get tau-b and tau-c → don't calculate these by hand, but
through SPSS → gives tau-b = 0.67 and tau-c = 0.64.
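The pair counting can be reproduced with a short loop over all 10 pairs of points for lecturers A and C. The tau-b and tau-c formulas in this sketch are the standard textbook definitions (tau-b corrects the denominator for ties in x and in y, tau-c uses the smaller number of distinct values m); they are assumed to correspond to what SPSS reports, and the variable names are illustrative.

```python
from itertools import combinations
from math import sqrt

grades_a = [2, 7, 8, 8, 10]   # x: lecturer A
grades_c = [3, 7, 5, 7, 8]    # y: lecturer C

concordant = discordant = tied_x = tied_y = 0
for (x1, y1), (x2, y2) in combinations(zip(grades_a, grades_c), 2):
    if x1 == x2 or y1 == y2:
        # Neutral pair: tied on x, on y, or on both.
        tied_x += x1 == x2
        tied_y += y1 == y2
    elif (x2 - x1) * (y2 - y1) > 0:
        concordant += 1       # x and y differ in the same direction
    else:
        discordant += 1       # x and y differ in opposite directions

n = len(grades_a)
n_pairs = n * (n - 1) // 2                        # 10 pairs
neutral = n_pairs - concordant - discordant       # 2

tau_a = (concordant - discordant) / n_pairs
tau_b = (concordant - discordant) / sqrt((n_pairs - tied_x) * (n_pairs - tied_y))
m = min(len(set(grades_a)), len(set(grades_c)))   # 4 distinct values each
tau_c = 2 * (concordant - discordant) / (n ** 2 * (m - 1) / m)

print(concordant, discordant, neutral)                 # 7 1 2
print(tau_a, round(tau_b, 2), round(tau_c, 2))         # 0.6 0.67 0.64
```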
Picture left shows 4 examples of correlations → top left: r = 0.9, so
positive correlation → top right: r = -0.3, so negative correlation and the
association is less strong, because the points are more spread out and the
slope is less steep → bottom left: rs is higher than r, because the
relation is a bit curved → bottom right: it is also a curved pattern,
so eta is more suitable.
Correlation in SPSS: Menu <Analyze> <Correlate> <Bivariate...>;
- Tick: Pearson, Kendall’s tau-b, Spearman;
- Select the two variables;
- ‘Test of Significance’: ‘One-tailed’ or
‘Two-tailed’ (dependent on hypothesis);
<OK>
Output: pictures right → 1-tailed, so suitable for a
directional hypothesis, which is the case for the grades in the
example, because a positive correlation is expected → correlation is symmetrical,
because for C and A the Pearson correlation is 0.875 and for A and C it is
also 0.875 (is a symmetrical matrix) → for Kendall's tau the p-value is higher than for Pearson's r,
which means that this measure has less power.
- So for lecturer example: Pearson Correlation = 0.875 (r) and p1 = 0.026 → Spearman’s rho =
0.763 (rs) and p1 = 0.067 → Kendall’s tau_b = 0.667 (tau) and p1 = 0.059.
- r is most extreme (because of the outlier) → values of rs and tau are smaller and just not
significant (p1 just above 0.05) → p-values for rs and tau are almost equal.
3 correlation tests: to find out the statistical significance.
1. Pearson's r (Student distribution): testing H0: ρ = 0 → test statistic: t = r·√(n-2) / √(1-r²) →
Student (n-2) distributed → SPSS calculates the t with the exceedance probability p.
2. Spearman's rho: testing H0: ρs = 0 → identical formulation, but for r we use rs in the
formula for t.
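The two tests can be retraced for the lecturer example with the standard test statistic t = r·√(n-2)/√(1-r²) on n-2 degrees of freedom. The sketch below stays within the Python standard library and approximates the one-tailed exceedance probability by integrating the Student density with Simpson's rule; the function names and the number of integration steps are choices made for this illustration, not the way SPSS computes p.

```python
from math import gamma, pi, sqrt

def t_statistic(r, n):
    # Test statistic for H0: rho = 0, Student distributed with n-2 df.
    return r * sqrt(n - 2) / sqrt(1 - r * r)

def t_density(u, df):
    # Density of the Student t distribution with df degrees of freedom.
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + u * u / df) ** (-(df + 1) / 2)

def one_tailed_p(t, df, steps=2000):
    """P(T >= |t|): 0.5 minus the area under the density between 0 and |t|."""
    t = abs(t)
    h = t / steps                     # steps must be even for Simpson's rule
    total = t_density(0, df) + t_density(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    return 0.5 - total * h / 3

n = 5
t_pearson = t_statistic(0.875, n)     # about 3.13
t_spearman = t_statistic(0.763, n)    # about 2.04, same formula with rs

p_pearson = one_tailed_p(t_pearson, n - 2)
p_spearman = one_tailed_p(t_spearman, n - 2)
print(round(p_pearson, 3), round(p_spearman, 3))   # 0.026 0.067
```

These values agree with the one-tailed p-values from the SPSS output for Pearson's r and Spearman's rho in the lecturer example.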