CHAPTER 6
MEASURING RELATIONSHIPS BETWEEN TWO QUANTITATIVE VARIABLES
6.1 WHAT IS CORRELATION?
Positive correlation = when the values of variable X and variable Y decrease or increase
together → X increases, Y increases
Negative correlation = when the values of variable X and variable Y change in opposite
directions → X increases, Y decreases
The strength of this relationship is measured by means of a correlation coefficient → ranges
from -1 (perfect negative correlation) to 1 (perfect positive correlation)
Interval- and ratio-scaled variables: Pearson’s product-moment coefficient r
Ordinal data, interval- and ratio-scaled data transformed into ranks: Spearman’s ρ
and Kendall’s τ
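A minimal illustration of the two extremes with invented vectors (any perfectly linear pair behaves this way):
> x <- c(1, 2, 3, 4, 5)
> cor(x, 2 * x + 3) # perfect positive linear relation: returns 1
> cor(x, -2 * x + 3) # perfect negative linear relation: returns -1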
6.2 THE PEARSON PRODUCT-MOMENT CORRELATION COEFFICIENT
You can create a scatterplot to visualize the relationship, including a regression line (= line
that shows the general trend in the data)
> plot(variable1 ~ variable2, main = "name of scatterplot")
> m <- lm(variable1 ~ variable2)
> abline(m)
Pearson's product-moment coefficient r is the most commonly used correlation coefficient
→ it is used for interval- and ratio-scaled data (requirement: normally distributed data)
> cor.test(variable1, variable2)
The strength of the r-value is interpreted as follows:
- Equal to or greater than 0.7, or equal to or smaller than -0.7 = strong
- Between 0.3 and 0.7, or between -0.3 and -0.7 = moderate
- Between 0 and 0.3, or between 0 and -0.3 = weak
- Exactly 0 = no correlation
→ the closer r is to 0, the more the points deviate from the regression line in the plot and the
weaker the correlation
Note: a steep slope does not mean that the correlation is strong; the slope only shows the
number of units by which y changes when x changes by one unit.
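A quick sketch of this point with simulated data (arbitrary numbers): multiplying y by 10 makes the regression line ten times steeper, but leaves r unchanged.
> x <- rnorm(100)
> y <- x + rnorm(100)
> cor(x, y) # around 0.7
> cor(x, 10 * y) # identical r: rescaling does not affect correlation
> coef(lm(y ~ x))[2] # slope around 1
> coef(lm(10 * y ~ x))[2] # slope around 10, same correlation strength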
Fitted value = the value of y that the regression line predicts for a particular x-value
Observed value = the value of y that is actually observed for a particular x-value (it usually
does not lie exactly on the line)
Residuals = difference between the observed values and the fitted values → the smaller the
residuals, the stronger the correlation
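A small simulated sketch of how residual size tracks correlation strength (all numbers invented):
> x <- rnorm(100)
> y_strong <- x + rnorm(100, sd = 0.2) # little scatter around the line
> y_weak <- x + rnorm(100, sd = 2) # much scatter
> sd(resid(lm(y_strong ~ x))) # small residuals...
> cor(x, y_strong) # ...strong correlation (close to 1)
> sd(resid(lm(y_weak ~ x))) # large residuals...
> cor(x, y_weak) # ...much weaker correlation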
REMARKS ON THE PEARSON CORRELATION TEST
1. The relationship between variables should be monotonic and linear
→ a relationship between variables is monotonic when an increase of X consistently results in
an increase (or consistently in a decrease) of Y
→ a relationship between variables is linear when Y decreases/increases by a constant
amount for every unit decrease/increase of X
A linear relationship is always monotonic, but a monotonic relationship is not always linear!
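A sketch with an exponential relationship (monotonic but clearly non-linear), using made-up x-values; Spearman's ρ (introduced below) picks up the monotonic relation perfectly, while Pearson's r does not:
> x <- 1:20
> y <- exp(x / 4)
> cor(x, y, method = "pearson") # well below 1: the relation is not linear
> cor(x, y, method = "spearman") # exactly 1: the relation is perfectly monotonic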
2. It is very sensitive to outliers
→ outliers may produce a spurious correlation because of one or a few extreme values
→ such points are called leverage points, as they pull the regression line in a particular
direction
Outliers can be excluded from the data by subsetting with square brackets; both variables are
filtered on the same condition so that the pairs stay aligned (critical.point stands for the
chosen cut-off value):
> variable1_1 <- variable1[variable1 < critical.point]
> length(variable1_1)
> variable2_1 <- variable2[variable1 < critical.point]
> length(variable2_1)
Create new regression line:
> m1 <- lm(variable1_1 ~ variable2_1)
> abline(m1, lty = 2)
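A minimal sketch of the leverage effect (fabricated numbers): one extreme point can inflate r almost on its own.
> x <- c(1:10, 50) # one extreme x-value
> y <- c(rnorm(10), 60) # with a matching extreme y-value
> cor(x, y) # close to 1, driven by a single point
> cor(x[x < 50], y[x < 50]) # typically near 0 once the leverage point is excluded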
ASSUMPTIONS OF PEARSON CORRELATION
1. The sample is randomly selected from the population it represents
2. Both variables are at least interval-scaled
3. Both variables come from a bivariate normal distribution (= for any given value of X, the
scores on Y are normally distributed) and/or the sample size is large (>30)
> library(energy) # provides mvnorm.etest
> mvnorm.etest(cbind(variable1_1, variable2_1), R = 999)
(H0 = normality)
4. The residual (error) variance is homoscedastic (= the relationship between variables
should be of equal strength across the entire range of both variables)
> library(car) # provides ncvTest and durbinWatsonTest
> ncvTest(lm(variable1_1 ~ variable2_1))
(H0 = error variance is homoscedastic)
5. The residuals are independent, there is no autocorrelation (= when the value of a variable
depends on its previous or next value)
> durbinWatsonTest(lm(variable1_1 ~ variable2_1))
(H0 = no autocorrelation)
SPEARMAN AND KENDALL
When the relationship is not linear but monotonic, one should use non-parametric
correlation statistics, such as Spearman’s ρ and Kendall’s τ
Spearman's ρ is identical to Pearson's r, but computed on ranked scores
> cor.test(variable1, variable2, method = "spearman")
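Since ρ is simply r computed on ranks, the equivalence can be checked directly (simulated data):
> x <- rnorm(30)
> y <- x + rnorm(30)
> cor(x, y, method = "spearman")
> cor(rank(x), rank(y)) # identical value: Pearson applied to the ranks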
Kendall's τ works with differences in the ranks of each pair of observations. A pair of
observations (x1, y1) and (x2, y2) is concordant if x2 and y2 are both higher or both lower
than x1 and y1; it is discordant if one coordinate differs positively while the other differs
negatively.
This method is preferred when the dataset is small and has tied ranks (when two or more
observations have identical scores and therefore identical ranks)
> cor.test(variable1, variable2, method = "kendall")
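A hand-computed sketch of the concordance logic on a tiny invented dataset without ties, where τ = (concordant − discordant) / number of pairs:
> x <- c(1, 2, 3, 4)
> y <- c(1, 3, 2, 4)
> # 6 pairs in total; only the pair (2,3)-(3,2) is discordant, the other 5 are concordant
> (5 - 1) / choose(4, 2) # (C - D) / (n(n-1)/2) = 4/6 ≈ 0.667
> cor(x, y, method = "kendall") # same value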
Two assumptions:
1. The sample is randomly drawn from the population
2. Both variables are on the ordinal scale of measurement (they will be transformed to ranks
by R automatically)
CHAPTER 7
MORE ON FREQUENCIES AND REACTION TIMES: LINEAR REGRESSION
7.1 THE BASIC PRINCIPLES OF LINEAR REGRESSION ANALYSIS
Regression explains and models the relationship between the response (dependent) variable,
and one or more explanatory (independent) variables
- one explanatory variable: simple linear regression
- more than one explanatory variable: multiple linear regression
Explanatory variables can be anything from categorical to ratio-scaled, but the response
variable should be on an interval or ratio scale
Regression is the same as correlation, but with directionality:
- correlation: the degree to which x and y are related
- regression: how variable y depends on variable x, expressed by means of a formula
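A brief simulated sketch of this directionality: swapping x and y leaves the correlation unchanged but yields a different regression line.
> x <- rnorm(50)
> y <- x + rnorm(50)
> cor(x, y) == cor(y, x) # TRUE: correlation is symmetric
> coef(lm(y ~ x)) # one line...
> coef(lm(x ~ y)) # ...a different one: regression is directional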
REGRESSION LINE
A regression line visualizes the relationship between x and y. Its position and orientation can
be described by a formula:
ŷ = b0 + bx
ŷ = the fitted (expected) values of the response variable y
b0 = the intercept, i.e. the predicted value of y when x is equal to zero
b = the coefficient that determines the slope of the regression line → when x increases by
one unit, ŷ changes by b units
x = the explanatory variable
The differences between ŷ and the actual values of y are the residuals.
The actual values of y can be described by the following formula:
y = ŷ + ε
So, the observed value of y for a given observation is the sum of its fitted value and the
residual.
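Both formulas can be verified on a fitted model (simulated data; all names invented for illustration):
> x <- rnorm(50)
> y <- 2 + 3 * x + rnorm(50)
> m <- lm(y ~ x)
> b0 <- coef(m)[1] # intercept, close to 2
> b <- coef(m)[2] # slope, close to 3
> all.equal(as.numeric(b0 + b * x), as.numeric(fitted(m))) # TRUE: ŷ = b0 + bx
> all.equal(as.numeric(fitted(m) + resid(m)), y) # TRUE: y = ŷ + ε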