RM | Unit 130 - Covariance, Correlation, and R-squared
Book: Analysing Data Using Linear Models
Chapter 4: 4.8, 4.9, 4.10, 4.11, 4.12, 4.13, 4.14
Chapter 4.8: Pearson correlation
We see that the regression line describes data set A very well (left panel): the observed dots are very close
to the line, which means that the residuals are very small. The regression line does a worse job for data set
B (right panel), since there are quite large discrepancies between the observed Y-values and the
predicted Y-values. Put differently, the regression equation can be used to predict Y-values in data
set A very well, almost without error, whereas the regression line cannot be used to predict Y-values
in data set B very precisely. The line shown for data set B is nevertheless the least squares
regression line, so no other choice of slope or intercept would improve the fit.
In order to get to Pearson’s correlation coefficient, you first need to standardise both
the independent variable, X, and the dependent variable, Y. You standardise scores
by subtracting the mean from each value and dividing the result by the standard
deviation. So, in order to obtain a standardised value for X = x we compute
$z_X = \frac{x - \bar{X}}{\sigma_X} \qquad (4.15)$
and in order to obtain a standardised value for Y = y we compute
$z_Y = \frac{y - \bar{Y}}{\sigma_Y}$
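The standardisation step can be sketched in a few lines of Python. This is a minimal illustration, not taken from the book: the helper name `standardise` and the data values are made up, and it assumes the standard deviation is computed by dividing by n (NumPy's default, `ddof=0`).

```python
import numpy as np

def standardise(values):
    """Return z-scores: subtract the mean, then divide by the standard deviation."""
    values = np.asarray(values, dtype=float)
    # Assumption: standard deviation computed with n in the denominator (ddof=0).
    return (values - values.mean()) / values.std()

# Made-up illustration data.
x = [2.0, 4.0, 6.0, 8.0]
z_x = standardise(x)

# Standardised scores have mean 0 and standard deviation 1.
print(z_x.mean(), z_x.std())
```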
When standardised Y is regressed on standardised X, the intercept is 0 in both data sets, but the
slopes are different: in data set A, the slope is 0.997 and in data set B, the slope is 0.376.
$Z_Y = 0 + 0.997 \times Z_X = 0.997 \times Z_X \qquad (4.17)$
$Z_Y = 0 + 0.376 \times Z_X = 0.376 \times Z_X \qquad (4.18)$
These two slopes, the slopes for the regression of standardised Y-values on standardised X-values, are the
correlation coefficients for data sets A and B, respectively. For this reason, the correlation is sometimes
also referred to as the standardised slope coefficient or standardised regression coefficient.
→ The correlation is bidirectional: the correlation between Y and X is the same as the correlation
between X and Y.
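A quick numerical check of both claims (the standardised slope equals the correlation, and the correlation is the same in either direction) might look like the sketch below. The simulated data are purely illustrative and are not data sets A or B from the book; `standardise` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated illustration data, not the book's data sets.
x = rng.normal(size=200)
y = 0.4 * x + rng.normal(size=200)

def standardise(v):
    # z-scores with the standard deviation computed over n (ddof=0)
    return (v - v.mean()) / v.std()

z_x, z_y = standardise(x), standardise(y)

# Least squares line for standardised Y on standardised X.
# np.polyfit returns coefficients from highest degree down: [slope, intercept].
slope, intercept = np.polyfit(z_x, z_y, deg=1)

# Regressing standardised X on standardised Y gives the same slope.
slope_reverse, _ = np.polyfit(z_y, z_x, deg=1)

# Pearson correlation computed directly.
r = np.corrcoef(x, y)[0, 1]

print(slope, slope_reverse, r, intercept)
```

Up to floating point error, `slope`, `slope_reverse` and `r` agree, and `intercept` is 0.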
In summary, the correlation coefficient indicates how well one variable can be predicted
from the other variable. It is the slope of the regression line if both variables are standardised. If
prediction is not possible (when the regression slope is 0), the correlation is 0, too. If the prediction is
perfect, without errors (no residuals) and with a slope unequal to 0, then the correlation is either -1 or +1,
depending on the sign of the slope. The correlation coefficient between variables X and Y is usually
denoted by rXY for the sample correlation and ρXY (pronounced ’rho’) for the population correlation.
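The three boundary cases in this summary can be illustrated with small made-up data sets. The values below are assumptions chosen so that Y is either an exact linear function of X or has a least squares slope of exactly 0.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Perfect prediction with a positive slope and with a negative slope.
y_up = 3.0 + 2.0 * x
y_down = 3.0 - 2.0 * x

# Made-up values whose least squares slope on x is exactly 0.
y_none = np.array([1.0, -1.0, 0.0, -1.0, 1.0])

r_up = np.corrcoef(x, y_up)[0, 1]      # close to +1
r_down = np.corrcoef(x, y_down)[0, 1]  # close to -1
r_none = np.corrcoef(x, y_none)[0, 1]  # close to 0

print(r_up, r_down, r_none)
```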
Chapter 4.9: Covariance
The correlation is a standardised measure of how much two variables co-relate: the standardisation comes
from dividing the X- and Y-values by their respective standard deviations. There also exists an
unstandardised measure of how much two variables co-relate: the covariance. The correlation ρXY is
the slope when X and Y each have variance 1. When you multiply correlation ρXY by a quantity
indicating the variation of the two variables, you get the covariance. This quantity is the product of the
two respective standard deviations. The covariance between variables X and Y, denoted by σXY, can be
computed as:
$\sigma_{XY} = \rho_{XY} \times \sigma_X \times \sigma_Y \qquad (4.19)$
For example, if the variance of X equals 49 and the variance of Y equals 25, then the respective
standard deviations are 7 and 5. If the correlation between X and Y equals 0.5, then the covariance
between X and Y is equal to 0.5 × 7 × 5 = 17.5.
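The same arithmetic, written out as a short Python sketch (the variable names are illustrative only):

```python
import math

var_x, var_y = 49.0, 25.0   # variances from the example
rho_xy = 0.5                # correlation from the example

# Standard deviations are the square roots of the variances.
sd_x, sd_y = math.sqrt(var_x), math.sqrt(var_y)   # 7.0 and 5.0

# Equation (4.19): covariance = correlation x sd of X x sd of Y.
cov_xy = rho_xy * sd_x * sd_y
print(cov_xy)  # 17.5
```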
Similar to the correlation, the covariance of two variables indicates by how much they co-vary.
For instance, if the variance of X is 3 and the variance of Y is 5, then a covariance of 2 indicates
that X and Y co-vary: if X increases by a certain amount, Y tends to increase as well. If you want to
know how many standard deviations Y increases if X increases by one standard deviation, you can turn
the covariance into a correlation by dividing the covariance by the respective standard deviations:
$\rho_{XY} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y} = \frac{2}{\sqrt{3}\sqrt{5}} = 0.52$
Similar to correlations and slopes, covariances can also be negative. Instead of computing the
covariance on the basis of the correlation, you can also compute the covariance using the data
directly. The formula for the covariance is
$\sigma_{XY} = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{n}$
so it is the mean of the cross-products of the two variables' deviations from their means.
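Both calculations in this section can be sketched in Python. The helper names `covariance` and `correlation_from_covariance` and the small data set are made up for illustration; the covariance divides by n, following the formula in these notes, which corresponds to `bias=True` in NumPy's `np.cov`.

```python
import numpy as np

def covariance(x, y):
    """Mean of the cross-products of the deviations from the means (divides by n)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.mean((x - x.mean()) * (y - y.mean()))

def correlation_from_covariance(cov_xy, var_x, var_y):
    """Divide the covariance by the product of the two standard deviations."""
    return cov_xy / (np.sqrt(var_x) * np.sqrt(var_y))

# Worked example from the text: variances 3 and 5, covariance 2.
rho = correlation_from_covariance(2.0, 3.0, 5.0)
print(round(float(rho), 2))  # 0.52

# Made-up data to compare the hand-written formula with NumPy's version.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 9.0]
print(covariance(x, y), np.cov(x, y, bias=True)[0, 1])
```

Note that `np.cov` divides by n − 1 unless `bias=True` is passed, so the default output differs slightly from the formula used here.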