Statistical Methods for the Social Sciences
Chapter 9: Linear Regression and Correlation
Regression analysis Methods for analyzing the association between a quantitative
response variable and a quantitative explanatory variable.
We present three different, but related, aspects of regression analysis:
1. We investigate whether an association exists between the variables by testing the hypothesis
of statistical independence.
2. We study the strength of their association using the correlation measure of association.
3. We estimate a regression equation that predicts the value of the response variable from the
value of the explanatory variable.
9.1: Linear Relationships
Linear function The formula y= α + βx expresses observations on y as a linear function
of observations on x. The formula has a straight-line graph with slope
β (beta) and y-intercept α (alpha).
y = response variable and x = explanatory variable
We analyze how values of y tend to change from one subset of the population to another, as defined
by values of x.
At x=0, the equation y= α + βx simplifies to y= α + βx = α + β (0) = α.
The slope β equals the change in y for a one-unit increase in x. The larger the absolute value of β, the
steeper the line.
- When β is positive, y increases as x increases; the straight line goes upward. When a
relationship between two variables follows a straight line with β > 0, the relationship is said
to be positive.
- When β is negative, y decreases as x increases. The straight line then goes downward, and
the relationship is said to be negative.
When β = 0, the graph is a horizontal line. The value of y is constant and does not vary as x varies;
the two variables are then statistically independent.
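As a quick numerical sketch of the slope interpretation (the values α = 3 and β = 0.5 below are made up for illustration):

```python
# Hypothetical linear function y = alpha + beta * x
# (alpha = 3 and beta = 0.5 are made-up illustrative values)
alpha, beta = 3.0, 0.5

def line(x):
    return alpha + beta * x

print(line(0))            # at x = 0, y equals the intercept alpha
print(line(1) - line(0))  # a one-unit increase in x changes y by beta
print(line(9) - line(8))  # the change per unit is the same anywhere on the line
```

The last two prints give the same value: for a linear function, the change in y per one-unit increase in x is the constant β, regardless of where on the line you start.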
An association does not imply causation.
9.2: Least Squares Prediction Equation
Scatterplot A plot of the n observations as n points. The scatterplot provides a
visual check of whether a relationship is approximately linear.
Regression outlier A point that falls quite far from the trend that the rest of the data
follow. The line can be pulled toward such a point and away
from the center of the general trend of points. An observation is
called influential if removing it results in a large change in the
prediction equation. Unless the sample size is large, an observation
can have a strong influence on the slope if its x-value is low or high
compared to the rest of the data and it is a regression outlier.
Residual For an observation, the difference between the observed value and the
predicted value of the response variable, y − ŷ, is called the residual.
The prediction errors are called residuals.
Least squares estimate The least squares estimates a and b are the values that provide the
prediction equation ŷ = a + bx for which the residual sum of squares,
SSE = Σ(y − ŷ)², is a minimum.
To estimate the line y = α + βx we use ŷ = a + bx. This formula is called the prediction equation.
b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²
a = ȳ − b x̄
- A positive residual results when the observed value y is larger than the predicted value ŷ, so
y − ŷ > 0.
- A negative residual results when the observed value is smaller than the predicted value. The
smaller the absolute value of the residual, the better is the prediction, since the predicted
value is closer to the observed value.
In a scatterplot, the residual for an observation is the vertical distance between its point and the
prediction line.
We summarize the size of the residuals by the sum of their squared values. This quantity, denoted by
SSE (sum of squared errors), is SSE = Σ(y − ŷ)². The better the prediction equation, the smaller the
residuals tend to be and, hence, the smaller SSE tends to be.
The prediction line ŷ = a + bx is called the least squares line, because it is the one with the smallest
sum of squared residuals.
The least squares line:
- Has some positive residuals and some negative residuals, but the sum (and mean) of the
residuals equals 0
- Passes through the point (x̄, ȳ)
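The formulas for b and a, and the two properties just listed, can be checked numerically. The following sketch uses a small made-up dataset (the x and y values are illustrative, not from the text):

```python
# Least squares estimates for a small made-up dataset (illustrative values only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 3.0, 5.0, 4.0, 6.0]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# b = sum (x - x_bar)(y - y_bar) / sum (x - x_bar)^2,  a = y_bar - b * x_bar
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
    sum((xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar

y_hat = [a + b * xi for xi in x]                    # predicted values
residuals = [yi - yh for yi, yh in zip(y, y_hat)]   # prediction errors y - y_hat
sse = sum(e ** 2 for e in residuals)                # residual sum of squares

print(b, a)
print(abs(sum(residuals)) < 1e-9)           # residuals sum to 0
print(abs((a + b * x_bar) - y_bar) < 1e-9)  # line passes through (x_bar, y_bar)
```

For these data, b = 0.9 and a = 1.3, the residuals sum to 0 (up to rounding), and substituting x̄ into the prediction equation returns ȳ, as the two bullet points state.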
9.3: The Linear Regression Model
Deterministic For the linear model y = α + βx, each value of x corresponds to a
single value of y. Such a model is said to be deterministic. It is
unrealistic in social science research, because we do not expect all
subjects who have the same x-value to have the same y-value;
instead, the y-values vary.
Conditional distribution The distribution of y-values for all subjects having a given x-value,
such as the conditional distribution of y at x = 12. A separate
conditional distribution applies for those with x = 13.
Probabilistic model A probabilistic model for the relationship allows for variability in y at
each value of x.
Expected Value of y Let E(y) denote the mean of a conditional distribution of y. The symbol
E represents expected value.
Regression function A regression function is a mathematical function that describes how
the mean of the response variable changes according to the value of
an explanatory variable.
Conditional standard The linear regression model has an additional parameter σ describing
deviation the standard deviation of each conditional distribution. That is, σ
measures the variability of the y-values for all subjects having the
same x-value.
An equation of the form E(y) = α + βx that relates values of x to the mean of the conditional
distribution of y is called a regression function.
The function E(y) = α + βx is called a linear regression function, because it uses a straight line to
relate the mean of y to the values of x.
The estimate of σ uses SSE = Σ(y − ŷ)², which measures sample variability about the least squares
line. The estimate is s = √(SSE / (n − 2)) = √(Σ(y − ŷ)² / (n − 2)).
The term (n − 2) in the denominator of s is the degrees of freedom (df) for the estimate. When a
regression equation has p unknown parameters, then df = n − p. The equation E(y) = α + βx has two
parameters (α and β), so df = n − 2.
The estimate of the population standard deviation of a variable y is s_y = √(Σ(y − ȳ)² / (n − 1)). It
differs from the standard deviation of the conditional distribution of y, for a fixed value of x.
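The distinction between the conditional estimate s (df = n − 2) and the marginal estimate s_y (df = n − 1) can be made concrete with a small sketch. The dataset below is made up for illustration:

```python
import math

# Conditional standard deviation estimate s = sqrt(SSE / (n - 2)) versus the
# marginal standard deviation of y, for a small made-up dataset
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 3.0, 5.0, 4.0, 6.0]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Least squares fit
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
    sum((xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar

sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))   # df = n - 2: two estimated parameters (a and b)

# Marginal standard deviation of y: df = n - 1, ignores x entirely
s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))

print(s, s_y)
```

Here s is noticeably smaller than s_y: variability about the regression line is less than variability about the overall mean ȳ, because x helps predict y.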
9.4: Measuring Linear Association: The Correlation
Correlation The correlation between variables x and y, denoted by r, is
r = Σ(x − x̄)(y − ȳ) / √( (Σ(x − x̄)²)(Σ(y − ȳ)²) )
Correlation is a The correlation relates to the slope b of the prediction equation
standardized slope ŷ = a + bx by r = (s_x / s_y) b.
The slope b of the prediction equation tells us the direction of the association. The slope does not
directly tell us the strength of the association. The slope is useful for comparing effects of two
predictors having the same units.
s_x = √(Σ(x − x̄)² / (n − 1))
s_y = √(Σ(y − ȳ)² / (n − 1))
If the sample spreads are equal (s_x = s_y), then r = b. Because of the relationship between r and b,
the correlation is also called the standardized regression coefficient for the model E(y) = α + βx.
- Correlation is valid only when a straight-line model is sensible for the relationship between x
and y. Since r is proportional to the slope of a linear prediction equation, it measures the
strength of the linear association.
- −1 ≤ r ≤ 1. The correlation, unlike the slope b, must fall between −1 and +1.
- r has the same sign as the slope b. This holds because their formulas have the same
numerator, relating to the covariation of x and y, and positive denominators. Thus, r > 0
when the variables are positively related, and r < 0 when they are negatively related.
- r = 0 for horizontal lines, which have b = 0. When r = 0, there is no linear increasing or
linear decreasing trend in the relationship.
- r = ±1 when all the sample points fall exactly on the prediction line; these values correspond
to perfect positive and negative linear associations.
- The larger the absolute value of r, the stronger the linear association.
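These properties can be verified numerically. The sketch below uses made-up illustrative data; the spreads of x and y happen to be equal in it, so it also shows the special case r = b:

```python
import math

# Correlation r and its relation to the slope b (made-up illustrative data)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 3.0, 5.0, 4.0, 6.0]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)

# r = covariation / sqrt( (sum sq x) * (sum sq y) )
r = sxy / math.sqrt(sxx * syy)

b = sxy / sxx                    # least squares slope
s_x = math.sqrt(sxx / (n - 1))   # sample standard deviation of x
s_y = math.sqrt(syy / (n - 1))   # sample standard deviation of y

print(r)                                # falls in [-1, 1], same sign as b
print(abs(r - (s_x / s_y) * b) < 1e-9)  # r is the standardized slope
```

For these data r = 0.9, and multiplying b by s_x/s_y reproduces r exactly, as the standardized-slope formula states.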