Linear and Generalized Linear Models
Lecture 1: Linear regression
O&L: 11.1 – 11.6 & 12.1 – 12.6
Regression Analysis
Regression analysis provides the user with a functional relationship between the response variable
(dependent) and explanatory variables (independent). The regression equation can provide estimates
of the response variable for values of the explanatory variable(s) not observed in the study, i.e.
predict response values based on the data that we have.
- Prediction requires a unit of association: there should be an entity that relates the
response and explanatory variables.
A. Simple linear regression
= one response variable is predicted from one regressor (= explanatory variable). The equation for
predicting the dependent/response variable is a linear function. The simple linear
regression equation looks like this:
→ estimated response value = intercept + (slope * independent value x)
This equation gives us the predictable part, but in practice there is always an unpredictable part too.
We call this the random error term (epsilon). This error term includes the effects of all other known or
unknown factors (everything not captured by the intercept and the slope of the explanatory variable). The new equation is:
→ y = beta0 + (beta1 * x) + epsilon, with beta0 the true intercept and beta1 the true slope
In regression studies, the values of the independent variable (the x values) are usually predetermined
constants, so the only source of randomness is the error term.
Formal assumptions of regression analysis:
1. Linearity → the relation between the response variable and the explanatory variable(s) is linear.
This means that the slope of the equation doesn’t change if x changes.
- The errors all have expected values of zero → E(ei) = 0 for all i.
2. Constant variance → the errors all have the same variance → Var(ei) = sigma^2 for all i.
3. Independence → the errors are independent of each other.
4. Normal distribution → the errors are all normally distributed.
These assumptions are illustrated in Figure 11.2. The actual values of the dependent variable are
distributed normally with mean values falling on the regression line and the same standard deviation at
all values of the independent variable. The only assumption not shown in the figure is independence
from one measurement to another.
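As a rough illustration of these assumptions, the sketch below simulates data from the model with fixed x-values and independent normal errors. The parameter values (beta0 = 2, beta1 = 0.5, sigma = 1) and the x grid are made up for the example and do not come from the course material.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical true parameters (assumptions for illustration only)
beta0, beta1, sigma = 2.0, 0.5, 1.0

# The x-values are predetermined constants, not random
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Independent normal errors with mean 0 and the same SD at every x
errors = [random.gauss(0.0, sigma) for _ in x]

# Response = predictable part (the line) + random error
y = [beta0 + beta1 * xi + ei for xi, ei in zip(x, errors)]

for xi, yi in zip(x, y):
    line_value = beta0 + beta1 * xi
    print(f"x = {xi:2d}  mean on line = {line_value:5.2f}  observed y = {yi:5.2f}")
```

Rerunning with a different seed changes only the error terms; the x-values and the line itself stay the same.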
These formal assumptions are made in order to derive the significance tests and prediction methods
that follow. To start, we can begin to check these assumptions by looking at a scatterplot of the data.
If the points fall roughly along a straight line, linear regression is reasonable.
- Smoother = sketch a curve through the scatterplot data (e.g. LOWESS & spline fit).
If a scatterplot does not appear linear, it can often be straightened out by a transformation of either
the response or the explanatory variable. The transformed variable should be thought of as simply
another variable. Three common transformations are: square root, natural logarithm, inverse. Finding
a good transformation often requires trial and error.
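A small sketch of how a transformation can straighten a curved relation. The data are invented and follow an exact exponential curve, so the natural logarithm of y is exactly linear in x; real data would only be approximately straightened. The helper `slope` is a hypothetical name used for this example.

```python
import math

# Made-up data that follow an exact exponential curve: y = 2 * exp(0.5 * x)
x = [1, 2, 3, 4, 5, 6]
y = [2.0 * math.exp(0.5 * xi) for xi in x]


def slope(xs, ys):
    """Least-squares slope of ys on xs."""
    xbar = sum(xs) / len(xs)
    ybar = sum(ys) / len(ys)
    sxx = sum((xi - xbar) ** 2 for xi in xs)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(xs, ys))
    return sxy / sxx


# Successive steps in y keep growing, so the raw scatterplot is curved
raw_steps = [y[i + 1] - y[i] for i in range(len(y) - 1)]

# After a natural-log transformation the steps are constant: a straight line
log_y = [math.log(yi) for yi in y]
log_steps = [log_y[i + 1] - log_y[i] for i in range(len(log_y) - 1)]

print("steps in y:     ", [round(s, 2) for s in raw_steps])
print("steps in log(y):", [round(s, 2) for s in log_steps])
print("slope of log(y) on x:", round(slope(x, log_y), 3))
```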
B. Estimating model parameters
The regression analysis problem is to find the best straight-line prediction that fits the scatterplot of
the observed data. With the equation of this line, we can predict unknown values (i.e. estimate the population
quantities from the sample values). This method is called the least-squares method, because it
chooses the estimates b0 and b1 (of beta0 and beta1) to minimize the sum of squares of the errors/residuals.
We square the distances, so the negative & positive distances don’t cancel each other out.
In Figure 11.10, the prediction errors/residuals are shown as vertical deviations from the line.
Deviations from the mean (= 14) are indicated by the larger brace.
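A minimal sketch of the least-squares calculation, using the standard formulas b1 = Sxy / Sxx and b0 = ybar - b1 * xbar. The five data points are invented for illustration, and Sxy (the sum of cross-products of deviations) is notation not used elsewhere in these notes.

```python
# Made-up sample (not the data behind Figure 11.10)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

xbar = sum(x) / n
ybar = sum(y) / n

# Sums of squared deviations and of cross-products
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

# Least-squares estimates of slope and intercept
b1 = sxy / sxx
b0 = ybar - b1 * xbar

# Residuals are the vertical deviations from the fitted line
fitted = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]
sse = sum(e ** 2 for e in residuals)

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, sum of squared residuals = {sse:.3f}")
```

Any other choice of intercept and slope would give a larger sum of squared residuals for these data.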
1. Slope estimation
The quality of the estimation of the slope b1 is influenced by two quantities: error variance &
variation of the independent variable Sxx (sum of squared deviations of x).
- The greater the error variance of the y-values for a given x-value, the larger the
variability of the estimated slope b1. If the points scatter widely around the regression line, it’s
difficult to estimate that line, so we want the error variance to be small.
- The smaller the Sxx, the harder it is to estimate the rate of change in y, because there is
almost no difference between the x-values in the data. If the price of a brand of diet soda has
not changed for years, it is obviously hard to estimate the change in quantity demanded when
price changes.
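Both effects can be read off the standard error of the slope under the assumptions listed earlier. The symbol \sigma_\varepsilon for the true error standard deviation is notation introduced here:

```latex
\sigma_{b_1} = \frac{\sigma_\varepsilon}{\sqrt{S_{xx}}},
\qquad
S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2
```

A larger error standard deviation in the numerator or a smaller Sxx in the denominator both make the estimated slope less precise.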
2. Intercept estimation
The intercept is the predicted y-value when x = 0. The ideal situation to estimate the intercept b0 is
when the mean (xbar) = 0.
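The reason can be seen in the standard error of the intercept, written here with \sigma_\varepsilon for the true error standard deviation (notation introduced for this sketch):

```latex
\sigma_{b_0} = \sigma_\varepsilon \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}
```

The term with \bar{x}^2 vanishes when the mean of the x-values is 0, so the standard error is then at its smallest value, \sigma_\varepsilon / \sqrt{n}.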
3. Error variance
We also have to estimate the true error variance (variance around the line). The estimate of the true
error variance is based on the residuals (yi – yhati), which are the prediction errors in the sample. The
estimate of the true error variance based on the sample data is the sum of squared residuals / (n – 2).
We also call this the mean squared error or mean squared residual (in R: the square of the residual standard error):
se^2 = sum of (yi – yhati)^2 / (n – 2)
OR
se^2 = SSE / (n – 2)
In this formula, the degrees of freedom are n – 2, because two degrees of freedom are used up by estimating
the intercept & the slope from the data. By subtracting these 2 degrees of freedom, we make sure that the
estimated error variance (mean squared error) is unbiased and that we won’t be underestimating it.
The square root of this variance estimate is called the sample standard deviation around the regression
line, the standard error of estimate or the residual standard deviation (in R: residual standard error).
Like any other standard deviation, the residual SD may be interpreted by the Empirical Rule: about
95% of the prediction errors/residuals will fall within +- 2 standard deviations of the mean error. In a
least-squares regression model with an intercept the mean error is always 0, so this interval is centered at 0.
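The calculation can be sketched as follows on a small invented data set. The variable names are illustrative; `s2` plays the role of the mean squared error and `s_e` the residual standard deviation.

```python
import math

# Made-up sample for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

# Least-squares fit
xbar = sum(x) / n
ybar = sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = ybar - b1 * xbar

# Residuals and their mean (zero up to rounding error)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
mean_residual = sum(residuals) / n

# Mean squared error uses n - 2 degrees of freedom
sse = sum(e ** 2 for e in residuals)
s2 = sse / (n - 2)
s_e = math.sqrt(s2)

# Count the residuals within 2 residual SDs of zero
within_2sd = sum(1 for e in residuals if abs(e) <= 2 * s_e)

print(f"mean residual = {mean_residual:.6f}")
print(f"mean squared error = {s2:.3f}, residual SD = {s_e:.3f}")
print(f"{within_2sd} of {n} residuals lie within 2 residual SDs of zero")
```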
The estimate of the regression line can be affected by three types of points:
1. Regression outlier/discrepancy = outlier in the y direction, so a very high or very low value
of the dependent variable. This affects the intercept (b0) value slightly.
2. High leverage point = outlier in the x direction, so a very high or very low value of the
independent variable. This affects the regression slope (b1), but not substantially.
3. High influence point = outlier in both the x and y direction, so a very high or very low value of
both the independent and the dependent variable. This alters the slope and twists the line badly.
• To have high influence, a point must first have high leverage!
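The three cases can be compared on a toy example. The base data lie exactly on the line y = x, and one extra point is added each time. All numbers are invented, and `fit` is a hypothetical helper name. The y-outlier is placed exactly at the mean of x, which is why the slope does not move at all in that case.

```python
def fit(xs, ys):
    """Return the least-squares (intercept, slope) for ys on xs."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((xi - xbar) ** 2 for xi in xs)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(xs, ys))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


# Base data lie exactly on the line y = x (intercept 0, slope 1)
base_x = [1, 2, 3, 4, 5]
base_y = [1, 2, 3, 4, 5]

extra_points = {
    "outlier in y only (3, 10)": (3, 10),
    "high leverage, on the line (20, 20)": (20, 20),
    "high leverage and discrepant y (20, 0)": (20, 0),
}

results = {"base data": fit(base_x, base_y)}
for label, (px, py) in extra_points.items():
    results[label] = fit(base_x + [px], base_y + [py])

for label, (b0, b1) in results.items():
    print(f"{label:40s} b0 = {b0:7.3f}  b1 = {b1:7.3f}")
```

Only the last point, which is extreme in x and far from the line in y, changes the slope noticeably.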
The estimates b0, b1, and se are basic in regression analysis. They specify the regression line, and the
probable degree of error associated with y-values for a given value of x. The next step is to use these
sample estimates to make inferences about the true parameters.
C. Inferences about Regression parameters
The slope, intercept and residual SE in a simple regression model are all estimates based on limited
data. This means that they are affected by random error. We can allow for that random error with the
concepts of hypothesis tests and confidence intervals.
F-test
The F-test was designed to test the null hypothesis that all predictors (x) have no value in predicting y.
This means that H0 says that the model has no predictive value at all.
- F = MSR / MSE.
- df1 → df of SSR = k = number of b’s involved in H0 (in simple linear regression: 1)
- df2 → df of SSE = n – (k + 1)
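A minimal sketch of the F statistic for simple linear regression (k = 1), computed on an invented data set. It stops at the statistic itself; comparing it with an F table or computing a p-value is left out.

```python
# Made-up sample for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
k = 1  # number of predictors in simple linear regression

# Least-squares fit
xbar = sum(x) / n
ybar = sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = ybar - b1 * xbar
fitted = [b0 + b1 * xi for xi in x]

# Sums of squares: regression (explained) and error (residual)
ssr = sum((fi - ybar) ** 2 for fi in fitted)
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

# Degrees of freedom and mean squares
df1 = k
df2 = n - (k + 1)
msr = ssr / df1
mse = sse / df2

f_stat = msr / mse
print(f"SSR = {ssr:.3f}, SSE = {sse:.3f}, df1 = {df1}, df2 = {df2}")
print(f"F = MSR / MSE = {f_stat:.3f}")
```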
T-test
If the F-test is significant, we can test individual coefficients with the t-test. The t distribution can be
used to make significance tests and confidence intervals for the true slope and intercept.
The most common use of this test statistic t is shown in the summary. The first two null hypotheses
are one-sided, and the third one is two-sided. In most computer outputs, this test is indicated after the
standard error and labeled as t-value or t-statistic. Often, a p-value is also given, which eliminates the
need for looking up the t-value in a table.
- Remember: in simple linear regression, a two-sided t-test gives the same result as the F-test.
It is also possible to calculate a confidence interval for the true slope. This is an excellent way to
communicate the likely degree of inaccuracy in the estimate of that slope.
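The sketch below computes the t statistic for H0: beta1 = 0 and a 95% confidence interval for the slope on an invented data set. The critical value 3.182 is the two-sided 95% t value for 3 degrees of freedom read from a t table; it is hard-coded as an assumption to keep the example free of extra libraries. The last lines check that t squared equals MSR / MSE for these data.

```python
import math

# Made-up sample for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

# Least-squares fit
xbar = sum(x) / n
ybar = sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = ybar - b1 * xbar

# Residual standard deviation with n - 2 degrees of freedom
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
mse = sse / (n - 2)
s_e = math.sqrt(mse)

# Estimated standard error of the slope and t statistic for H0: beta1 = 0
se_b1 = s_e / math.sqrt(sxx)
t_stat = b1 / se_b1

# 95% confidence interval; t_crit is a table value for df = n - 2 = 3 (assumption)
t_crit = 3.182
lower = b1 - t_crit * se_b1
upper = b1 + t_crit * se_b1

# In simple linear regression, t squared equals F = MSR / MSE
f_stat = (b1 ** 2 * sxx) / mse

print(f"b1 = {b1:.3f}, SE(b1) = {se_b1:.3f}, t = {t_stat:.3f}")
print(f"95% CI for the slope: ({lower:.3f}, {upper:.3f})")
print(f"t^2 = {t_stat ** 2:.3f}, F = {f_stat:.3f}")
```

An interval that contains 0, as it does for these data, matches a two-sided t-test that does not reject H0 at the 5% level.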
D. Predicting New y-values using Regression
The confidence interval estimates the mean response E(y) at a given x-value, so it focuses on current or
past values and it only accounts for model uncertainty.
But what if we want to predict a future individual y-value? Then we need a prediction interval. This
prediction interval is wider than the confidence interval, because it’s harder to predict an individual
y-value than to estimate the mean/expected y-value. This interval accounts for model uncertainty and random error.
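The difference shows up as one extra term in the standard formulas for simple linear regression. The symbols are notation chosen for this sketch: x_{n+1} is the x-value at which we predict, s_\varepsilon is the residual standard deviation, and t_{\alpha/2} is the t table value with n - 2 degrees of freedom.

```latex
% Confidence interval for the mean response E(y) at x_{n+1}
\hat{y}_{n+1} \pm t_{\alpha/2}\, s_\varepsilon
  \sqrt{\frac{1}{n} + \frac{(x_{n+1} - \bar{x})^2}{S_{xx}}}

% Prediction interval for an individual future y at x_{n+1}
\hat{y}_{n+1} \pm t_{\alpha/2}\, s_\varepsilon
  \sqrt{1 + \frac{1}{n} + \frac{(x_{n+1} - \bar{x})^2}{S_{xx}}}
```

The extra 1 under the square root represents the random error of a single observation, which is why the prediction interval is always the wider of the two.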
- Confidence interval → “The average cost E(y) of all resurfacing contracts for 6 miles of road
will be $20,000.”