Chapter 2 The Simple Linear Regression Model
2.1 An Economic Model
We will use lowercase letters, like ‘y’, to denote random variables as well as their values.
Probability density function (pdf) -> describes the probabilities of obtaining various values.
The expected value of a random variable is called its ‘‘mean’’ value, which is really a contraction of
population mean (conditional mean or expected value), the center of the probability distribution of
the random variable. This is not the same as the sample mean, which is the arithmetic average of
numerical values.
Simple regression function (only one explanatory variable): E(y|x) = β1 + β2x.
2.2 An Econometric Model
In order to make the economic model complete we have to make some assumptions. Assumptions
are the ‘if’ part of an ‘if-then’ type statement. If the assumptions we make are true, then certain
things follow. And if the assumptions do not hold, then the conclusions we draw may not hold. Part of
econometric analysis is making realistic assumptions and then checking that they hold.
The dispersion of the values y about their mean is the variance. The basic assumption is that the
dispersion of values y about their mean is the same for all levels of income x -> the constant variance
assumption implies that at each level of income x we are equally uncertain about how far values of y
might fall from their mean value and the uncertainty does not depend on x or anything else. Data
satisfying this condition are said to be homoskedastic (and heteroskedastic when the condition is violated).
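In symbols, using the σ² notation introduced later for the common variance, the constant variance assumption says var(y|x) = σ², the same number for every value of x.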
We have described the sample as random. This description means that when data are collected they
are statistically independent. When describing per person food expenditures of two randomly
selected households, then knowing the value of one of these (random) variables tells us nothing
about the probability that the other will take a particular value or range of values.
In order to carry out a regression analysis, we must make two assumptions about the values of the
variable x. The idea of regression analysis is to measure the effect of changes in one variable, x, on
another, y. In order to do this x must take at least two values within the sample of data. If all the
observations on x within the sample take the same value, say x = $1,000, then regression analysis
fails. Secondly, we will assume that the x-values are given, and not random. All our results will be
conditional on the given x-values.
Finally, it is sometimes assumed that the values of y are normally distributed. The usual justification
for this assumption is that in nature the ‘‘bell-shaped’’ curve describes many phenomena. It is
reasonable, sometimes, to assume that an economic variable is normally distributed about its mean.
It is an ‘‘optional’’ assumption, since we do not need to make it in many cases, and it is a very strong
assumption when it is made.
These ideas together define our econometric model -> collection of assumptions that describe the
data.
(1) = Econometric Model
(2) = Strict Exogeneity
(3) = Conditional Homoskedasticity (the variance is the same for each observation, so that the
model uncertainty is neither greater nor smaller for any observation, and is not directly related
to any economic variable).
(4) = Conditionally Uncorrelated Errors
(5) = Variation in Explanatory Variable
(6) = Error Normality
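In symbols, using the notation developed in the rest of the chapter (the SR labels follow the same numbering), these assumptions can be summarized as:
SR1: y = β1 + β2x + e
SR2: E(e|x) = 0
SR3: var(e|x) = σ²
SR4: cov(ei, ej|x) = 0 for all i ≠ j
SR5: x must take at least two different values in the sample
SR6 (optional): e|x ~ N(0, σ²)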
2.2.1 Introducing the Error Term
The essence of regression analysis is that any observation on the dependent variable y can be
decomposed into two parts: a systematic component and a random component. The systematic
component of y is its conditional mean, E(y|x) = β1 + β2x, which itself is not random since it is a
mathematical expectation. The random component of y is the difference between y and its conditional
mean E(y|x). This is called the random error term, and it is defined as e = y − E(y|x) = y − β1 − β2x. If we
rearrange this we obtain the simple linear regression model y = β1 + β2x + e. Thus E(e|x) = E(y|x) − β1 − β2x = 0 ->
the error term is random, but its conditional mean is zero.
Conditional on x, y and e differ only by a constant (i.e., a term that is not random),
so their variances must be identical and equal to σ². Thus the probability
density functions for y and e are identical except for their location. Notice
that the center of the pdf for e is zero, which is its expected value.
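A small simulation makes this decomposition concrete. The sketch below is a minimal illustration in Python, with invented values for β1, β2 and σ (nothing here comes from the text's food expenditure data): it builds y from its systematic and random parts, checks that the recovered errors have mean near zero and variance near σ², and that, for a single fixed x-value, y and e are equally dispersed.

```python
import numpy as np

rng = np.random.default_rng(123)

# Illustrative values for the unknown parameters (not from the chapter's data)
beta1, beta2, sigma = 80.0, 10.0, 5.0

# x-values treated as fixed ("fixed in repeated samples")
x = np.repeat([10.0, 15.0, 20.0, 25.0], 250)

# Systematic component: the conditional mean E(y|x) = beta1 + beta2*x
mean_y = beta1 + beta2 * x

# Random component: the error e, drawn here with mean 0 and variance sigma^2
e = rng.normal(loc=0.0, scale=sigma, size=x.size)
y = mean_y + e

# If the parameters were known, the errors could be recovered from y
e_recovered = y - (beta1 + beta2 * x)

print(e_recovered.mean())                    # close to 0, i.e. E(e|x) = 0
print(e_recovered.var(ddof=1), sigma ** 2)   # close to sigma^2 = 25
# Conditional on a single x-value, y and e have the same dispersion
print(y[x == 10.0].var(ddof=1), e[x == 10.0].var(ddof=1))
```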
We can now explain the simplifying assumption that x is not random. The
assumption that x is not random means that its value is known. In
statistics such x-values are said to be ‘‘fixed in repeated samples.’’ If we could perform controlled
experiments, the same set of x-values could be used over and over, so that only the outcomes y are
random. As an example, suppose that we are interested in how price affects the number of Big Macs
sold weekly at the local McDonald’s. The franchise owner can set the price (x) and then observe the
number of Big Macs sold (y) during the week. The following week the price could be changed, and
again the data on sales collected. In this case x, the price of a Big Mac, is not random but fixed.
The number of cases in which the x-values are fixed is small in the world of business and economics.
When we survey households we obtain the data on variables like food expenditure per person and
household income at the same time. Thus y and x are both random in this case; their values are
unknown until they are actually observed. However, making the assumption that x is given, and not
random, does not change the results we will discuss. The additional benefit from the assumption is
notational simplicity. Since x is treated as a constant, non-random term, we no longer need the
conditioning notation ‘‘|’’. So, instead of E(e|x) = 0 you will see E(e) = 0. There are some important
situations in which treating x as fixed is not acceptable.
One interesting difference: y is “observable” and e is
“unobservable”. If the regression parameters are
known, then for any value of y we can calculate the
error; so we can separate the fixed and random parts
of y. However, the regression parameters are never
known, and it is impossible to calculate e. What
comprises the error term e? The random error term
represents all factors affecting y other than x. These factors cause individual observations y to differ
from the mean value.
If we have omitted some important factor, or made any other serious specification error, then
assumption SR2 E(e) = 0 will be violated, which will have serious consequences.
2.3 Estimating the Regression Parameters
REMARK: It will be our notational convention to use i subscripts for cross-sectional data observations,
with the number of sample observations being N. For time-series data observations we use the
subscript t and label the total number of observations T. In purely algebraic or generic situations, we
may use one or the other.
2.3.1 The Least Squares Principle
Least squares principle -> asserts that to fit a line to the data values we should make the sum of
the squares of the vertical distances from each point to the line as small as possible. The intercept
and slope of this line, the line that best fits the data using the least squares principle, are b1 and b2.
The vertical distances from each point to the fitted line are the least squares
residuals (the distance between the line and the actual points).
The least squares estimates have the property that the sum of their squared
residuals is less than the sum of squared residuals for any other line. That is, if êi = yi − b1 − b2xi are
the least squares residuals and êi* are the residuals from any other intercept and slope b1* and b2*,
then Σêi² ≤ Σêi*².
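For reference, the standard least squares formulas that achieve this minimum (the expressions the
next paragraph relies on, with x̄ and ȳ denoting the sample means of x and y) are:
b2 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
b1 = ȳ − b2·x̄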
The formula for b2 reveals why we had to assume [SR5]
that the values of x are not all the same within the sample.
If every observation on x took the same value, both the
numerator and the denominator of the formula for b2 would
be zero, so b2 would be mathematically undefined.
The formulas for b1 and b2 are perfectly general and can be
used no matter what the sample values turn out to
be. This should ring a bell -> the sample values, and therefore b1 and b2, are random variables.
- Least squares estimators are general formulas
and are random variables.
- Least squares estimates are numbers that we
obtain by applying the general formulas to the
observed data (see the sketch below).
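A minimal sketch of the estimator-versus-estimate distinction, using a tiny made-up sample (the numbers are purely illustrative, not the text's data): applying the general formulas to these particular observations produces a particular pair of numbers b1 and b2.

```python
import numpy as np

# A hypothetical observed sample (illustrative numbers only)
x = np.array([10.0, 15.0, 20.0, 25.0, 30.0])        # explanatory variable
y = np.array([110.0, 160.0, 250.0, 270.0, 340.0])   # dependent variable

# Apply the general least squares formulas to the observed data
xbar, ybar = x.mean(), y.mean()
b2 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
b1 = ybar - b2 * xbar

print(b1, b2)   # the least squares *estimates* for this particular sample
```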
If we have no observations in the region where
income is zero, then our estimated relationship may
not be a good approximation to reality in that region.
Any time you ask how much a change in one variable
will affect another variable, regression analysis is a
potential tool.
2.3.3a Elasticities
Income elasticity is a useful way to characterize the responsiveness of consumer expenditure to
changes in income. Most commonly elasticity is calculated at
the “point of the means” because it is a representative point
on the regression line.
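At the point of the means, the estimated income elasticity is computed as ε̂ = b2·(x̄/ȳ), the estimated slope multiplied by the ratio of the sample mean of x to the sample mean of y. It gives the approximate percentage change in y associated with a one percent change in x, evaluated at the sample means.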
2.4 Assessing the Least Squares Estimators
The least squares estimates are numbers that may or may not be
close to the true parameter values, and we will never know which. The
reason is this: if we were to collect another sample of data, by choosing another set of 40
households to survey, we would obtain different estimates b1 and b2, even if we
carefully selected households with the same incomes as in the initial sample. This sampling variation
is unavoidable: because the y-values are random variables, their values are not known until the sample is collected.
Consequently, when viewed as an estimation procedure, b1 and b2 are also random variables,
because their values depend on the random variable y. In this context we call b1 and b2 the least
squares estimators.
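The sampling variation described here can be illustrated with a small Monte Carlo sketch in Python. The "true" parameters and the fixed x-values below are invented for the illustration; only the y-values are redrawn from sample to sample, and the estimates b1 and b2 change each time as a result.

```python
import numpy as np

rng = np.random.default_rng(7)

# Invented "true" parameters and a fixed set of x-values (illustrative only)
beta1, beta2, sigma = 80.0, 10.0, 40.0
x = np.linspace(5.0, 30.0, 40)       # 40 "households" with fixed incomes
xbar = x.mean()

b1_draws, b2_draws = [], []
for _ in range(1000):                # 1000 hypothetical repeated samples
    e = rng.normal(0.0, sigma, size=x.size)
    y = beta1 + beta2 * x + e        # only the y-values change across samples
    b2 = np.sum((x - xbar) * (y - y.mean())) / np.sum((x - xbar) ** 2)
    b1 = y.mean() - b2 * xbar
    b1_draws.append(b1)
    b2_draws.append(b2)

# The estimates vary from sample to sample (sampling variation),
# but on average they are centered near the true parameter values.
print(np.mean(b1_draws), np.std(b1_draws))
print(np.mean(b2_draws), np.std(b2_draws))
```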