Week 1
Fitting line to scatter of data
In a scatter diagram with $n$ paired observations $(x_i, y_i)$, $i = 1, \dots, n$, we want to find the line
that gives the best fit to these points. The line is given by $y = a + bx$
Terminology
𝑦: variable to be explained, dependent variable, endogenous variable
𝑥: explanatory variable, independent variable, exogenous variable, regressor, covariate
Deviation
$e_i$ is the error that we make in predicting $y_i$, so $e_i = y_i - a - b x_i$
Ordinary least squares (OLS)
Minimize the sum of squares of the errors, so minimize $S(a, b) = \sum e_i^2$. By solving
$\frac{\partial S}{\partial a} = 0$ and $\frac{\partial S}{\partial b} = 0$, the estimates $a = \bar{y} - b\bar{x}$ and $b = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$
can be found.
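A minimal sketch of these formulas in Python (assuming numpy is available; the data values are made up for illustration):

import numpy as np

# hypothetical example data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# OLS slope and intercept from the closed-form solutions of dS/da = 0, dS/db = 0
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
print(a, b)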
Least squares residuals
Given the observations and the corresponding unique values of $a$ and $b$, we obtain the
residuals $e_i$. These have two properties: $\sum e_i = 0$ and $\sum (x_i - \bar{x}) e_i = 0$, so the mean of the
residuals is 0, and $x_i$ and $e_i$ are uncorrelated
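These two properties can be checked numerically; a small sketch (same hypothetical data and OLS formulas as above):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])            # hypothetical data
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

e = y - a - b * x                                   # OLS residuals
print(np.isclose(e.sum(), 0.0))                     # sum of residuals is 0
print(np.isclose(np.sum((x - x.mean()) * e), 0.0))  # residuals uncorrelated with x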
Sum of squares
From $a = \bar{y} - b\bar{x}$ we get that $y_i - \bar{y} = b(x_i - \bar{x}) + e_i$. Then, the sum of squares of $(y_i - \bar{y})$
is $\sum (y_i - \bar{y})^2 = b^2 \sum (x_i - \bar{x})^2 + \sum e_i^2$ (the cross term $2b \sum (x_i - \bar{x}) e_i$ vanishes because of the
residual properties above). In words: the total sum of squares (SST) equals the
explained sum of squares (SSE) plus the sum of squared residuals (SSR), SST = SSE + SSR
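The decomposition can be verified directly; a sketch with the same hypothetical data:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])             # hypothetical data
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
e = y - a - b * x

SST = np.sum((y - y.mean()) ** 2)            # total sum of squares
SSE = b ** 2 * np.sum((x - x.mean()) ** 2)   # explained sum of squares
SSR = np.sum(e ** 2)                         # sum of squared residuals
print(np.isclose(SST, SSE + SSR))            # SST = SSE + SSR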
Coefficient of determination: $R^2$
The coefficient of determination, denoted by $R^2$, is defined as
$R^2 = 1 - \frac{SSR}{SST} = \frac{SSE}{SST} = \frac{b^2 \sum (x_i - \bar{x})^2}{\sum (y_i - \bar{y})^2} = \frac{\left[\sum (x_i - \bar{x})(y_i - \bar{y})\right]^2}{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}$,
so $R^2$ is equal to the squared correlation coefficient between $x$ and $y$. It holds that $0 \le R^2 \le 1$, and the closer it is to 1, the better the fit
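A sketch checking that $R^2$ equals the squared sample correlation (same hypothetical data):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])             # hypothetical data
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
e = y - a - b * x

SST = np.sum((y - y.mean()) ** 2)
SSR = np.sum(e ** 2)
R2 = 1.0 - SSR / SST                                 # coefficient of determination
print(np.isclose(R2, np.corrcoef(x, y)[0, 1] ** 2))  # equals squared correlation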
Data generating process (DGP)
The data is generated with the equation $y_i = \alpha + \beta x_i + \varepsilon_i$, $i = 1, \dots, n$. The $x$ variables are
fixed, $\alpha$ and $\beta$ are chosen, and $n$ error terms $\varepsilon_i$ are drawn with variance $\sigma^2$. The data points
will then lie around the line $\alpha + \beta x$
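A small sketch of such a data generating process (the values of alpha, beta, sigma and the x grid are arbitrary choices for illustration):

import numpy as np

rng = np.random.default_rng(0)
n = 50
x = np.linspace(0.0, 10.0, n)          # fixed regressor values
alpha, beta, sigma = 1.0, 0.5, 2.0     # chosen (in practice unknown) parameters
eps = rng.normal(0.0, sigma, size=n)   # n error terms with variance sigma^2
y = alpha + beta * x + eps             # points scattered around the line alpha + beta*x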
Random variation
If only the data set $(x_i, y_i)$, $i = 1, \dots, n$ is known, the underlying values of $\alpha$, $\beta$, $\sigma$ and $\varepsilon_i$ are
not known. $a$ and $b$ can be seen as estimators of $\alpha$ and $\beta$. And because $\varepsilon_i$ is random, $y_i$ is
random, so $a$ and $b$ are also random. With only one observed data set, $\mathrm{var}(b)$ can still be computed under
assumptions
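The randomness of $b$ can be illustrated by simulating many data sets from the same DGP and refitting OLS each time; a sketch under the same arbitrary parameter choices as above:

import numpy as np

rng = np.random.default_rng(0)
n, alpha, beta, sigma = 50, 1.0, 0.5, 2.0
x = np.linspace(0.0, 10.0, n)                       # fixed across replications

b_draws = []
for _ in range(10_000):                             # many samples from the same DGP
    y = alpha + beta * x + rng.normal(0.0, sigma, size=n)
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b_draws.append(b)

print(np.var(b_draws))                              # Monte Carlo estimate of var(b)
print(sigma ** 2 / np.sum((x - x.mean()) ** 2))     # theoretical var(b), derived under Statistical properties below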
DGP assumptions
A1. $x_i$ are not random (fixed) with $\sum (x_i - \bar{x})^2 \ne 0$ (the points are not all on a vertical line)
A2. $\varepsilon_i$ are random with $E(\varepsilon_i) = 0$
A3. $\mathrm{var}(\varepsilon_i) = E(\varepsilon_i^2) = \sigma^2$, homoskedastic
A4. $\mathrm{cov}(\varepsilon_i, \varepsilon_j) = E(\varepsilon_i \varepsilon_j) = 0$ for all $i \ne j$, no serial correlation of errors
A5. $\alpha$, $\beta$, $\sigma$ are fixed unknown numbers with $\sigma > 0$
A6. $y_i = \alpha + \beta x_i + \varepsilon_i$, linear model
A7. $\varepsilon_1, \dots, \varepsilon_n$ are jointly normally distributed
Under these assumptions, $y_i \sim N(\alpha + \beta x_i, \sigma^2)$
Notation
$\alpha$, $\beta$, $\sigma$ and $\varepsilon_i$ are unknown; $y_i$ and $x_i$ are known observed data; $a$, $b$ and $e_i$ are known and
derived from $y_i$ and $x_i$
Statistical properties
$b$ can be written as $b = \beta + \sum c_i \varepsilon_i$, where $c_i = \frac{x_i - \bar{x}}{\sum (x_i - \bar{x}) x_i} = \frac{x_i - \bar{x}}{\sum (x_i - \bar{x})^2}$. It follows that $E(b) = \beta$,
so $b$ is an unbiased estimator of $\beta$, and that $\mathrm{var}(b) = \frac{\sigma^2}{\sum (x_i - \bar{x})^2}$.
$a$ can be written as $a = \alpha + \sum d_i \varepsilon_i$, where $d_i = \frac{1}{n} - \frac{\bar{x}(x_i - \bar{x})}{\sum (x_i - \bar{x})^2}$. It follows that $E(a) = \alpha$, so $a$
is an unbiased estimator of $\alpha$
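The substitution step behind these results can be written out; a sketch of the standard derivation for $b$ (using $\sum (x_i - \bar{x}) = 0$ and $\sum (x_i - \bar{x}) x_i = \sum (x_i - \bar{x})^2$):

\begin{align*}
b &= \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}
   = \frac{\sum (x_i - \bar{x})\, y_i}{\sum (x_i - \bar{x})^2}
   && \text{since } \textstyle\sum (x_i - \bar{x})\bar{y} = 0 \\
  &= \frac{\sum (x_i - \bar{x})(\alpha + \beta x_i + \varepsilon_i)}{\sum (x_i - \bar{x})^2}
   = \beta + \sum c_i \varepsilon_i
   && \text{substituting the DGP for } y_i \\
E(b) &= \beta + \sum c_i E(\varepsilon_i) = \beta
   && \text{A1, A2: } x_i \text{ fixed, } E(\varepsilon_i) = 0 \\
\mathrm{var}(b) &= \sum c_i^2 \sigma^2 = \frac{\sigma^2}{\sum (x_i - \bar{x})^2}
   && \text{A3, A4}
\end{align*}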
Best linear unbiased estimator (BLUE)
With estimators, there is a bias-variance tradeoff. An estimator with the smallest MSE is
preferred, where $MSE = E((b - \beta)^2) = \mathrm{var}(b) + (\mathrm{bias})^2$. The Gauss-Markov theorem says that
the OLS estimators $a$ and $b$ are BLUE. Linear means that they are a linear combination of $y_i$,
and best means $\mathrm{var}(b) \le \mathrm{var}(b^*)$ for every LUE $b^*$. This theorem holds if A1-A6 hold.
In the class of LUEs, we cannot do better than OLS, but we can if we allow bias or non-linear
estimators
𝑡-test
In general $b \ne 0$ even if $\beta = 0$, so it needs to be tested whether $\beta = 0$. For that, a $t$-test is
used with $H_0: \beta = 0$ vs $H_1: \beta \ne 0$. $H_0$ is rejected if the OLS $b$ is "far enough from 0".
Under A1-A7, $b \sim N(\beta, \sigma_b^2)$, where $\sigma_b^2 = \frac{\sigma^2}{\sum (x_i - \bar{x})^2}$. If we standardize $b$ we get
$z = \frac{b - \beta}{\sigma / \sqrt{\sum (x_i - \bar{x})^2}} \sim N(0, 1)$. So, we reject $H_0$ if $z$ (computed with $\beta = 0$) is "far enough from 0". With a 5%
significance level, this means if $|z| > 1.96 \Leftrightarrow |b| > 1.96 \cdot \sigma_b$.
$\sigma_b$ is unknown, so $\sigma$ is replaced with $s$; then $t = \frac{b - \beta}{SE(b)} \sim t(n - 2)$, where $SE(b) = \frac{s}{\sqrt{\sum (x_i - \bar{x})^2}}$
(standard error of $b$). So, we reject $H_0$ if $|t| > c \Leftrightarrow |b| > c \cdot SE(b)$. With a 5% significance
level, $c = 2$ can be used
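A sketch of this test in Python (assuming numpy and scipy are available; scipy.stats supplies the $t(n-2)$ critical value, and the data values are hypothetical):

import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])   # hypothetical data
y = np.array([2.3, 2.9, 4.2, 4.8, 6.1, 6.7, 8.2, 8.6])

n = len(x)
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
e = y - a - b * x

s = np.sqrt(np.sum(e ** 2) / (n - 2))             # standard error of the regression
se_b = s / np.sqrt(np.sum((x - x.mean()) ** 2))   # SE(b)
t_stat = b / se_b                                 # t statistic under H0: beta = 0

c = stats.t.ppf(0.975, df=n - 2)                  # exact 5% two-sided critical value
print(t_stat, c, abs(t_stat) > c)                 # reject H0 if |t| > c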
Estimating 𝜎
Use the estimator $s^2 = \frac{1}{n-2} \sum e_i^2$, which is an unbiased estimator of $\sigma^2$. $s$ is the standard error of
the regression. The intuition for $n - 2$ is that $(e_1, \dots, e_n)$ are not independent, because OLS
gives 2 restrictions: $\sum e_i = 0$ and $\sum x_i e_i = 0$. From the first $n - 2$ residuals, the last two follow
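The unbiasedness of $s^2$ (and the role of the $n-2$ correction) can be checked by simulation; a sketch with arbitrary parameter choices:

import numpy as np

rng = np.random.default_rng(0)
n, alpha, beta, sigma = 20, 1.0, 0.5, 2.0
x = np.linspace(0.0, 10.0, n)

s2_draws = []
for _ in range(10_000):
    y = alpha + beta * x + rng.normal(0.0, sigma, size=n)
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    a = y.mean() - b * x.mean()
    e = y - a - b * x
    s2_draws.append(np.sum(e ** 2) / (n - 2))   # s^2 with the n-2 correction

print(np.mean(s2_draws), sigma ** 2)            # average of s^2 is close to sigma^2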
Other methods
The p-value can also be used to test $H_0$. This is the probability, under $H_0$, of obtaining the
observed test value or a value further in the direction of $H_1$. $H_0$ is rejected if the p-value is smaller than the significance level.
A confidence interval can also be used. Take $c$ such that $P(-c \le t \le c) = 0.95$, so $-c \le \frac{b - \beta}{SE(b)} \le c$.
Then the interval is $b - c \cdot SE(b) \le \beta \le b + c \cdot SE(b)$
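Both alternatives can be computed directly; a sketch (hypothetical data, scipy.stats for the $t(n-2)$ distribution):

import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])   # hypothetical data
y = np.array([2.3, 2.9, 4.2, 4.8, 6.1, 6.7, 8.2, 8.6])

n = len(x)
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
e = y - a - b * x
s = np.sqrt(np.sum(e ** 2) / (n - 2))
se_b = s / np.sqrt(np.sum((x - x.mean()) ** 2))
t_stat = b / se_b

p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)   # two-sided p-value under H0: beta = 0
c = stats.t.ppf(0.975, df=n - 2)
ci = (b - c * se_b, b + c * se_b)                 # 95% confidence interval for beta
print(p_value, ci)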
Prediction
The prediction is for $y_{n+1}$. Assume that A1-A7 hold for $(x_i, y_i)$, $i = 1, \dots, n + 1$, but $y_{n+1}$ is
not yet observed. The actual outcome will be $y_{n+1} = \alpha + \beta x_{n+1} + \varepsilon_{n+1}$, but the point
prediction is $\hat{y}_{n+1} = a + b x_{n+1}$.
The forecast error is $f = y_{n+1} - a - b x_{n+1} = (\alpha - a) + (\beta - b) x_{n+1} + \varepsilon_{n+1}$. It holds that
$E(f) = 0$ and $\mathrm{var}(f) = \sigma^2 \left[ 1 + \frac{1}{n} + \frac{(x_{n+1} - \bar{x})^2}{\sum (x_i - \bar{x})^2} \right]$. This variance is higher than the variance $\sigma^2$ of
the errors. That is because $a$ and $b$ are used instead of $\alpha$ and $\beta$
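The claim that $\mathrm{var}(f)$ exceeds $\sigma^2$ can be checked by simulating the forecast error many times; a sketch with arbitrary parameter choices and a hypothetical $x_{n+1}$:

import numpy as np

rng = np.random.default_rng(0)
n, alpha, beta, sigma = 20, 1.0, 0.5, 2.0
x = np.linspace(0.0, 10.0, n)
x_new = 12.0                                             # hypothetical x_{n+1}

f_draws = []
for _ in range(10_000):
    y = alpha + beta * x + rng.normal(0.0, sigma, size=n)
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    a = y.mean() - b * x.mean()
    y_new = alpha + beta * x_new + rng.normal(0.0, sigma)  # actual outcome y_{n+1}
    f_draws.append(y_new - (a + b * x_new))                # forecast error f

var_f = sigma ** 2 * (1 + 1 / n + (x_new - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2))
print(np.var(f_draws), var_f, sigma ** 2)                # var(f) exceeds sigma^2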
Prediction interval
If A1-A7 hold true for $i = 1, \dots, n + 1$, then $\frac{f}{\sqrt{\mathrm{var}(f)}} \sim N(0, 1)$, and $\frac{f}{s_f} \sim t(n - 2)$ with
$s_f^2 = s^2 \left[ 1 + \frac{1}{n} + \frac{(x_{n+1} - \bar{x})^2}{\sum (x_i - \bar{x})^2} \right]$. So, a $1 - \alpha$ prediction interval for $y_{n+1}$ is given by
$(a + b x_{n+1} - c s_f,\ a + b x_{n+1} + c s_f)$, with $c$ such that $P(|t| > c) = \alpha$ when $t \sim t(n - 2)$.
For a 95% interval, $c = 2$ can be used
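A sketch of the prediction interval computation (hypothetical data; scipy.stats gives the exact $t(n-2)$ value of $c$ instead of the rough $c = 2$):

import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])   # hypothetical data
y = np.array([2.3, 2.9, 4.2, 4.8, 6.1, 6.7, 8.2, 8.6])
x_new = 10.0                                              # hypothetical x_{n+1}

n = len(x)
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
e = y - a - b * x
s = np.sqrt(np.sum(e ** 2) / (n - 2))

y_hat = a + b * x_new                                     # point prediction
s_f = s * np.sqrt(1 + 1 / n + (x_new - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2))
c = stats.t.ppf(0.975, df=n - 2)                          # c with P(|t| > c) = 0.05
print(y_hat - c * s_f, y_hat + c * s_f)                   # 95% prediction interval for y_{n+1}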