SUMMARY LECTURES – EMPIRICAL FINANCE
Week 1
An example of a linear regression is:
Wage_i = β0 + β1·Education_i + β2·Experience_i + ε_i
In a linear regression (linear-linear; linear on both sides), if wage is expressed in dollars and education goes up by 1 year, then the wage increases by β1 dollars, given that the amount of experience stays the same.
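A minimal sketch in Python of this wage regression on synthetic data (the data and coefficient values are hypothetical, purely to illustrate the estimation):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
educ = rng.integers(8, 21, n)    # years of education (hypothetical)
exper = rng.integers(0, 31, n)   # years of experience (hypothetical)
wage = 2.0 + 1.5 * educ + 0.4 * exper + rng.normal(0, 5, n)

# OLS of wage on a constant, education and experience
X = sm.add_constant(np.column_stack([educ, exper]))
res = sm.OLS(wage, X).fit()
print(res.params)  # estimates of beta0, beta1 ($ per year of education), beta2
```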
A faster way to write down a linear regression is to use vector notation: collect the constant and the regressors in x_i = (1, x_{1i}, …, x_{Ki})′ and the parameters in β = (β0, β1, …, βK)′. The general linear regression in this notation is thus:
y_i = x_i′β + ε_i,  i = 1, …, N
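In this vector/matrix form, OLS has the standard closed-form solution β̂ = (X′X)⁻¹X′y; a small numpy sketch on hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])  # rows are x_i'
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(0, 0.3, N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # OLS: (X'X)^{-1} X'y
print(beta_hat)  # close to beta_true
```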
To use linear regression (OLS), the variables can be non-linear, but the model must be linear in the parameters. If the model is non-linear in the parameters (for example multiplicative), one can take the ln() of both sides, i.e. of the dependent variable and the regressors, to make it linear in the parameters. If we have ln() both on the left and right hand side (log-log), the interpretation of the betas changes: if an independent variable increases by 1%, then the dependent variable increases by approximately the respective β%, given that the other regressors (e.g., the amount of experience) stay the same.
Taking the ln() can also be handy to rescale the data so that the variance is more constant. This overcomes heteroskedasticity (Week 3). It also helps to make a positively skewed distribution closer to a normal distribution:
ln(Wage_i) = β0 + β1·Education_i + β2·Experience_i + ε_i
In this example (log-linear), if education goes up by 1 year, then the wage increases by approximately β1·100%, given that the amount of experience stays the same.
If we go back to the first example (with just Wage), but with ln(Education) (linear-log), then the beta again has a different interpretation. If education increases by 1%, wage increases by β1/100 dollars (i.e., 1%·β1), again given that the amount of experience stays the same.
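A small numeric check of these interpretations (the β values are hypothetical). Note that 100·β is an approximation; the exact percentage effect in a log-linear model is exp(β) − 1:

```python
import numpy as np

beta1 = 0.08  # hypothetical log-linear coefficient on education

# Log-linear: effect of one extra year of education on wage
print(beta1 * 100)                 # approximate % change: 8.0
print((np.exp(beta1) - 1) * 100)   # exact % change: about 8.33

beta1_linlog = 500.0  # hypothetical linear-log coefficient
print(beta1_linlog / 100)  # $ change in wage for a 1% rise in education: 5.0
```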
If regressors (x-variables) are dummies (categorical variables), they must not be perfectly collinear (perfect multicollinearity). This happens if you can express one regressor as a linear combination of the remaining regressors. You can therefore never include both dummies for a two-category variable (like male and female for sex) and a constant. The solution is to leave one out. If all the regressors are dummies and there is no constant, the betas represent the average value of the dependent variable in each category.
For example:
y_i = β0 + β1·M_i + β2·F_i + ε_i, where M is a male dummy and F a female dummy with values 0 or 1. If we then have 4 participants, with 2 males and 2 females, then the male dummy is 1, 1, 0 and 0. The female dummy is hence 0, 0, 1, 1. If we recall when multicollinearity happens (when you can express one regressor as a linear combination of the remaining regressors), we see that the two dummies always sum to 1. Because β0 (the constant) is always multiplied by 1, the constant's column of ones equals the sum of the two dummy columns, which therefore causes perfect multicollinearity.
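The dummy trap can also be verified numerically; a sketch with the four participants above, showing that the design matrix with a constant and both dummies loses rank:

```python
import numpy as np

M = np.array([1, 1, 0, 0])               # male dummy
F = np.array([0, 0, 1, 1])               # female dummy
X = np.column_stack([np.ones(4), M, F])  # constant + both dummies

print(np.linalg.matrix_rank(X))   # 2 instead of 3: perfect multicollinearity
print(np.all(M + F == X[:, 0]))   # True: the dummies sum to the constant's 1s
```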
As already mentioned, the solution to multicollinearity is to leave out one of the regressors that causes the problem:
y_i = β0 + β1·M_i + ε_i
In this example, if the subject is a female, the β1 term drops out (because the male dummy = 0) → y_i = β0 + ε_i. If it's a male → y_i = β0 + β1 + ε_i → so to calculate the male average salary, you use the female average + what comes extra for males (the top-up β1).
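A sketch confirming the top-up interpretation with hypothetical wages: β̂0 equals the female average and β̂0 + β̂1 the male average:

```python
import numpy as np
import statsmodels.api as sm

wage = np.array([30.0, 34.0, 26.0, 28.0])  # hypothetical wages
M = np.array([1, 1, 0, 0])                 # male dummy; females = reference

res = sm.OLS(wage, sm.add_constant(M)).fit()
print(res.params)           # beta0 = 27.0 (female mean), beta1 = 5.0 (top-up)
print(wage[M == 0].mean(), wage[M == 1].mean())  # 27.0, 32.0
```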
These dummies are called level dummies because they only change the intercept of the
regression. We also have slope dummies, which change the slope of the regression. To illustrate
level and slope dummies, imagine that we want to explain ln(HousePrc_i) by various factors like the province in which the house is located and the square meters (SM). There are 5 provinces and therefore the dummy variables are D_i^j with j = 1…5.
The correct linear regression is → ln(HousePrc_i) = α + Σ_{j=2}^{5} β_j·D_i^j + β_6·SM_i + ε_i; notice that we exclude one dummy to avoid multicollinearity. The α captures the category that is left out, in this case the first province. The interpretation of, for example, β2 is that a house from province 2 is approximately β2·100% more expensive than an otherwise identical house from province 1, given that the square meters stay the same. β2 is a top-up effect on the reference category (whose level is captured by α), and we multiply by 100% because it is a log-linear function (see table above).
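A sketch of how this regression could be set up in Python (data and numbers hypothetical; pandas' get_dummies with drop_first=True drops the reference province automatically):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 200
province = pd.Series(rng.integers(1, 6, n), name="province")  # provinces 1..5
sq_m = rng.uniform(40, 200, n)                                # square meters
ln_price = 11 + 0.004 * sq_m + rng.normal(0, 0.1, n)          # hypothetical DGP

dummies = pd.get_dummies(province, prefix="prov", drop_first=True).astype(float)
X = sm.add_constant(dummies.assign(sm=sq_m))
res = sm.OLS(ln_price, X).fit()
print(res.params)  # alpha, beta2..beta5 (top-ups vs province 1), beta6 for SM
```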
Now imagine we want to find out if the impact of one extra m² is the same for all provinces (slope dummies).
Then we get → ln(HousePrc_i) = α + Σ_{j=2}^{5} β_j·D_i^j + β_6·SM_i + Σ_{j=2}^{5} γ_j·D_i^j·SM_i + ε_i. Here we also exclude one slope dummy (the interaction for the first province) to avoid multicollinearity. Again, the γ's are a top-up effect on the reference category: the slope for the first province is captured by β_6·SM_i, so the effect of one extra m² on the housing price in province 2 is (β_6 + γ_2)·100%, given that the rest stays the same.
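A sketch extending the setup above with the interaction (slope-dummy) terms D_i^j·SM_i (again with hypothetical data and names):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 200
province = pd.Series(rng.integers(1, 6, n), name="province")
sq_m = rng.uniform(40, 200, n)
ln_price = 11 + 0.004 * sq_m + rng.normal(0, 0.1, n)

dummies = pd.get_dummies(province, prefix="prov", drop_first=True).astype(float)
X = dummies.assign(sm=sq_m)
for col in dummies.columns:          # slope dummies: D_j * SM, j = 2..5
    X[col + "_x_sm"] = dummies[col] * sq_m

res = sm.OLS(ln_price, sm.add_constant(X)).fit()
print(res.params)  # gamma_j: extra effect of one m^2 in province j vs province 1
```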
Assessing model adequacy
R² gives the percentage of variation in the dependent variable that is explained by the regression model. In other words, how well the fitted values match the true values. The fitted values are the values on the fitted regression line. R² never decreases if you add a variable. However, we prefer models with fewer variables. For that, we use the adjusted R², which only rewards new variables if they enter with sufficiently high t-values (absolute t-value above 1).
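For reference, the standard definitions (N observations, K parameters including the constant):

$$R^2 = 1 - \frac{SSR}{TSS}, \qquad \bar{R}^2 = 1 - \frac{N-1}{N-K}\,(1 - R^2)$$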
Generalizations of the adjusted R² are model selection criteria, which are useful to compare different models. The lower these criteria, the better the model:
AIC = ln(SSR/N) + 2·(K/N)
BIC = ln(SSR/N) + (K/N)·ln(N)
As we can see, the AIC has 2(K/N), while the BIC has (K/N)ln(N), so it replaces the 2 with ln(N). Because ln(N) is higher than 2 for N ≥ 8 observations, the BIC punishes the number of parameters (K) much more heavily than the AIC. K is the number of parameters including the constant, N is the number of observations, and SSR = Σ_{i=1}^{N} ε̂_i² is the sum of squared residuals.
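A minimal sketch computing the two criteria from the SSR (using the formulas above; comparisons are only meaningful across models fitted on the same data):

```python
import numpy as np

def aic_bic(ssr, n, k):
    """SSR-based AIC and BIC as defined above."""
    aic = np.log(ssr / n) + 2 * k / n
    bic = np.log(ssr / n) + k * np.log(n) / n
    return aic, bic

# Hypothetical nested models on the same N = 100 observations
print(aic_bic(ssr=250.0, n=100, k=3))  # smaller model
print(aic_bic(ssr=248.0, n=100, k=4))  # tiny SSR gain; the penalty dominates
```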
Outliers
In the real world, having no outliers is impossible. With vertical outliers (observations with an extreme y-value but an ordinary x-value), the slope changes only slightly, so they are not that bad.
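A quick illustration of that claim on synthetic data: a single vertical outlier, extreme in y but ordinary in x, barely moves the OLS slope:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = np.linspace(0, 10, 50)
y = 1 + 2 * x + rng.normal(0, 0.5, 50)

y_out = y.copy()
y_out[25] += 30                    # vertical outlier in the middle of the x-range

clean = sm.OLS(y, sm.add_constant(x)).fit()
dirty = sm.OLS(y_out, sm.add_constant(x)).fit()
print(clean.params, dirty.params)  # intercept shifts; slope changes only slightly
```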