Week 1: OLS, Dummies, model adequacy, assumptions and inference/testing.
Goals for the week:
• Understand what the classical linear regression model is and how it can be used in empirical
finance.
• Know key concepts: estimation, inference, estimator, estimate, parameters, dummy
variables, outliers.
• Understand what assumptions are needed for valid inference in the CLRM and why.
• Understand the t and F tests, and model adequacy measures such as the R-squared and
adjusted R-squared.
The model and notation (vectors and matrices)
When we write down a model, we can use vector notation. We gather the parameters of the
model into a parameter vector $\beta$ and the individual characteristics (regressors) of
observation $i$ into a vector $x_i$.
With these vectors, the model can be written compactly in the first standard notation:
$y_i = x_i'\beta + \varepsilon_i$
If we additionally stack all observations, we obtain the second standard notation
(matrix notation) for the linear regression model:
$y = X\beta + \varepsilon$
Linear regression, transforming variables:
A linear regression model needs to be linear in the parameters.
Taking the log of a variable is a useful transformation: it can make a relationship linear, reduce
skewness, or mitigate problems such as heteroskedasticity. It helps to plot the data first to check
whether it is skewed, shows heteroskedasticity or follows a non-linear relationship, so you can judge
whether a log transformation is needed.
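As a rough illustration of the skewness point, the sketch below draws a hypothetical right-skewed variable from a log-normal distribution (the data, seed and sample size are assumptions, not from the course material) and compares the sample skewness before and after taking logs.

```python
import math
import random


def skewness(values):
    """Sample skewness: third central moment divided by the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical right-skewed variable (think of firm size), drawn from a log-normal distribution.
raw = [random.lognormvariate(0.0, 1.0) for _ in range(5000)]
logged = [math.log(v) for v in raw]

skew_raw = skewness(raw)
skew_log = skewness(logged)
print(f"skewness before log: {skew_raw:.2f}, after log: {skew_log:.2f}")
```

The raw series should show a clearly positive skewness, while the logged series should be close to symmetric. With real data you would still plot the variable first, as suggested above.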
When you take the log in a regression, the interpretation of the parameters changes from an absolute
interpretation to a relative interpretation:
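The original overview is not reproduced here. As a reminder, the standard textbook interpretations for a single regressor $x$ are sketched below; the symbols $\beta_0$, $\beta_1$ and $\varepsilon$ are notation assumed for this illustration, and the percentage interpretations are approximations that hold for small changes.

```latex
\[
\begin{array}{lll}
\text{Model} & \text{Specification} & \text{Interpretation of } \beta_1 \\
\hline
\text{level-level} & y = \beta_0 + \beta_1 x + \varepsilon & \Delta y = \beta_1 \,\Delta x \\
\text{log-level} & \ln y = \beta_0 + \beta_1 x + \varepsilon & \%\Delta y \approx 100\,\beta_1 \,\Delta x \\
\text{level-log} & y = \beta_0 + \beta_1 \ln x + \varepsilon & \Delta y \approx (\beta_1 / 100)\,\%\Delta x \\
\text{log-log} & \ln y = \beta_0 + \beta_1 \ln x + \varepsilon & \%\Delta y \approx \beta_1 \,\%\Delta x
\end{array}
\]
```

In the log-log case $\beta_1$ is an elasticity: a 1% change in $x$ goes together with roughly a $\beta_1$% change in $y$.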
Dummy variables can be used to disentangle the effect of different groups of observations (e.g. male
vs female) on a dependent variable. A level dummy leads to a variation in the level of the dependent
variable between groups (the ‘average’) and a slope dummy leads to a variation in the impact of an
independent variable on the dependent variable (it is an interaction effect between the dummy and
another independent variable). Be aware of the dummy trap: you cannot include all dummy categories
together with an intercept, since that creates perfect multicollinearity. You therefore need to
either include all dummy categories and no intercept, or leave out one category as the reference
(its effect is then absorbed into the intercept).
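The dummy trap can be made concrete with a small sketch. The six-person sample is hypothetical, and the helper functions `gram` and `det` are written just for this illustration. With an intercept and both dummies, X'X has determinant zero, so it cannot be inverted and OLS has no unique solution. Either of the two fixes restores a non-zero determinant.

```python
def gram(columns):
    """Return X'X for a design matrix given as a list of columns."""
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in columns] for ci in columns]


def det(m):
    """Determinant by cofactor expansion along the first row (fine for tiny matrices)."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for j, value in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * value * det(minor)
    return total


# Hypothetical sample: 1 = male, 0 = female.
male = [1, 0, 0, 1, 1, 0]
female = [1 - m for m in male]
intercept = [1] * len(male)

# Dummy trap: intercept = male + female in every row, so X'X is singular.
det_trap = det(gram([intercept, male, female]))

# Fix 1: keep the intercept and drop one category (female is the reference).
det_reference = det(gram([intercept, male]))

# Fix 2: keep both dummies and drop the intercept.
det_no_intercept = det(gram([male, female]))

print(det_trap, det_reference, det_no_intercept)  # 0 9 9
```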
Take a level dummy as an example: to explain the average salary of females vs males, we can use
females as the reference category. The intercept then contains the average female salary and
beta 1 contains the male top-up. You can see the other examples below:
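A minimal numerical check of the level-dummy example, using made-up salaries (in thousands) and the simple-regression formulas for a model with an intercept and one male dummy. The variable names and numbers are assumptions for illustration only.

```python
from statistics import mean

# Hypothetical salaries (in thousands); male = 1, female = 0 (reference category).
salary = [40.0, 44.0, 42.0, 50.0, 54.0, 52.0]
male = [0, 0, 0, 1, 1, 1]

# OLS for salary_i = beta0 + beta1 * male_i + error_i, via the simple-regression formulas.
d_bar = mean(male)
y_bar = mean(salary)
beta1 = (sum((d - d_bar) * (y - y_bar) for d, y in zip(male, salary))
         / sum((d - d_bar) ** 2 for d in male))
beta0 = y_bar - beta1 * d_bar

# Group averages for comparison.
female_mean = mean([y for y, d in zip(salary, male) if d == 0])
male_mean = mean([y for y, d in zip(salary, male) if d == 1])

print(beta0, female_mean)             # intercept equals the average female salary
print(beta1, male_mean - female_mean)  # beta 1 equals the male top-up
```

In this toy sample the intercept is 42 (the female average) and beta 1 is 10 (the gap between the male average of 52 and the female average).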
We can also give an example of a slope dummy in practice. When we think that the effect of age on
the CEO salary differs between industries, we can use the model below to test our hypotheses.
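The original model is not reproduced here. One common way to write such a specification, assuming just two industries and a hypothetical dummy $D_i$ that equals 1 when the CEO's firm is in the industry of interest and 0 otherwise, is:

```latex
\[
\text{salary}_i = \beta_0 + \beta_1\,\text{age}_i + \beta_2\,D_i
                + \beta_3\,(D_i \times \text{age}_i) + \varepsilon_i
\]
\[
\frac{\partial\,\text{salary}_i}{\partial\,\text{age}_i} =
\begin{cases}
\beta_1 & \text{if } D_i = 0 \text{ (reference industry)} \\
\beta_1 + \beta_3 & \text{if } D_i = 1
\end{cases}
\]
```

Here $\beta_2$ is the level dummy and $\beta_3$ the slope dummy. Under these assumptions, the hypothesis that the effect of age is the same in both industries is $H_0: \beta_3 = 0$, which can be checked with a t-test.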
Assessing model adequacy: R-squared, adjusted R-squared ($\bar{R}^2$), AIC, BIC and outliers
R-squared measures the fraction of the variance in the dependent variable that the model explains.
However, R-squared never decreases when you add a variable, so the adjusted R-squared may be the
better measure. It includes a penalty for adding variables that explain little additional variance
in the dependent variable (they have low t-stats).
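The penalty can be seen in a small sketch. It assumes the common definition of adjusted R-squared, 1 - (1 - R²)(n - 1)/(n - k), where k counts all estimated coefficients including the intercept; the course may use a slightly different convention for k. The R-squared values and sample size are made up.

```python
def adjusted_r_squared(r_squared, n_obs, n_params):
    """Adjusted R-squared; n_params counts all coefficients, including the intercept."""
    if n_obs <= n_params:
        raise ValueError("need more observations than parameters")
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_params)


# Hypothetical case: adding a weak regressor lifts R-squared only from 0.500 to 0.502.
before = adjusted_r_squared(0.500, n_obs=50, n_params=3)
after = adjusted_r_squared(0.502, n_obs=50, n_params=4)

print(f"adjusted R-squared before: {before:.4f}, after: {after:.4f}")
```

R-squared rises slightly, but the adjusted R-squared falls (from about 0.479 to about 0.470) because the small gain in fit does not make up for the extra parameter.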
Another way to assess the quality of a model is to look at so-called model selection criteria, for
example the AIC, HQ and BIC. These measures combine the squared residuals of your model, the
number of parameters and the number of observations into a goodness-of-fit measure.
The measures are shown below; when the model is adequate they will be rather low
(the squared residuals are low, thus the log of these will also be low). The term at the end introduces