Summary Methods of Empirical Analysis
Module 1 – Introduction:
Empirical analysis = find useful patterns in data.
The four V’s of Big Data: volume (scale), variety (different forms), velocity (analysis of streaming
data), veracity (uncertainty of data).
Data science = hacking skills + math & statistical knowledge + substantive expertise
To see the effect of an independent variable on a dependent variable we use ordinary least
squares (OLS) regression. It tells us how independent variables are related to some dependent
variable: it is a description of the linear relationship between the variables.
We cannot know the theoretical relationship; we can only estimate the empirical relationship,
and therefore we include an error term, because variation occurs naturally:
ŷ = b0 + b1x
yi = b0 + b1xi + êi
There is a theoretical model, predicting the Q’s, and we have actual observations (the P’s). We
extend the model to y = β0 + β1x + e to account for such deviations, with e being the error term.
In reality, we don’t know the theoretical relationship (the Q’s); we use our observations (the P’s)
to approximate the theoretical relationship. This is called the estimated model. Differences
between observed values and estimated values are called residuals. Thus: the error term is
defined as the difference between the actual observation and the non-random component
(β0 + β1x) of the theoretical relationship, while the residuals are defined as the differences
between the actual observation and the estimated values (ŷ = b0 + b1x). We use these residuals
to test whether the assumptions are met, to determine the goodness-of-fit of the model and to
calculate the likelihood that the model coefficients are different from zero.
The assumptions of OLS:
1. All variables must be measured at interval level and without error;
2. For each value of the independent variables, the expected error term should be 0;
3. Homoscedasticity: the variance of the error term is independent of x;
4. There is no autocorrelation (the error terms are not correlated);
5. Each independent variable is uncorrelated with the error term. If violated, we have
omitted variable bias;
6. There is no multicollinearity (you cannot explain one IV with another IV);
7. The conditional errors are normally distributed: ei | Xi ~ N(0, σ²).
Two additional assumptions:
8. The values of Y are linearly dependent on the predictors (IVs);
9. The parameters of the model have the same value for each individual (observation).
The OLS-regression line is the line for which the sum of the squared residuals is minimized. This
is the Least Squares Principle (LSP): it determines the model coefficients b such that the sum of
squared residuals is as small as possible.
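A minimal sketch of the Least Squares Principle in Python (the data and variable names are
hypothetical, my own illustration): the coefficients b0 and b1 are chosen so that the sum of
squared residuals is as small as possible.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 3.0 + 0.5 * x + rng.normal(size=50)      # "true" relationship plus natural variation

X = np.column_stack([np.ones_like(x), x])    # design matrix: intercept column and x
b, *_ = np.linalg.lstsq(X, y, rcond=None)    # least squares estimates of b0 and b1

residuals = y - X @ b                        # observed values minus fitted values
print("b0, b1:", b, "| sum of squared residuals:", np.sum(residuals ** 2))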
In a linear regression model that satisfies the OLS assumptions, the least squares estimator is
the Best Linear Unbiased Estimator (BLUE) of the model coefficients.
Best = smallest variance
Unbiased = no systematic error: the expected value of the parameter estimated by the model is
equal to its population value.
This BLUE-ness is established by the Gauss-Markov theorem.
With residual analysis we check what our model looks like:
1. Global evaluation of the model;
2. Determine the role of individual cases;
3. Check trustworthiness of statistical test outcomes.
We can use graphical instruments and numerical instruments (statistics that indicate the
presence of outliers and influential cases; indicators of dependencies among independent
variables). It is best to combine the two.
Graphical instruments:
- Plots
o Scatterplot → displays the association between two variables;
o Partial plot → displays the association between two variables while controlling for
the other variables in your model.
- Histogram → shows the density function and tells whether the data are normally distributed.
It is not a problem if your data are not normally distributed, as long as your error term is
normally distributed (see the sketch below).
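A minimal sketch in Python (matplotlib and statsmodels, with hypothetical data of my own) of
two of these graphical instruments: a scatterplot of the two variables and a histogram of the
residuals, which is what matters when judging normality of the error term.

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, size=200)
y = 2.0 + 0.8 * x + rng.normal(size=200)

results = sm.OLS(y, sm.add_constant(x)).fit()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(x, y)                        # scatterplot: association between the two variables
ax1.set(xlabel="x", ylabel="y", title="Scatterplot")
ax2.hist(results.resid, bins=20)         # histogram: is the error term roughly normal?
ax2.set(xlabel="residual", title="Histogram of residuals")
plt.show()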
Numerical instruments:
- Leverage (Lever) → how far removed is one value of the independent variable from all the
other values of this variable? Thus: how far is an individual value removed from the mean;
- Mahalanobis distance (Mahal) → does the same;
- Cook’s distance D or DfFit → estimate all the parameters with and without the value that is
the potential outlier. This is the most important measure for identifying influential cases.
These methods check the dispersion of the variables. There are also commands to look at the
residuals (like ZRESID, SDRESID, etc.).
Outliers are cases extremely far away from the mean; influential cases will change the outcome
of the model.
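A minimal sketch in Python (statsmodels, with hypothetical data of my own) of the numerical
instruments above: leverage, Cook’s distance D and DfFit, used here to flag a deliberately
planted influential case.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
x = rng.normal(size=100)
y = 1.0 + 2.0 * x + rng.normal(size=100)
x[0], y[0] = 6.0, 30.0                        # plant one potentially influential case

results = sm.OLS(y, sm.add_constant(x)).fit()
influence = results.get_influence()

leverage = influence.hat_matrix_diag          # "Lever": distance of the x-value from the mean
cooks_d, _ = influence.cooks_distance         # Cook's distance D
dffits, _ = influence.dffits                  # DfFit

print("most influential case:", np.argmax(cooks_d))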
We need to test the assumptions described above:
1. Variables must be measured at interval level and without measurement error. The points
should be perfectly on the line. Error in X is difficult to correct; error in Y is not
problematic, because it is captured in the error term.
2. The mean value of the error term is 0 for each X value. If violated, the relationship is not
linear; more generally speaking, a predictor is missing.
3. Residuals are homoscedastic. Heteroscedasticity: for example, as age (X) increases, the
spread of the residuals increases. Problem: we overestimate the precision of the effect (the
standard errors are distorted); the model is no longer BLUE, but only LUE. You can detect
this with an inspection of the plots and with the Breusch-Pagan test or the White test (see
the sketch after this list). Solution: use a weighted/generalized least squares estimator
(weighted least squares: observations with smaller variance count more heavily) or do the
test without the distorted standard errors: robust standard errors.
4. The residuals are not correlated, no autocorrelation. If violated, the cause of the problem
is often that an important predictor is missing, or that there is a cluster sample. The
solution for this is multilevel modelling.
5. Each independent variable is uncorrelated with the error term. If not, there is
specification error, the model is not correctly specified. This is often violated without
knowing it: how do you know that a variable is missing?
6. No independent variable is perfectly (nor approximately) linearly related to one or more
of the other independent variables in the model. If this is violated and there is an almost
linear relation between explanatory variables, we call this multicollinearity. The
consequence is that the standard errors will be larger than they should be. You can
detect it by looking at correlations, the VIF or the tolerance score (TOL = 1/VIF). A VIF greater
than 5-10 or a TOL smaller than 0.2-0.1 indicates multicollinearity (see the sketch after this
list). Solutions for
multicollinearity: add new information (increase sample size) or delete one of the
involved variables.
7. Residuals are normally distributed for each X value. However, the larger your N becomes,
the less problematic a violation of this assumption is.
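The sketch referred to in points 3 and 6 above, in Python (statsmodels, with hypothetical data
of my own): the Breusch-Pagan test for heteroscedasticity, VIF scores for multicollinearity, and
robust standard errors as one possible remedy.

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
n = 200
age = rng.uniform(20, 70, size=n)
income = 1000 + 30 * age + rng.normal(scale=age, size=n)   # error variance grows with age

X = sm.add_constant(np.column_stack([age, age ** 2]))      # age and age^2 are strongly related
results = sm.OLS(income, X).fit()

bp_stat, bp_pvalue, _, _ = het_breuschpagan(results.resid, results.model.exog)
print("Breusch-Pagan p-value:", bp_pvalue)                 # small p-value -> heteroscedasticity

vifs = [variance_inflation_factor(X, i) for i in range(1, X.shape[1])]
print("VIF per predictor:", vifs)                          # VIF > 5-10 signals multicollinearity

robust = results.get_robustcov_results(cov_type="HC3")     # robust standard errors
print(robust.summary())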
So, to summarize, there are a few possible solutions when you detect problems in your data:
- Remove cases
You remove cases from your dataset and treat them as if they were never there. This can be
necessary if individual cases have a disproportionately large influence on the outcome of the
analysis. However, it is not needed with large datasets (>500 cases), because the influence of an
individual case is then generally negligible. Remember: only influential cases need to be
removed, not outliers. Also, don’t remove more than one influential case at a time.
- Transform variables
Be very careful with changing the dependent variable, because this influences the coefficients of
all x-variables. If the relationship is in reality not linear, add regressors (e.g. powers of x) as new
variables to the model to get a better description of the relationship. This is called polynomial
regression (see the sketch after this list).
- Add new explanatory variables to the model
- Use other estimation techniques (robust)
- Remove variables or increase sample size (to overcome multicollinearity)
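The sketch referred to under “Transform variables”, in Python (statsmodels, with hypothetical
data of my own): polynomial regression simply adds powers of x as extra regressors when the
relationship is not linear.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, size=150)
y = 1.0 + 0.5 * x + 2.0 * x ** 2 + rng.normal(size=150)    # quadratic relationship

X = sm.add_constant(np.column_stack([x, x ** 2]))          # include both x and x^2
results = sm.OLS(y, X).fit()
print(results.params)                                      # estimates for b0, b1, b2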
Dummy variables:
Use dummies if your data are not measured at interval or ratio level. Create a dummy for every
category, coded 0 = not present and 1 = present. One dummy must be left out of the model; this
is the reference category, and the coefficient of each included dummy is interpreted relative to it
(see the sketch below for an example). Instead of defining dummies with binary/dummy coding,
one can also use effect coding (1, 0, -1) or contrast coding.
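A minimal sketch in Python (pandas and statsmodels, with hypothetical data of my own) of
dummy coding with a reference category; the formula interface builds the dummies and drops
one category automatically.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "region": rng.choice(["north", "east", "south"], size=120),
    "income": rng.normal(2000, 300, size=120),
})

# C(region) creates a dummy per category and drops the first one ("east", alphabetically)
# as the reference category; each coefficient is the difference from that reference.
model = smf.ols("income ~ C(region)", data=df).fit()
print(model.params)

# Effect coding (1, 0, -1) is also possible, e.g. C(region, Sum) in the formula.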