Summary: Everything you need to know about modules 1, 2, 4 and 5 of Methods of Empirical Analysis
Course
Methods of Empirical Analysis (MANMEC027)
Institution
Radboud Universiteit Nijmegen (RU)
Within this document you will find everything you need to know to be prepared for the exam of Methods of Empirical Analysis. It includes handy lists, short summaries of important literature and lectures, warnings, output in R and much more.
1. All variables must be measured at interval level and without measurement error.
Measurement error in Y is not problematic, as it is absorbed by the error term; measurement error in X is, and it leads to underestimating the coefficient of that X variable. The real remedy is better data collection.
The assumption is also violated with nominal or ordinal data; address this by adding dummies.
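The attenuation caused by measurement error in X can be illustrated with a small simulation. This is a minimal sketch, not from the course material: the data, the true slope of 2 and all variable names are made up for illustration, and the OLS slope is computed with the textbook simple-regression formula.

```python
# Sketch: measurement error in X attenuates the estimated slope toward 0.
# Toy simulation; the true model y = 2*x + noise is an assumption for this demo.
import random

random.seed(42)

def ols_slope(x, y):
    """Simple-regression OLS slope: sum((x-mx)(y-my)) / sum((x-mx)^2)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

n = 1000
x = [random.gauss(0, 1) for _ in range(n)]
y = [2 * xi + random.gauss(0, 0.5) for xi in x]
x_noisy = [xi + random.gauss(0, 1) for xi in x]  # measurement error added to X

slope_clean = ols_slope(x, y)        # close to the true slope of 2
slope_noisy = ols_slope(x_noisy, y)  # attenuated, well below 2
print(slope_clean, slope_noisy)
```

Note that error in Y would only inflate the residual variance, not bias the slope, which is why the text treats it as unproblematic.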
2. The mean value of the error term is 0 for each value of X.
Not really a problem, as the estimation procedure draws the line such that the mean of the error term is 0. There is therefore no constant over- or underestimation, no systematically positive or negative average residual.
3. Error terms are homoscedastic.
The variance of the error term should be the same for each value of X.
Consequence if violated: LUE instead of BLUE; the standard errors of the parameters are biased and statistical tests are therefore not reliable.
You can detect heteroscedasticity by inspecting a residual plot or using a Breusch-Pagan test.
The null hypothesis (H0): there is no heteroscedasticity.
The alternative hypothesis (H1): there is heteroscedasticity.
If the p-value is lower than an alpha of 0.05, H0 is rejected: there is heteroscedasticity.
Solutions if heteroscedasticity is detected:
o Robust standard errors (White's heteroscedasticity-consistent). These add an extra margin of error, so significance is detected less easily.
o Generalized least squares estimator. We tell the model how the variance changes with the values of X (for example: the larger X is, the larger the variance becomes). The model is then BLUE again.
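The idea behind the Breusch-Pagan test can be sketched in a few lines: regress the squared residuals on X and compute the LM statistic n·R² of that auxiliary regression, to be compared against a chi-square critical value. The data below are simulated for illustration (errors whose spread grows with X); all names are made up, and only the one-predictor case is shown.

```python
# Sketch of the Breusch-Pagan idea with one predictor:
# 1) fit OLS, 2) regress squared residuals on X, 3) LM = n * R^2.
import random

random.seed(1)

def ols_fit(x, y):
    """Return (intercept, slope) of a simple OLS regression of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / \
        sum((a - mx) ** 2 for a in x)
    return my - slope * mx, slope

def r_squared(x, y):
    """R^2 of a simple regression of y on x: 1 - SSres/SStot."""
    a, b = ols_fit(x, y)
    my = sum(y) / len(y)
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

n = 500
x = [random.uniform(1, 10) for _ in range(n)]
# Heteroscedastic errors: the error standard deviation grows with x.
y = [3 + 2 * xi + random.gauss(0, xi) for xi in x]

a, b = ols_fit(x, y)
resid_sq = [(yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)]
lm = n * r_squared(x, resid_sq)  # compare to chi-square(1): 3.84 at alpha=0.05
print(lm)
```

With the simulated heteroscedastic errors the LM statistic comes out far above 3.84, so H0 (no heteroscedasticity) is rejected, matching the decision rule stated above.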
4. Error terms are not correlated (no autocorrelation).
You should not be able to predict the next error term from the previous one.
Likely causes: a missing predictor, or cluster sampling (having values from one class and from another without taking that into account; the children could have had different teachers or better education).
Solution for cluster sampling: a multilevel model.
Note: time series data often show autocorrelation. That needs to be addressed; see module 2.
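The summary does not name a test here, but a common check for first-order autocorrelation in time series residuals is the Durbin-Watson statistic: values near 2 suggest no autocorrelation, values near 0 suggest positive autocorrelation. A minimal sketch with simulated residual series (illustrative data, not from the course):

```python
# Sketch: Durbin-Watson statistic DW = sum((e_t - e_{t-1})^2) / sum(e_t^2).
import random

random.seed(0)

def durbin_watson(resid):
    num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    return num / sum(e ** 2 for e in resid)

white = [random.gauss(0, 1) for _ in range(200)]  # independent residuals
trending = [0.05 * t for t in range(200)]         # each residual predicts the next

print(durbin_watson(white))     # typically near 2
print(durbin_watson(trending))  # near 0: strong positive autocorrelation
```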
5. Each independent variable is uncorrelated with the error term (no omitted variable bias).
Causes: the functional form is wrong (assuming linearity where there is none, or assuming a direct effect where there is an interaction effect), or omitted variable bias. The first two are theory-based; the third can occur simply because you have no data on that variable. If the omitted variable is correlated with both the dependent variable and an independent variable, the coefficient of that independent variable will be biased: it absorbs part of the omitted variable's effect and so misrepresents the real effect.
6. No independent variable is perfectly linearly related to one or more of the other independent variables in the model (multicollinearity).
Such a relation between independent variables increases the standard errors of the coefficients. The estimates thus become less precise and may be sensitive to adding a few new observations (note that R² remains the same).
You can detect multicollinearity by looking at the correlation between two independent variables, but there could also be correlation among more than two of them. It is therefore better to use VIF/TOL.
o This means regressing each independent variable on all other independent variables. A high R² then indicates multicollinearity. The VIF is calculated as 1/(1 - R²), so the higher the R², the higher the VIF. A VIF greater than 5-10 indicates multicollinearity. Related is TOL = 1/VIF, so a TOL of 0.2-0.1 would indicate multicollinearity.
Solutions: increase the sample size, or delete one of the involved variables (with a set of dummies, remove one of the dummies involved).
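The VIF recipe above can be sketched directly. This is an illustrative simulation with made-up variable names: with only two predictors the auxiliary regression is a simple regression, so its R² is enough to compute VIF = 1/(1 - R²) and TOL = 1/VIF.

```python
# Sketch: VIF of a predictor from the R^2 of regressing it on the others.
import random

random.seed(3)

def r_squared(x, y):
    """R^2 of a simple OLS regression of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / \
        sum((a - mx) ** 2 for a in x)
    intercept = my - slope * mx
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

n = 300
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [xi + random.gauss(0, 0.2) for xi in x1]  # nearly collinear with x1
x3 = [random.gauss(0, 1) for _ in range(n)]    # unrelated predictor

vif_collinear = 1 / (1 - r_squared(x2, x1))  # far above the 5-10 warning range
vif_unrelated = 1 / (1 - r_squared(x3, x1))  # close to 1
tol_collinear = 1 / vif_collinear            # correspondingly below 0.2
print(vif_collinear, vif_unrelated, tol_collinear)
```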
7. Error terms are normally distributed for each value of X.
Not very important, as the estimates remain rather robust to violations.
Additional assumptions:
1. Values of Y depend linearly on the independent variables.
Not really a problem, only a matter of interpretation.
Be aware: you should not add polynomials merely to make the model fit better. With twenty polynomial terms the line is all over the place, and the model loses its relevance.
2. The parameters of the model should have the same value for all individuals.
This does not hold where there is an interaction effect. To create such an effect, multiply two variables with each other.
o Note that creating an interaction variable often means that you have to center the interacting variables first. The reason: if you want to look at the effect of educ alone, then year would have to be 0, which does not always make sense. You might instead want the effect of educ when the other variable is at its mean value. Income = β0 + β1*educ + β2*year + β3*educ*year
- Note further that the mean of categorical data does not make sense, so we only center variables measured at interval or ratio level.
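The centering step is a one-liner per variable. A minimal sketch with made-up toy values for the educ and year variables from the formula above: subtract each variable's mean, then multiply the centered versions to build the interaction term.

```python
# Sketch: center interval-level variables before building an interaction term,
# so that beta1 is interpreted as the effect of educ at the mean of year.
educ = [8, 10, 12, 12, 14, 16, 16, 18]  # toy data, illustrative only
year = [1, 3, 2, 5, 4, 6, 7, 8]

mean_educ = sum(educ) / len(educ)
mean_year = sum(year) / len(year)

educ_c = [e - mean_educ for e in educ]          # centered: mean is now 0
year_c = [y - mean_year for y in year]
interaction = [e * y for e, y in zip(educ_c, year_c)]  # educ_c * year_c

print(sum(educ_c), sum(year_c))  # both 0: centering worked
```

Setting year_c = 0 now corresponds to year at its mean, which is the meaningful reference point the text describes.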
If these assumptions are met, the estimator is BLUE: the best linear unbiased estimator. Best indicates the smallest variance of the parameters and correctly calculated standard errors. Linear means the independent variables influence the dependent variable linearly, and unbiased means that the coefficients in the model represent those of the population.
The error term indicates the difference between the actual observed values and the theoretical values obtained from the theoretical relation. The model has a non-random component (how the independent variables influence the dependent variable, based on theory) and a random component, the error term. The error term thus reflects:
- Omitted variables
- Random human behaviour
- Approximation errors
The residuals indicate the difference between the observed values and the estimated values.
Note: we can only observe the residuals, so it is the residuals that we use for testing the assumptions and the goodness of fit of the model.
Least squares principle: the sum of the squared residuals is minimized.
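The least squares principle can be verified on toy data: compute the OLS slope and intercept with the standard closed-form formulas, then check that any other line gives a larger sum of squared residuals. The data below are made up for illustration.

```python
# Sketch: the OLS line minimizes the sum of squared residuals (SSR).
x = [1.0, 2.0, 3.0, 4.0, 5.0]   # toy data, illustrative only
y = [2.1, 3.9, 6.2, 8.1, 9.8]

mx, my = sum(x) / len(x), sum(y) / len(y)
slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / \
    sum((a - mx) ** 2 for a in x)
intercept = my - slope * mx

def ssr(a, b):
    """Sum of squared residuals for the line y = a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

ssr_ols = ssr(intercept, slope)
# Perturbing the fitted line can only increase the SSR:
assert ssr_ols <= ssr(intercept, slope + 0.1)
assert ssr_ols <= ssr(intercept + 0.1, slope)
print(round(slope, 3), round(intercept, 3))
```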
Influential case: an observation that has a strong influence on the regression coefficients (with large datasets this is less of an issue). It can be measured by DFFIT, the difference between the prediction of Y with and without the observation.
- Only remove an influential case if its influence is disproportionately large, and make a strong case for why you remove it. Remove only one influential case at a time, as that removal alone may already have solved the disproportionate influence on the coefficients.
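The DFFIT idea (prediction with versus without an observation) can be sketched directly: refit the model with the observation left out and compare the two predictions at that observation's X value. Toy data, illustrative only; the last point is placed far from the rest so that it dominates the fit.

```python
# Sketch of DFFIT: difference between the prediction of Y for observation i
# from the full fit and from the fit with observation i removed.
def fit(x, y):
    """Return (intercept, slope) of a simple OLS regression of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    b = sum((a - mx) * (c - my) for a, c in zip(x, y)) / \
        sum((a - mx) ** 2 for a in x)
    return my - b * mx, b

x = [1, 2, 3, 4, 10]   # toy data; the last point lies far from the rest
y = [1, 2, 3, 4, 30]   # and pulls the regression line up

def dffit(i):
    a_full, b_full = fit(x, y)
    x_rest, y_rest = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
    a_red, b_red = fit(x_rest, y_rest)
    # Prediction at x[i] with minus without observation i:
    return (a_full + b_full * x[i]) - (a_red + b_red * x[i])

print(dffit(4))  # large: an influential case
print(dffit(1))  # small: an ordinary observation
```

This also shows why such cases should be removed one at a time: after dropping the influential point, the refitted line may describe the remaining data well without further deletions.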
Outliers: individual observations for which the model fits badly (large residual).
Dummy variables are variables representing data measured at the nominal level (country, city, religion; no order between them) or the ordinal level (education, social class,