ECONOMETRICS
WEEK 1
SLIDES:
Probability density function/marginal distribution of a random variable: function containing the
probabilities of different outcomes, denoted f(xi) = Pr (X = xi)
Discrete pdf: pdf for countable outcomes. Each outcome of X has a non-negative probability
of occurring.
Continuous pdf: pdf for non-countable outcomes. The density is non-negative everywhere, but any
single outcome has probability zero; probabilities are obtained by integrating the density over an interval.
Expected value: rules of calculation
Rule 1: When X is a constant with value c: E(c) = c
Rule 2: When a constant c is added to X: E(X + c) = E(X) + c
Rule 3: When X is multiplied by a constant c: E(cX) = cE(X)
Rule 4: When random variables X1 and X2 are summed: E(X1 + X2) = E(X1) + E(X2)
In general: E(∑j Xj) = ∑j E(Xj)
Variance: rules of calculation
Rule 1: When X is a constant with value c: VAR(c) = 0
Rule 2: When a constant c is added to X: VAR(X + c) = VAR(X)
Rule 3: When X is multiplied by a constant c: VAR(cX) = c²VAR(X)
Rule 4: When (pairwise) independent random variables are summed: VAR(X1 + X2) = VAR(X1) + VAR(X2)
In general: VAR(∑j Xj) = ∑j VAR(Xj)
Rule 5: When dependent random variables are summed: VAR(X1 + X2) = VAR(X1) + VAR(X2) + 2COV(X1, X2)
In general: VAR(∑j Xj) = ∑j ∑k COV(Xj, Xk)
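A quick numerical check of Rule 5 (my own sketch in Python/numpy, not from the slides; the covariance matrix is an arbitrary example):

```python
import numpy as np

# Quick simulation check of variance Rule 5 (own sketch; the covariance
# matrix below is an arbitrary illustration, not from the slides).
rng = np.random.default_rng(0)
cov = [[1.0, 0.6], [0.6, 2.0]]          # VAR(X1)=1, VAR(X2)=2, COV(X1,X2)=0.6
draws = rng.multivariate_normal([0.0, 0.0], cov, size=200_000)
x1, x2 = draws[:, 0], draws[:, 1]

lhs = np.var(x1 + x2, ddof=1)                                   # VAR(X1 + X2)
rhs = np.var(x1, ddof=1) + np.var(x2, ddof=1) + 2 * np.cov(x1, x2)[0, 1]
print(lhs, rhs)   # both close to 1 + 2 + 2*0.6 = 4.2
```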
Conditional distribution: the distribution of a random variable X conditional on a specific value of
another random variable G. It is defined as the ratio of the joint distribution over the marginal
distribution: f(xi | gj) = f(xi, gj) / f(gj).
Two random variables (X and G) are independent if the distribution of each variable is unaffected by
any particular outcome the other variable takes on: the joint distribution is equal to the product of the
marginal distributions: Pr(X = xi, G = gj) = Pr(X = xi) * Pr(G = gj)
Consequences of independence:
- The conditional distribution is equal to the marginal distribution: Pr(X = xi | G = gj) = Pr(X = xi)
- The covariance and correlation between the random variables are zero:
COV(X, G) = CORR(X, G) = 0
Conditional expectations: rules of calculation
Rule 1: When X is multiplied by a constant c: E(cX | G = 0) = cE(X | G = 0)
E(cX | G = 1) = cE(X | G = 1)
Rule 2: When X is multiplied by a function of G, h(G): E(h(G)X | G = 0) = h(0)E(X | G = 0)
E(h(G)X | G = 1) = h(1)E(X | G = 1)
Covariance: rules of calculation
Covariance = a measure of linear association between X and G (slide 65)
Rule 1: Covariance between X and a constant c: COV(X, c) = 0
Rule 2: Covariance between aX and bG: COV(aX, bG) = abCOV(X, G)
Rule 3: Covariance between X and X: COV(X, X) = E[(X − E(X))²] = VAR(X)
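Similarly, a short simulation check of covariance Rules 2 and 3 (own sketch; X, G, a and b are arbitrary illustrations):

```python
import numpy as np

# Quick simulation check of covariance Rules 2 and 3 (own sketch; X, G and the
# constants a, b are arbitrary illustrations).
rng = np.random.default_rng(1)
x = rng.normal(size=100_000)
g = 0.5 * x + rng.normal(size=100_000)   # G dependent on X, so COV(X,G) != 0
a, b = 2.0, -3.0

print(np.cov(a * x, b * g)[0, 1], a * b * np.cov(x, g)[0, 1])  # Rule 2
print(np.cov(x, x)[0, 1], np.var(x, ddof=1))                   # Rule 3: COV(X,X) = VAR(X)
```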
Appendix slides: do we need to know these?
WEEK 2
H2: ORDINARY LEAST SQUARES (OLS)
1. Estimating single-independent-variable models with OLS
Ordinary least squares: a regression estimation technique that calculates the ß̂s so as to minimize
the sum of the squared residuals.
Residual: difference between the actual Ys and the estimated Ys produced by the regression.
Why do we use OLS?
OLS is relatively easy to use.
The goal of minimizing the sum of the squared residuals is quite appropriate from a
theoretical point of view.
OLS estimates have a number of useful characteristics.
o The sum of the residuals is exactly zero.
o OLS can be shown to be the “best” estimator possible under a set of specific
assumptions.
2. Estimating multivariate regression models with OLS
Multivariate regression coefficient: indicates the change in the dependent variable associated with
a one-unit increase in the independent variable in question holding constant the other independent
variables in the equation.
BUT not holding constant any relevant variables that might have been omitted from the
equation.
Total sum of squares (TSS): econometricians use the squared variation of Y around its mean as a measure
of the amount of variation to be explained by the regression: TSS = ∑i (Yi − Ȳ)²
TSS has two components, variation that can be explained by the regression and variation that
cannot:
o Explained sum of squares (ESS): ∑i (Ŷi − Ȳ)²
o Residual sum of squares (RSS): ∑i ei²
TSS = ESS + RSS. This is usually called the decomposition of variance (figure 3, page 49).
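A minimal sketch of this decomposition on simulated data (own example; the coefficients and variable names are illustrative, not from the book):

```python
import numpy as np

# Own sketch of the decomposition of variance for a simple regression
# (simulated data; the true coefficients 2.0 and 0.5 are illustrative).
rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = 2.0 + 0.5 * x + rng.normal(size=100)

X = np.column_stack([np.ones_like(x), x])        # regressors including a constant
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]  # OLS coefficients
y_hat = X @ beta_hat
e = y - y_hat                                    # residuals

TSS = np.sum((y - y.mean()) ** 2)
ESS = np.sum((y_hat - y.mean()) ** 2)
RSS = np.sum(e ** 2)
print(TSS, ESS + RSS)            # equal up to rounding: TSS = ESS + RSS
print(ESS / TSS, 1 - RSS / TSS)  # two equivalent ways to compute R-squared
```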
4. Describing the overall fit of the estimated model
R-squared (R²): the simplest commonly used measure of fit. It is the ratio of the explained sum of squares to the total
sum of squares: R² = ESS/TSS = 1 − RSS/TSS = 1 − (∑i ei²) / (∑i (Yi − Ȳ)²).
The higher R² is, the closer the estimated regression equation fits the sample data. It
measures the percentage of the variation of Y around Ȳ that is explained by the regression
equation.
Since OLS selects the coefficient estimates that minimize RSS, OLS provides the largest
possible R², given a linear model. Since TSS, RSS, and ESS are all nonnegative, and since ESS ≤
TSS, R² must lie in the interval 0 ≤ R² ≤ 1; a value of R² close to one indicates an excellent overall
fit.
o R² = 0 → the fitted regression is a horizontal line at Ȳ (the regression explains none of the variation in Y).
o Figure 5, page 53: this kind of result is typical of a time-series regression. In time-series
data, we often get a very high R² because there can be significant time trends
on both sides of the equation. In cross-sectional data, we often get low R² because
the observations (say, countries) differ in ways that are not easily quantified.
o Figure 6, page 53: reported equations with R² equal to one should be viewed with
suspicion: they very likely do not explain the movements of the dependent variable Y
in terms of the causal proposition advanced, even though they explain them
empirically. This caution applies to economic applications, but not necessarily to
those in fields like physics or chemistry.
Simple correlation coefficient (r): a measure of the strength and direction of the linear relationship
between two variables.
It turns out that r and R2 are related if the estimated equation has exactly one independent variable.
The square of r equals R2 for a regression where one of the two variables is the dependent variable
and the other is the only independent variable.
A major problem with R² is that adding another independent variable to a particular equation can
never decrease R². Adding a variable can’t change TSS, but in most cases the added variable will
reduce RSS, so R² will rise. RSS will never increase because the OLS program could
always set the coefficient of the added variable equal to zero, thus giving the same fit as the previous
equation (and R² would stay the same).
The lower the degrees of freedom, the less reliable the estimates are likely to be. Thus, the increase
in the quality of the fit caused by the addition of a variable needs to be compared to the decrease in
the degrees of freedom before a decision can be made with respect to the statistical impact of the
added variable. In sum, R² is of little help if we’re trying to decide whether adding a variable to an
equation improves our ability to meaningfully explain the dependent variable. Because of this
problem, econometricians have developed another measure of the quality of the fit of an equation:
adjusted R², denoted R̄², which is R² adjusted for degrees of freedom:
R̄² = 1 − [∑i ei² / (N − K − 1)] / [∑i (Yi − Ȳ)² / (N − 1)]
It measures the percentage of the variation of Y around its mean that is
explained by the regression equation, adjusted for degrees of freedom. It can be used to
compare the fits of equations with the same dependent variable and different numbers of
independent variables. Because of this property, most researchers automatically use R̄²
instead of R² when evaluating the fit of their estimated regression equations.
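A small sketch of the difference between R² and adjusted R² (own simulated example; the irrelevant regressor x2 is an illustrative assumption):

```python
import numpy as np

# Own sketch: R-squared vs. adjusted R-squared when an irrelevant regressor is
# added (simulated data; x2 and all coefficients are illustrative assumptions).
def fit_r2(X, y):
    n, k = len(y), X.shape[1] - 1                 # k slope coefficients
    e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    rss, tss = np.sum(e ** 2), np.sum((y - y.mean()) ** 2)
    r2 = 1 - rss / tss
    r2_adj = 1 - (rss / (n - k - 1)) / (tss / (n - 1))
    return r2, r2_adj

rng = np.random.default_rng(3)
x1, x2 = rng.normal(size=(2, 50))
y = 1.0 + 0.8 * x1 + rng.normal(size=50)          # x2 plays no role in y

print(fit_r2(np.column_stack([np.ones(50), x1]), y))        # without x2
print(fit_r2(np.column_stack([np.ones(50), x1, x2]), y))    # with irrelevant x2
# R-squared never falls when x2 is added; adjusted R-squared can fall.
```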
H3: LEARNING TO USE REGRESSION ANALYSIS
1. Steps in applied regression analysis
Once a dependent variable is chosen, it’s logical to follow this sequence:
1. Review the literature and develop the theoretical model.
2. Specify the model: select the independent variables and the functional form.
3. Hypothesize the expected signs of the coefficients.
4. Collect the data. Inspect and clean the data.
5. Estimate and evaluate the equation.
6. Document the results.
SLIDES:
Minimize the sum of squared residuals: this can be done by taking the first-order partial derivatives
with respect to ß̂0 and ß̂1, and setting these derivatives to zero to find the minimum.
OLS also gives us a measure of the average size of a residual, in units of Y: root mean squared
error/standard error of the regression.
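A minimal sketch of the resulting closed-form bivariate solution and the standard error of the regression (own simulated example; the coefficient values are illustrative):

```python
import numpy as np

# Own sketch of the closed-form bivariate OLS solution implied by the
# first-order conditions, plus the standard error of the regression
# (simulated data; true coefficients 2.0 and 0.5 are illustrative).
rng = np.random.default_rng(4)
x = rng.normal(size=100)
y = 2.0 + 0.5 * x + rng.normal(size=100)

beta1_hat = np.cov(x, y)[0, 1] / np.var(x, ddof=1)   # slope = COV(X,Y) / VAR(X)
beta0_hat = y.mean() - beta1_hat * x.mean()          # intercept
e = y - (beta0_hat + beta1_hat * x)                  # residuals (sum is ~0)

k = 1                                                # one slope coefficient
ser = np.sqrt(np.sum(e ** 2) / (len(y) - k - 1))     # standard error of the regression
print(beta0_hat, beta1_hat, ser)
```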
Ŷi < Yi: underprediction, positive residual (ei = Yi − Ŷi > 0)
Ŷi > Yi: overprediction, negative residual (ei < 0)
Unbiasedness of OLS: assumptions
All these 4 assumptions are needed for OLS to be an unbiased estimator:
1. Population model is linear in parameters (and the error term is additive)
2. Error term has a zero population mean: E(Ɛi) = 0
This assumption is met as long as a constant (ß0) is included in the model.
This is because the constant will always absorb any non-zero mean of the error term.
Not including a constant leads to biased estimates.
3. All independent variables are uncorrelated with the error term: CORR(Ɛi, Xi) = 0.
This assumption states that the X variables have to be exogenous.
Most important assumption: without it, our estimates do NOT have a causal
interpretation.
Omitted variable bias is the most important reason why this assumption can fail.
4. No perfect (multi)collinearity between independent variables (and no variable is a constant).
Bivariate case: the OLS solution ß̂1 = COV(Xi, Yi) / VAR(Xi) shows that X cannot be constant,
i.e. we cannot have VAR(Xi) = 0.
Multivariate case: we additionally have that independent variables cannot be perfect
linear functions of each other (no perfect (multi)collinearity).
This assumption is only about perfect collinearity, high correlations between
variables do not violate this assumption.
Assumptions 2 and 3 can be written together as E(Ɛi | Xi1, Xi2, …, Xik) = E(Ɛi) = 0
OLS estimator is unbiased if the expected value of the estimates produced by the estimator equals
the population parameter. That is, for some population parameter θ: E(𝜃̂) = θ.
So if we had many different samples from the population, and we would use the OLS estimator to
calculate the estimate 𝜃̂ in each of these samples, the average value of these estimates would equal
the population value θ.
An estimator is unbiased if the average value of the estimator in an infinite number of samples
equals the population parameter.
An estimator is consistent if the estimator converges to the population parameter as the size of the
sample tends toward infinity.
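A short simulation illustrating unbiasedness in this "many samples" sense (own sketch; all parameter values are illustrative):

```python
import numpy as np

# Own sketch: unbiasedness checked by simulation. Many samples are drawn from a
# known population model and the slope estimates are averaged (all parameter
# values are illustrative assumptions).
rng = np.random.default_rng(5)
beta0, beta1 = 1.0, 0.5
n_samples, n_obs = 5_000, 50

slopes = np.empty(n_samples)
for s in range(n_samples):
    x = rng.normal(size=n_obs)
    eps = rng.normal(size=n_obs)                 # exogenous error with mean zero
    y = beta0 + beta1 * x + eps
    slopes[s] = np.cov(x, y)[0, 1] / np.var(x, ddof=1)   # bivariate OLS slope

print(slopes.mean())   # close to the population value beta1 = 0.5
```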
Variance of the OLS estimators: assumptions
We can obtain an unbiased estimate of VAR(ß̂ ) using OLS, if assumptions 1-4 hold, as well as:
5. No serial correlation: errors are not correlated with each other across different observations,
CORR(Ɛi, Ɛj) = 0 – this is mostly important for timeseries.
6. No heteroskedasticity: the error term has constant variance, VAR(Ɛi) = σ² (where σ² is a constant).
We don’t need this assumption to have E(ß̂) = ß, but we do need it to obtain an
unbiased estimate of the error variance, VAR(Ɛi).
This unbiased estimator of the error variance is in turn needed for an unbiased
estimate of the variance of ß̂, VAR(ß̂).
Under all 6 assumptions, OLS is BLUE: the estimator with the smallest variance (i.e. most efficient)
among linear unbiased estimators.
Under these 6 assumptions, the variance of the OLS estimate of ßk is:
VAR(ß̂k) = VAR(Ɛi) / [(1 − Rk²) ∑i (Xki − X̄k)²] = σ² / [(1 − Rk²) TSSXk]
σ² = variance of the error term
TSSXk = total sum of squares of the independent variable Xk
Rk² = R² from an auxiliary regression of Xk on all other independent variables.
The larger VAR(ß̂k), the larger the ‘sampling uncertainty’ – that is, the higher the chance that the ß̂k
found in any particular sample is far from the true population ßk.
The sampling uncertainty is smaller:
- The smaller σ²: that is, the less error.
o Problem: we can’t observe σ², but we do observe the residuals, and we can use these to
construct an estimate of σ²:
σ̂² = ∑i ei² / (n − k − 1)
o Hence we obtain:
VAR(ß̂k) = ∑i ei² / [(n − k − 1)(1 − Rk²) TSSXk]
o The standard error of the estimated parameter is:
se(ß̂k) = σ̂ß̂k = σ̂ / √[(1 − Rk²) TSSXk]
- The larger TSSXk: that is, the more variation in Xk.
- The smaller Rk²: that is, the more variation in Xk that is not shared with the other regressors.
o A higher Rk² (close to 1) indicates that much of the variation in the independent
variable Xk is shared with other independent variables included in the regression.
o This means that, when Xk changes, the other independent variables often also
change – it is therefore difficult to isolate the effect of Xk on Y while holding the other
X variables constant (= partial effect).
o This phenomenon is called multicollinearity: unlike perfect multicollinearity,
multicollinearity does not cause bias in the OLS estimates, but it does increase their
variances.
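A sketch tying the pieces together: the standard error of one slope computed from the formula above, with an auxiliary regression for Rk² (own simulated example; the correlation between the two regressors is an assumption for illustration):

```python
import numpy as np

# Own sketch: standard error of one slope via the variance formula above, with
# an auxiliary regression for Rk^2 (simulated data; the correlation between the
# two regressors is an illustrative assumption).
def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(6)
n = 200
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(size=n)               # correlated with x1 -> Rk^2 > 0
y = 1.0 + 0.5 * x1 - 0.3 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
e = y - X @ ols(X, y)
k = 2                                            # number of slope coefficients
sigma2_hat = np.sum(e ** 2) / (n - k - 1)        # estimate of the error variance

# Auxiliary regression of x1 on the other regressor gives R1^2; TSS of x1.
Z = np.column_stack([np.ones(n), x2])
u = x1 - Z @ ols(Z, x1)
TSS_x1 = np.sum((x1 - x1.mean()) ** 2)
R1_sq = 1 - np.sum(u ** 2) / TSS_x1

se_beta1 = np.sqrt(sigma2_hat / ((1 - R1_sq) * TSS_x1))
print(se_beta1)
```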
WEEK 3
H5: Hypothesis testing
1. What is hypothesis testing?
The first step in hypothesis testing is to state the hypotheses to be tested. This should be done before
the equation is estimated because hypotheses developed after estimation run the risk of being
justifications of particular results rather than tests of validity of those results.
Null hypothesis: a statement of the values that the researcher does not expect.
Alternative hypothesis: a statement of the values that the researcher expects.
Since the regression coefficients are only estimates of the true population parameters, it would be
unrealistic to think that conclusions drawn from regression analysis will always be right. There are
two kinds of errors:
- Type 1 error: We reject a true null hypothesis.
- Type 2 error: We do not reject a false null hypothesis
Decreasing the probability of a type 1 error means increasing the probability of a type 2 error.
Decision rule: method of deciding whether to reject a null hypothesis. Typically, a decision rule
involves comparing a sample statistic with a preselected critical value. A decision rule should be
formulated before regression estimates are obtained.
2. The t-test
The t-statistic is the appropriate test to use when the stochastic error term is normally distributed
and when the variance of that distribution must be estimated. t-tests are usually done on the
slope coefficients; for these, the relevant form of the t-statistic for the kth coefficient is:
tk = (ß̂k − ßH0) / SE(ß̂k)
The level of type 1 error is also called the level of significance. The level of significance indicates the
probability of observing an estimated t-value greater than the critical t-value if the null hypothesis is
correct. It measures the amount of type 1 error implied by a particular critical t-value.
We recommend using a 5-percent level of significance except in those circumstances when you know
something unusual about the relative costs of making type 1 and type 2 errors.
Confidence interval
Confidence interval: a range that contains the true value of an item a specified percentage of the
time. This percentage is the level of confidence, which is associated with the level of significance used
to choose the critical t-value in the interval. For an estimated regression coefficient, the confidence
interval can be calculated using the two-sided critical t-value and the standard error of the estimated
coefficient:
Confidence interval = ß̂ ± tc * SE(ß̂)
p-values
p-value: the probability of observing a t-score that size or larger (in absolute values) if the null
hypothesis is true. It tells us the lowest level of significance at which we could reject the null
hypothesis. P-values are printed out for two-sided alternative hypotheses. If your test is one-sided,
you need to divide the p-value in your regression output by 2.
Reject H0 if p-value < the level of significance and if ß̂ has the sign implied by H1.
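A small sketch of the t-statistic, p-value and confidence-interval calculations (own example; the estimate, standard error and sample sizes are placeholder values, e.g. as produced by the standard-error sketch in week 2):

```python
import numpy as np
from scipy import stats

# Own sketch: t-statistic, two-sided p-value and 95% confidence interval for a
# slope estimate. The estimate, standard error and sample sizes below are
# placeholder values, not real results.
beta_hat, se = 0.48, 0.07
n, k = 200, 2
beta_H0 = 0.0                                # value under the null hypothesis
df = n - k - 1                               # degrees of freedom

t_stat = (beta_hat - beta_H0) / se
p_value = 2 * stats.t.sf(abs(t_stat), df)    # two-sided p-value
t_crit = stats.t.ppf(0.975, df)              # two-sided 5% critical value
ci = (beta_hat - t_crit * se, beta_hat + t_crit * se)
print(t_stat, p_value, ci)                   # reject H0 at 5% if p_value < 0.05
```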
3. Examples of t-tests
READ
4. Limitations of the t-test
The t-test does not test theoretical validity
Sometimes there is, by chance, a common trend on both sides of the equation; this common trend
does not have any substantive meaning.
The t-test does not test “importance”
Statistical significance indicates the likelihood that a particular sample result could have been
obtained by chance, but it says little – if anything – about which variables determine the major
portion of the variation in the dependent variable. To determine importance, a measure such as the
size of the coefficient multiplied by the average size of the independent variable or the standard
error of the independent variable would make much more sense.
The t-test is not intended for tests of the entire population
All the t-test does is help to decide how likely it is that a particular small sample will cause a
researcher to make a mistake in rejecting hypotheses about the true population parameters. If the
sample size is large enough to approach the population, then the standard error will fall close to zero
because the distribution of estimates becomes more and more narrowly distributed around the true
parameter. If the sample size is large enough, you can reject almost any null hypothesis.
6. Appendix: The F-test
F-test: a formal hypothesis test that is designed to deal with a null hypothesis that contains multiple
hypotheses or a single hypothesis about a group of coefficients. Such “joint” or “compound” null
hypotheses are appropriate whenever the underlying economic theory specifies values for multiple
coefficients simultaneously.
1. Translate the particular null hypothesis in question into constraints that will be placed on the
equation. The resulting constrained equation can be thought of as what the equation would
look like if the null hypothesis were correct.
2. Estimate this constrained equation with OLS and compare the fit of the constrained equation
with the fit of the unconstrained equation. If the fit of the unconstrained equation is
significantly better than that of the constrained equation, then we reject the null hypothesis.
F-statistic:
F = [(RSSM − RSS) / M] / [RSS / (N − K − 1)]
where RSSM is the residual sum of squares of the constrained equation, RSS that of the unconstrained
equation, and M the number of constraints. RSSM is always greater than or equal to RSS.
For the F-test of overall significance, the equation simplifies to:
F = (ESS / K) / (RSS / (N − K − 1)) = [∑i (Ŷi − Ȳ)² / K] / [∑i ei² / (N − K − 1)]
In this case, the “constrained equation” to which we’re comparing the overall fit is Yi = ß0 + Ɛi which is
nothing more than saying Ŷ i = Y̅. Thus the F-test of overall significance is really testing the null
hypothesis that the fit of the equation isn’t significantly better than that provided by using the mean
alone.
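A sketch of the F-test of overall significance on simulated data (own example; the data-generating process is an illustrative assumption):

```python
import numpy as np
from scipy import stats

# Own sketch of the F-test of overall significance (simulated data; the
# data-generating process is an illustrative assumption).
rng = np.random.default_rng(7)
n, K = 120, 2
x1, x2 = rng.normal(size=(2, n))
y = 0.5 + 0.4 * x1 + rng.normal(size=n)          # x2 is irrelevant here

X = np.column_stack([np.ones(n), x1, x2])
y_hat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
ESS = np.sum((y_hat - y.mean()) ** 2)
RSS = np.sum((y - y_hat) ** 2)

F = (ESS / K) / (RSS / (n - K - 1))
p_value = stats.f.sf(F, K, n - K - 1)            # H0: all slope coefficients are 0
print(F, p_value)
```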
Other uses of the F-test
Double-log functional form: one of the properties of a double-log equation is that the coefficients can
be used to test for constant returns to scale. It can be shown that a Cobb-Douglas production function
with constant returns to scale is one where ß1 and ß2 add up to exactly 1, so the null hypothesis to be
tested is: H0: ß1 + ß2 = 1.
Economic theory suggests that the slope coefficients of a Cobb-Douglas production function should be
between 0 and 1; if this is not the case (as on p. 170) we should be extremely cautious.
READ: p.169/170
Seasonal dummies: dummy variables that are used to account for seasonal variation in the data in
time-series models.
Inclusion of a set of seasonal dummies ‘deseasonalizes’ Y. This procedure may be used as long as Y
and X4 are not ‘seasonally adjusted’ prior to estimation. To test the hypothesis of significant
seasonality in the data, one must test the hypothesis that all the dummies equal zero simultaneously
rather than test the dummies one at a time: use the F-test. If the hypothesis of seasonal variation can
be summarized into a single dummy variable, then the use of the t-test will cause no problems.
READ: p171 (only the test).
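A sketch of this joint F-test for seasonal dummies, comparing constrained and unconstrained equations (own simulated quarterly example; the seasonal pattern and other parameters are illustrative assumptions):

```python
import numpy as np
from scipy import stats

# Own sketch: joint F-test that all seasonal dummy coefficients are zero,
# comparing a constrained and an unconstrained equation (simulated quarterly
# data; the seasonal pattern and other parameters are illustrative).
rng = np.random.default_rng(8)
n = 160
quarter = np.arange(n) % 4
x = rng.normal(size=n)
season = np.array([0.0, 0.6, -0.4, 0.2])[quarter]       # true seasonal effects
y = 1.0 + 0.5 * x + season + rng.normal(size=n)

def rss(X, y):
    e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return np.sum(e ** 2)

D = np.column_stack([(quarter == q).astype(float) for q in (1, 2, 3)])  # 3 dummies
X_u = np.column_stack([np.ones(n), x, D])                # unconstrained equation
X_c = np.column_stack([np.ones(n), x])                   # constrained (dummies = 0)

M, K = 3, X_u.shape[1] - 1                               # constraints, slope coefficients
F = ((rss(X_c, y) - rss(X_u, y)) / M) / (rss(X_u, y) / (n - K - 1))
print(F, stats.f.sf(F, M, n - K - 1))                    # small p-value -> seasonality
```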
H6: Specification: Choosing the independent variable
Specifying an econometric equation consists of three parts:
1. Choosing the correct independent variables
2. Choosing the correct functional form
3. Choosing the correct form of the stochastic error term
Leaving a relevant variable out of an equation is likely to bias the remaining estimates, but including
an irrelevant variable leads to higher variances of the estimated coefficients. We suggest trying to
minimize the number of regressions estimated and relying as much as possible on theory rather than
statistical fit when choosing variables.
1. Omitted variables
If a variable is omitted, then it is not included as an independent variable, and it is not held constant
for the calculation and interpretation of ß̂k. This omission can cause bias: it can force the expected
value of the estimated coefficient away from the true value of the population coefficient.
The consequences of an omitted variable
The major consequence of omitting a relevant independent variable from an equation is to cause
bias in the regression coefficients that remain in the equation. If you omit X2 from the equation, then
Ɛ*i = Ɛi + ß2X2i. The included coefficients almost surely pick up some of the effect of the omitted
variable and therefore will change, causing bias.
Most pairs of variables are correlated to some degree, even if that correlation is random, so X1 and X2
almost surely are correlated. When X2 is omitted from the equation, the impact of X2 goes into Ɛ*, so
Ɛ* and X2 are correlated. Thus if X2 is omitted from the equation and X1 and X2 are correlated, both X1
and Ɛ* will change when X2 changes, and the error term will no longer be independent of the
explanatory variable. That violates assumption 3! So, if we leave an important variable out of an
equation, we violate assumption 3, unless the omitted variable is uncorrelated with all the included
independent variables.
To generalize for a model with two independent variables, the expected value of the coefficient of an
included variable when a relevant variable is omitted from the equation equals:
E(ß̂1) = ß1 + ß2 * α1
where α1 is the slope coefficient of the secondary regression that relates X2 to X1:
X2i = α0 + α1X1i + ui
Since the expected value of an unbiased estimate equals the true value, the right-hand term of the
expression above measures the omitted variable bias in the equation:
Bias = ß2 * α1, or more generally Bias = ßom * f(rin,om)
where f(rin,om) is a function of the correlation between the included and omitted variables.
This bias exists unless:
1. The true coefficient equals zero
2. The included and omitted variables are uncorrelated
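A short simulation of the bias formula (own sketch; all parameter values are illustrative): the average slope from the regression that omits X2 is approximately ß1 + ß2 * α1.

```python
import numpy as np

# Own sketch: omitted variable bias by simulation. X2 is related to X1 and left
# out of the estimated equation; the average estimated slope drifts away from
# beta1 by roughly beta2 * alpha1 (all parameter values are illustrative).
rng = np.random.default_rng(9)
beta1, beta2, alpha1 = 0.5, 1.0, 0.8
n_samples, n = 5_000, 200

slopes = np.empty(n_samples)
for s in range(n_samples):
    x1 = rng.normal(size=n)
    x2 = alpha1 * x1 + rng.normal(size=n)                # X2i = alpha1 * X1i + ui
    y = 1.0 + beta1 * x1 + beta2 * x2 + rng.normal(size=n)
    slopes[s] = np.cov(x1, y)[0, 1] / np.var(x1, ddof=1) # regression that omits X2

print(slopes.mean(), beta1 + beta2 * alpha1)             # both close to 1.3
```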