Econometrics
Week 1
Econometrics is the quantitative measurement and analysis of actual economic and business
phenomena, trying to bridge the gap between the abstract economic world and the real
world. Econometrics has three uses:
Describing economic reality
Testing hypotheses about economic theory and policy
Forecasting future economic activity
A typical regression equation: Y = b0 + b1X + epsilon
b0 is the intercept/constant term.
b1 is the slope coefficient.
b0 + b1X together form the deterministic component of the equation (= the mean value of Y for a given X), but there also is a stochastic component.
Epsilon is the stochastic error term, capturing all of the variation in Y that cannot be
explained by the included Xs.
The error term is necessary because:
There are omitted influences on Y.
It is impossible to avoid measurement error.
The theoretical equation might have a different functional form (/shape) than the one
chosen.
Human behavior is at least a bit unpredictable and random.
If the regression equation is multivariate, there are multiple Xs in it:
Yi = b0 + b1X1i + b2X2i + … + bKXKi + epsilon_i, where i runs from 1 to N and there are in total K
independent variables. The error term also differs per observation in the true regression.
Estimated (= fitted) numbers and variables carry a hat (^). The residual ei = Yi – Ŷi; the smaller the
residuals, the better the fit of the equation. The residual isn't the same as the error term
epsilon_i = Yi – E(Yi|Xi), which is a theoretical concept that can never be observed, while the
residual is a real-world value. The residual can be thought of as an estimate of the error term.
Cross-sectional data: all observations are from the same point in time and represent different
individual economic entities (such as countries or houses) at that point in time.
OLS = Ordinary Least Squares: minimizes the sum of all squared residuals. It’s the most
popular estimator, as:
It is relatively easy to use.
The goal of OLS (minimizing the sum of all squared residuals) is theoretically
appropriate.
It has some useful characteristics: the sum of the residuals is always exactly zero, and
OLS is the best estimator available under its assumptions.
Root MSE (= root mean squared error) = the average size of the residuals, in the units of Y.
OLS estimates b1 if there is only one independent variable as:
b1 = (sum of (Xi – X-bar) * (Yi – Y-bar)) / (sum of (Xi – X-bar)^2), and b0 then follows from
b0 = Y-bar – b1 * X-bar.
X-bar and Y-bar are the means of X and Y.
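As a quick illustration (my own, not from the lecture), a minimal NumPy sketch of these formulas with made-up data; it also shows that the OLS residuals sum to zero:

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical independent variable
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # hypothetical dependent variable

x_bar, y_bar = X.mean(), Y.mean()

# Slope: sum of (Xi - X-bar)(Yi - Y-bar) divided by sum of (Xi - X-bar)^2
b1 = np.sum((X - x_bar) * (Y - y_bar)) / np.sum((X - x_bar) ** 2)
# Intercept from the means: b0 = Y-bar - b1 * X-bar
b0 = y_bar - b1 * x_bar

Y_hat = b0 + b1 * X   # fitted values
e = Y - Y_hat         # residuals; their sum is (numerically) zero under OLS
print(b0, b1, e.sum())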
To measure the adequacy of an estimated regression, you can calculate the total sum of
squares TSS = sum of all (Y – Y-bar)^2, which measures the total variation of Y around its mean. Part of this variation
is explained by the regression and part is not, so TSS = ESS + RSS.
ESS = explained sum of squares = sum of all (Ŷ – Y-bar)^2 = the variation explained by
the regression line.
RSS = residual sum of squares = sum of all squared residuals = the unexplained variation.
The smaller the RSS relative to the TSS, the better the regression line fits the data. OLS
minimizes RSS and therefore maximizes ESS for a given TSS.
You can measure how well an equation fits the data by R^2 = the coefficient of determination
= ESS/TSS = 1 – RSS/TSS = 1 – (sum of squared residuals) / (sum of (Y – Y-bar)^2). The higher R^2,
the closer the equation fits the sample. These measures are 'goodness of fit' measures. R^2
measures the percentage of the variation of Y around Y-bar that is explained by the
regression equation, so OLS maximizes R^2 (given a linear model); it lies between 0 and 1.
In time-series data, R^2 is often quite high, but in cross-sectional data we often get low R^2s
because the observations differ in ways that are not easily quantified; 0.5 could already be a good fit.
If R^2 equals one, be suspicious: such an equation is very unlikely to explain the dependent variable
in samples other than the one used to create it. The fit of an equation is only
one measure of the overall quality of a regression. The equation should be theoretically
correct first, because of the danger of spurious regression: variables that are correlated purely by coincidence in one period.
The problem is that adding a new independent variable never lowers R^2, so you
should also look at the degrees of freedom = N – (K + 1),
N being the number of observations, K the number of independent variables (slope coefficients), and 1 for the
intercept. The fewer the degrees of freedom, the less reliable the estimates are likely to be.
So a rise in R^2 should be weighed against the decrease in degrees of freedom, which is done
by the adjusted R^2 (R-bar^2) = 1 – ((sum of all e^2) / (N – K – 1)) / ((sum of (Y – Y-bar)^2) / (N – 1)), measuring the
percentage of the variation of Y around its mean that is explained by the regression
equation, adjusted for degrees of freedom. Unlike R^2, it can also be slightly negative.
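Continuing the NumPy sketch from earlier, a small function (my own, not from the lecture) for these goodness-of-fit measures, where K is the number of independent variables:

import numpy as np

def fit_stats(Y, Y_hat, K):
    """R^2 and adjusted R^2 for a regression with K independent variables."""
    N = len(Y)
    e = Y - Y_hat
    TSS = np.sum((Y - Y.mean()) ** 2)   # total variation of Y around its mean
    RSS = np.sum(e ** 2)                # unexplained (residual) variation
    ESS = TSS - RSS                     # explained variation
    r2 = 1 - RSS / TSS                  # = ESS / TSS
    r2_bar = 1 - (RSS / (N - K - 1)) / (TSS / (N - 1))  # adjusted for degrees of freedom
    return r2, r2_bar

# e.g. fit_stats(Y, Y_hat, K=1) for the single-regressor fit above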
After choosing a dependent variable, you should:
1. Review the literature and develop a theoretical model.
2. Specify the model by selecting independent variables and the functional form.
3. Hypothesize the expected signs of the coefficients.
4. Collect the data and inspect and clean it; all variables should have the same number
of observations, frequency, and time period. The units of measurement of the
variables (whether you use dollars or thousands of dollars) do not matter for the fit;
they only rescale the coefficients. The data should be realistic, so you should check it,
for example by plotting a graph and looking for outliers.
5. Estimate and evaluate the equation, by using OLS. You only move to step 6 if you are
satisfied with the estimated equation, otherwise, you should go back to step 1 and
start over.
6. Document the results.
Dummy variable trap: including a separate 1/0 dummy for every category of a variable, for example
one dummy for being a man and another for being a woman. If you
have three categories, you can use two dummy variables; if both are equal to zero, the observation is in the
third category. But be careful in interpreting this: the coefficient of each dummy measures the
impact compared to the omitted condition, which is when all dummies are zero.
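A small sketch (made-up categories, not from the lecture) of how to avoid the trap with three categories and two dummies:

import numpy as np

categories = np.array(["north", "south", "west", "south", "north", "west"])

# With three categories, use only two dummies; "north" is the omitted base category.
d_south = (categories == "south").astype(float)
d_west = (categories == "west").astype(float)

# Design matrix: intercept plus two dummies. Adding a third dummy for "north" would make
# the dummy columns sum to the intercept column (perfect collinearity), which is the trap.
X = np.column_stack([np.ones(len(categories)), d_south, d_west])
print(X)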
A random variable is a variable whose numerical value is determined by chance. A discrete
random variable has a countable number of possible values, and a continuous random
variable (such as time) can take any value in an interval.
A probability distribution shows the probability of each possible value of a discrete random variable.
A continuous variable is more difficult to handle: you work with intervals instead of
individual values, and a continuous probability density curve displays these
interval probabilities.
If the density function is symmetrical, the mean is in the center.
Expected-value maximizing has a weak point: it doesn't take risk into account. Because it focuses
only on the expected result in the long run, it sees no difference
between a sure 1 million and a 1% chance of 100 million, while in fact there is one.
Variance = sigma^2 = a weighted average of the squared differences between the variable's possible values and its
expected value, using the probability of each value as weight.
Sigma = standard deviation.
Z = standardized random variable = (X – mean) / sigma.
The standardized random variable always has a mean of 0 and a standard deviation of 1.
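A quick numerical check (my own example with a made-up discrete distribution) of the expected value, variance, and standardization formulas:

import numpy as np

values = np.array([0.0, 1.0, 2.0, 3.0])   # possible values of the random variable
probs = np.array([0.1, 0.4, 0.3, 0.2])    # their probabilities (sum to 1)

mean = np.sum(probs * values)                     # expected value
variance = np.sum(probs * (values - mean) ** 2)   # probability-weighted squared deviations
sigma = np.sqrt(variance)                         # standard deviation

Z = (values - mean) / sigma                       # standardized values
print(mean, variance, sigma)
print(np.sum(probs * Z), np.sum(probs * Z ** 2))  # standardized mean = 0, variance = 1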
Central limit theorem: if Z is a standardized sum of N independent, identically distributed
random variables with a finite, nonzero standard deviation, then the probability distribution
of Z approaches the normal distribution as N increases. Carl Friedrich Gauss was important in
applying this distribution, so it's also called the Gaussian distribution.
You can use a table to look up the values of Z at different levels, but some values are
good to know by heart. A normally distributed variable has about a 68% chance of being within
one sigma of its mean, about a 95% chance of being within two sigma, and about a 99.7% chance of
being within three sigma.
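A rough simulation (my own, using uniform draws as the i.i.d. variables) of the central limit theorem and the 68/95/99.7 rule:

import numpy as np

rng = np.random.default_rng(0)
N = 30                                              # variables summed per draw
sums = rng.uniform(size=(100_000, N)).sum(axis=1)   # sums of N uniform(0, 1) variables

Z = (sums - sums.mean()) / sums.std()               # standardize the sums

for k in (1, 2, 3):
    share = np.mean(np.abs(Z) <= k)
    print(f"within {k} sigma: {share:.3f}")         # roughly 0.68, 0.95, 0.997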
OLS is the best estimator under some conditions, the Classical Assumptions:
I. The regression model is linear, is correctly specified, and has an additive error term. If
the function is, for example, exponential rather than linear, you can make the equation
linear in the coefficients by taking the natural log (ln) of both sides and relabeling the
variables (see the sketch at the end of this section).
II. The error term has a zero population mean. This can be ensured because the constant
term b0 absorbs any nonzero mean of the error term.
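A small sketch (my own illustration with made-up data) of the log trick mentioned under Assumption I: an exponential model with a multiplicative error becomes linear with an additive error after taking logs.

import numpy as np

rng = np.random.default_rng(1)
X = np.linspace(1.0, 10.0, 50)
Y = np.exp(0.5 + 0.3 * X) * np.exp(rng.normal(scale=0.1, size=X.size))  # multiplicative error

log_Y = np.log(Y)   # ln Y = 0.5 + 0.3*X + epsilon: linear with an additive error term

# Fit the relabeled (log) equation with the OLS formulas from earlier.
b1 = np.sum((X - X.mean()) * (log_Y - log_Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = log_Y.mean() - b1 * X.mean()
print(b0, b1)   # close to the true 0.5 and 0.3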