Quantitative methods
Lecture 1
- Correlation
- Regression analysis gives more than correlation analysis.
- Relationships between variables
-> dependent variable Y: the variable to be explained
-> independent variable X: the explanatory variable.
- Regress Y on X (terminology), so dependent on independent.
- A causal effect is often hypothesized, but not necessarily; the effect can
be positive or negative.
- Bivariate relationship easiest to show in scatterplot.
- Slope of line is informative, intercept can be informative
-> with regression, we get everything we get from correlation, plus the
information from the line.
- Correlation is the strength of the linear association between X and Y.
-> strongly correlated: draw an imaginary line, and the points will be close
to the line.
-> weakly correlated: draw an imaginary line, and the points will be further
away from the line.
- Correlation coefficient (rho) or r
-> degree or strength of (linear) association between two variables.
-> is the standardized covariation (if one thing changes, how much does
the other thing change as well) between two variables X and Y.
-> standardization with respect to scale (variation in X and variation in Y)
- $r = \frac{\operatorname{cov}(X, Y)}{\sqrt{\operatorname{var}(X) \cdot \operatorname{var}(Y)}}$
- Covariation is how moving away from the average in X relates to moving
away from the average in Y (variation in X along with variation in Y, with
respect to the averages).
-> variation in X or Y: for each point in the univariate distribution, how
far is it, on average, from the average?
- Covariance (X, Y) = sum of the products of the deviations in X and Y over
all data points i
- Variance (X) = sum of squared deviations in X
- Variance (Y) = sum of squared deviations in Y
-> strictly, each sum is also divided by n - 1, but that factor cancels in r.
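- Formula for r: $r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \cdot \sum_i (y_i - \bar{y})^2}}$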
- Interpretation
-> ranges from -1 to 1
-> +1 means perfect positive correlation or perfect positive (linear)
relationship (draw a line, every point is on the line)
-> -1 means perfect negative correlation or perfect negative (linear)
relationship (also perfect correlation, every point is on a line)
-> 0 means no (linear) correlation or relationship.
- The correlation coefficient only tells something about linear relationships.
-> there could be a relationship, but if it’s not linear, this coefficient
doesn’t help.
- The correlation coefficient always lies between -1 and 1.
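A minimal sketch of the correlation coefficient as described above: the sum of products of deviations, divided by the square root of the product of the two sums of squared deviations. The function name `pearson_r` and the three small data sets are made up for illustration; they show a perfect positive line, a perfect negative line, and an exact but curved relationship for which r does not help.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient computed from deviations around the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (how X and Y vary together).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (variation in X and in Y separately).
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, chosen only for illustration.
xs = [-2, -1, 0, 1, 2]
rising_line = [2 * x + 1 for x in xs]    # every point on a rising line
falling_line = [-3 * x + 4 for x in xs]  # every point on a falling line
curved = [x ** 2 for x in xs]            # exact relationship, but not linear

r_pos = pearson_r(xs, rising_line)
r_neg = pearson_r(xs, falling_line)
r_curved = pearson_r(xs, curved)
print(r_pos, r_neg, r_curved)
```

In the curved case Y is completely determined by X, yet r comes out at zero, which is the point of the warning about non-linear relationships.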
Lecture 2
- OLS and regression are the same (for now)
Difference of means test (hypothesis)
- Has Rotterdam become safer? (do residents of Rotterdam perceive greater
levels of safety?)
- Across the samples, the average perception of safety is 7.3 in 2010 and
7.5 in 2022.
-> same survey repeated 12 years later, different (random) sample of
residents (everyone has an equal probability to be part of the sample)
-> the most obvious reading is that people feel safer (7.5 > 7.3); however, in
statistics we want to make use of information about the sample to say things
about the population.
- Even if we get a completely random sample and we have a big enough
sample, there is still a probability that the sample is not usable to say
something about the population.
- Sampling distribution of the difference in sample means
(Dutch: steekproefgemiddelde = sample mean)
- For a sample of a fixed size, you can imagine every single possible sample
we could have taken from that population: a very large number.
-> the whole population can be arranged in very many different
combinations of 16,000 people in the sample.
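To make "every possible sample of a fixed size" concrete, here is a small sketch with a toy population. The six scores and the sample size of 3 are hypothetical and chosen so that all samples can be listed; with 16,000 residents drawn from a whole city the same idea holds, but the number of possible samples is far too large to enumerate.

```python
from itertools import combinations
from math import comb

# Hypothetical toy population of safety scores (not the survey data).
population = [5, 6, 7, 7, 8, 9]
sample_size = 3

# Every distinct sample of this size that could be drawn without replacement.
all_samples = list(combinations(population, sample_size))
sample_means = [sum(s) / sample_size for s in all_samples]

n_samples = len(all_samples)
population_mean = sum(population) / len(population)
mean_of_sample_means = sum(sample_means) / n_samples

print(n_samples, comb(len(population), sample_size))
print(population_mean, mean_of_sample_means)
```

The list `sample_means` is the complete sampling distribution of the mean for this toy case: individual samples land above or below the population mean, but their average matches it.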
Hypothesis testing in 6 steps
1. Ensure that assumptions are met.
2. Formulate hypotheses (in plain language, but also in a formula)
3. Determine the critical region from the appropriate sampling distribution.
-> in this case we are talking about the difference in sample-means.
-> how unusual is it to see something like this, given that there is some
randomness in the world? There is a probability that we get one unusual
sample in one year and another unusual sample in the other year.
-> even if there is no difference in the whole population, there can still
be a difference between the samples.
4. Calculate the test statistic
5. Make decision
-> do we reject the null hypothesis or not?
-> we never say that we accept the null hypothesis.
6. State conclusions based on evidence we have.
Step 1: Assumptions
- Assumptions:
-> random samples (every member of the population has an equal probability of
falling into our sample; a truly random sample is very difficult to obtain)
-> independent samples
-> interval-ratio level of measurement (we must be able to subtract one
thing from another)
-> sampling distribution (of the sample means) is normally distributed
- There is a fixed number of possible samples of 16,000 that can be picked
from a population of 700,000
-> sample 1, get the average, sample 2, get the average, sample 3, get
the average, etc.
-> if you plot the averages of all of those samples, their distribution is
(approximately) normal; the same goes for the sampling distribution of the
difference in sample means and the sampling distribution of regression
coefficients
- The sampling distribution of the difference in sample means.
-> is the difference weird enough that we should conclude that there
actually is a difference?
- If a sampling distribution looks like a theoretical probability distribution,
then we can make probability statements about samples taken from
populations.
- Any sample statistic has a sampling distribution, including regression
statistics.
- What is the probability that we would observe a particular sample statistic
(e.g. a sample mean) given this population?
-> how unusual is this? Critical region!
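The idea that two samples can differ even when nothing changed in the population can be checked with a small simulation. This is only a sketch: the population of 20,000 scores on a 1-10 scale, the sample size of 200 and the 2,000 repetitions are all hypothetical choices, and both "years" are drawn from the same population on purpose.

```python
import random
from statistics import stdev

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical population in which nothing changed between the two years.
population = [random.randint(1, 10) for _ in range(20_000)]

def difference_in_sample_means(n):
    """Draw two independent random samples of size n and subtract their means."""
    first = random.sample(population, n)
    second = random.sample(population, n)
    return sum(first) / n - sum(second) / n

differences = [difference_in_sample_means(200) for _ in range(2_000)]

centre = sum(differences) / len(differences)
spread = stdev(differences)
# In a normal distribution about 95% of values lie within 1.96 standard deviations.
share_within = sum(abs(d - centre) <= 1.96 * spread for d in differences) / len(differences)
print(round(centre, 3), round(spread, 3), round(share_within, 3))
```

The differences cluster around zero, and roughly 95% of them fall within 1.96 standard deviations of the centre, which is what a normal sampling distribution would predict. A difference far out in the tail is the kind of "unusual" result the critical region is meant to capture.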
Step 2: formulate hypotheses
- What we think about the world is in H1 (the alternative hypothesis)
- We need a null hypothesis (H0) that includes every other possibility: always
no difference, no effect, the opposite of what we expect, etc.
Step 3: Determine the critical region
- What is the area under the curve from one point to another?
- Say we are willing to be wrong 5% of the time: the amount of weirdness we
accept is 5%.
-> if H0 is actually true, there is a 5% chance that we wrongly reject it.
- Z-distribution (normal)
- Alpha = 0.05 (5%)
- Left-tailed test
- Critical z-value = -1.645 (look it up)
- Decision rule:
-> reject H0 if Z* is less than -1.645 (more extreme)
-> alternatively (and equivalently): reject H0 if p < 0.05
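Instead of looking the critical value up in a table, it can be computed. The sketch below uses `NormalDist` from Python's standard `statistics` module; the helper name `reject_h0` is made up for illustration and implements the left-tailed decision rule stated above.

```python
from statistics import NormalDist

alpha = 0.05
standard_normal = NormalDist(mu=0, sigma=1)

# Left-tailed test: the z-value that cuts off the lowest 5% of the distribution.
z_critical = standard_normal.inv_cdf(alpha)

def reject_h0(z_star, z_crit=z_critical):
    """Left-tailed decision rule: reject H0 when z* lies below the cut-off."""
    return z_star < z_crit

print(round(z_critical, 3))  # -1.645
```

For a right-tailed test the cut-off would be `inv_cdf(1 - alpha)`, and for a two-tailed test alpha is split over both tails.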
Step 4: Calculate test statistic
- Formulas (don’t need to know) for standardizing the data
- Alternatively (if using p-value)
-> instead of asking what value cuts off 5% of the distribution,
-> ask how much of the distribution is cut off by a specific value: that is
the p-value; compare it with the 5%.
-> decision rule: reject H0 if p < 0.05.
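A sketch of steps 4 and 5 using the p-value route. It assumes the usual z-statistic for a difference between two independent sample means (difference divided by its standard error). The means come from the notes; the sample size of 16,000 in both years and the standard deviations of 1.5 are assumptions, so the resulting z* will not match the value from the lecture.

```python
import math
from statistics import NormalDist

def z_difference_of_means(mean_1, mean_2, sd_1, sd_2, n_1, n_2):
    """z-statistic for the difference between two independent sample means."""
    standard_error = math.sqrt(sd_1 ** 2 / n_1 + sd_2 ** 2 / n_2)
    return (mean_1 - mean_2) / standard_error

# Means follow the notes (2010 minus 2022). The standard deviations and the
# equal sample sizes are HYPOTHETICAL, because the notes do not report them.
z_star = z_difference_of_means(7.3, 7.5, sd_1=1.5, sd_2=1.5, n_1=16_000, n_2=16_000)

# Left-tailed p-value: share of the standard normal distribution below z*.
p_value = NormalDist().cdf(z_star)

alpha = 0.05
reject_h0 = p_value < alpha
print(round(z_star, 2), reject_h0)
```

Both routes lead to the same decision: a z* far below the critical value corresponds to a p-value far below 0.05.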
Step 5 and 6: decision and conclusion
- Decision: H0 can be rejected, because -28.5 < -1.645
Bivariate regression
- A line is defined by its slope (by how much Y changes for a one-unit
change in X) and its intercept (the point where the line crosses the Y-axis)
- What is the best line (intercept and slope)?
- Regression equation: Y = a + bX + e
-> minimizes