MAT22306 - Quantitative research methodology and statistics
Lecture 1.1
Data types and distributions:
Variables must be able to vary (have different values), e.g. gender (can be male/female). Male is not a variable, as it
cannot vary; male is a level of a variable.
Types of variables:
Categorical/nominal: there’s no order or magnitude. Solely distinguishes between levels.
Ordinal: distinguishes between levels with a fixed order. Clear order, but no clear magnitude/difference between the values.
Interval: distinguishes between levels and values, with a fixed order and equal distances between the values.
Ratio: distinguishes between levels and values, with a fixed order. Distances are equal, but now there is also a natural zero.
Describing findings of variables:
Categorical: reporting in percentages or frequencies (56 oranges, 60 apples)
Ordinal: reporting in percentages or frequencies.
Interval: infinitely many options (infinite categories). Report with summary measures for central tendency (e.g. the
mean) and the width of the distribution.
Ratio: infinitely many options (infinite categories). Report with summary measures for central tendency and the width
of the distribution.
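A minimal Python sketch (made-up fruit counts) of reporting a categorical variable as frequencies and percentages:

from collections import Counter

fruit = ["orange"] * 56 + ["apple"] * 60        # hypothetical sample
counts = Counter(fruit)                         # frequency per level
n = sum(counts.values())
for level, freq in counts.items():
    print(f"{level}: {freq} ({100 * freq / n:.1f}%)")   # e.g. orange: 56 (48.3%)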
Measures of central tendency:
How do we summarize a group of people with one measure? E.g. describe the typical/average income in a group.
Mode: the most common value. Measure of centrality.
Median: the middle observation (the middle person).
Mean: the arithmetic average.
In a normal distribution, all central tendency measures are the same.
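A small Python sketch (hypothetical incomes) of the three measures; note how one extreme value pulls the mean away from the median:

from statistics import mode, median, mean

income = [1200, 1500, 1500, 1800, 2100, 2400, 9000]   # made-up data
print(mode(income))     # 1500, the most common value
print(median(income))   # 1800, the middle person
print(mean(income))     # ~2785.7, pulled up by the 9000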
Measures of distribution:
Shows the difference/spread in the sample; can be expressed with percentiles or percentage ranges.
Standard deviation: the average distance from the average.
Formula: √( sum of (each individual observation − overall mean)² / total nr of observations ), i.e. the square root of
the average squared difference between each observation and the mean.
Sum of Squares (SS): for every score you have, you calculate the difference from the mean (obs −
mean) and square it, then add all of these up. The more observations, the larger the sum.
Variance: variation around the mean, made independent of the number of observations. Formula:
sum of squares / total number of observations.
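A sketch of SS, variance, and SD exactly as defined above (dividing by n; the observations are made up):

import math

x = [2, 4, 4, 4, 5, 5, 7, 9]                  # made-up observations
m = sum(x) / len(x)                           # overall mean: 5.0
ss = sum((xi - m) ** 2 for xi in x)           # Sum of Squares: 32.0
variance = ss / len(x)                        # SS / nr of observations: 4.0
sd = math.sqrt(variance)                      # square root of variance: 2.0
print(ss, variance, sd)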
Normal distribution notation: N(μ, σ)
Standard normal distribution (z-distribution) notation: N(0, 1), so μ = 0 and σ = 1. → Table in Field, pp. 995-998.
Standard normal distribution: a z-value is the number of standard deviations from the mean. The table gives, per
z-value, how much of the total observations is lower than that z-value.
Rules of thumb normal distribution:
Generally, 50% is lower than the mean.
68% is between +1 and −1 standard deviation, so being within 1 SD of the mean covers about 2/3 of the sample (68%), etc.
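These rules of thumb can be checked with the standard normal CDF (a minimal sketch; scipy assumed to be available):

from scipy.stats import norm

print(norm.cdf(0))                        # 0.5   -> 50% lies below the mean
print(norm.cdf(1) - norm.cdf(-1))         # ~0.68 -> 68% within +/- 1 SD
print(norm.cdf(1.96) - norm.cdf(-1.96))   # ~0.95 -> 95% within +/- 1.96 SD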
Kurtosis: indicates the pointiness (how high the peak is) of the distribution. Three possibilities:
Leptokurtic = very high peak
Mesokurtic = normal
Platykurtic = flattened.
Skewness: lack of symmetry. Can be tricky, as the mean can no longer be used as a central tendency value of the data.
Positive skewness = longer tail towards positive values
Negative skewness = longer tail towards negative values.
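A sketch computing both measures on made-up, positively skewed data (scipy reports kurtosis relative to the normal distribution, so mesokurtic = 0):

from scipy.stats import skew, kurtosis

data = [1, 2, 2, 3, 3, 3, 4, 4, 10]   # made up, longer tail to the right
print(skew(data))       # > 0: positive skewness
print(kurtosis(data))   # excess kurtosis: > 0 leptokurtic, < 0 platykurtic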
Checks for normal distribution/normality:
1) Histogram: does it look like a bell-shaped curve/ND?
2) Boxplot: the median is given, with around it a box containing 50% of all observations. Symmetric in box and
whiskers? The whiskers (the ends) should capture about 95% of the values.
3) Q-Q plot: are the values predicted under normality the same as the observed values (residuals, i.e. differences
from the mean)? Ideally, all points should lie on the straight line.
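A sketch producing all three checks on made-up data (numpy/scipy/matplotlib assumed):

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=100, scale=15, size=200)     # hypothetical sample

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].hist(x, bins=20)                        # 1) bell-shaped curve?
axes[1].boxplot(x)                              # 2) symmetric box and whiskers?
stats.probplot(x, dist="norm", plot=axes[2])    # 3) points on the straight line?
plt.show()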
Fixing non-normality:
Many real-world situations have a lowest possible value of 0, e.g. income, distance, time spent on a task. Then you get
a positively skewed distribution, which is called log-normal. In cases where it makes sense to think about doubling
distances or times (e.g. spending 1 or 2 secs on a task, or 1 or 2 minutes), you can calculate the logarithm of such a
scale. The skewed data then transforms to a normal distribution.
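A sketch of this transform on made-up log-normal data (e.g. task times); the skewness drops to roughly 0 after taking the log:

import numpy as np
from scipy.stats import skew

t = np.random.default_rng(2).lognormal(mean=0, sigma=1, size=1000)
print(skew(t))           # clearly positive: skewed to the right
print(skew(np.log(t)))   # ~0: close to normal after the log transform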
Sample and population:
Population = every case of interest
Sample = part of the population, which we try to generalize to the population at large
Population estimates require random samples. Inferential statistics: making claims about the population based on a sample.
Estimate values for population through sample:
μ: the sample mean (M or 𝑥̅ ) is an estimate for the population mean (μ)
σ: the sample SD (s) is an estimate of the population SD (σ); dividing by n − 1 is a correction for small samples.
The sampling distribution (bell figure) becomes narrower when the sample is larger. Meaning:
the larger the number of observations, the better the sample mean estimates the population mean.
Standard error of the mean (SE): the standard deviation of the sampling distribution. Larger sample, smaller SE.
Estimator formula: sample standard deviation / square root of n, i.e. s/√n.
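A sketch of these estimates on made-up data:

import math

x = [12, 15, 11, 14, 13, 16, 12, 15]                      # made-up sample
n = len(x)
m = sum(x) / n                                            # sample mean M
s = math.sqrt(sum((xi - m) ** 2 for xi in x) / (n - 1))   # sample SD, n-1 correction
se = s / math.sqrt(n)                                     # standard error of the mean
print(m, s, se)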
Lecture 1.2
Sampling distribution: normally distributed around the population mean, with an SD called the standard error (σ/√𝑛).
Standard error = the standard deviation of the sampling distribution.
When one sample falls outside the e.g. 95% range, we conclude it does not belong to H0 (alpha = 0.05). Meaning: it is
unlikely that the sample was drawn from a population that actually has that population mean μ.
Significance only indicates whether there is evidence for a difference, however small; we conclude that something
does not belong to a general population. It says little/nothing about relevance.
Transform data to a z-distribution:
z = (sample mean − population mean) / standard deviation of the sampling distribution, i.e. z = (𝑥̅ − μ) / (σ/√𝑛).
After this transformation, the sample z-values follow the N(0, 1) distribution.
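A sketch of this z transform, with hypothetical population values and a two-sided p-value (scipy assumed):

import math
from scipy.stats import norm

mu, sigma = 100, 15               # assumed (known) population values
m, n = 106, 25                    # hypothetical sample mean and size
se = sigma / math.sqrt(n)         # SD of the sampling distribution: 3.0
z = (m - mu) / se                 # z = 2.0, follows N(0, 1) under H0
p = 2 * (1 - norm.cdf(abs(z)))    # two-sided p-value: ~0.0455
print(z, p)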
Z-distribution vs. t-distribution:
When the population SD is unknown, estimate the SE of the population through the SE of the sample: calculate the
standard error of the sample by taking the sample standard deviation and dividing by the square root of n. The
smaller the sample, the flatter the t-distribution.
Difference in critical values: for 95%, the z-distribution value is always ±1.96. In a t-distribution it depends on the
number of observations, approaching ±1.96 as that number becomes larger. → book pp. 999-1000
Df (degrees of freedom): total number of observations − number of parameters used in the estimation.
The t-distribution has heavier tails and is a bit flatter than the ND (more probability in the extreme ranges). How
flat/heavy the tails are is determined by df. The t-distribution becomes the standard normal (z-)distribution as df
becomes infinite.
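A sketch showing the critical t-value (95%, two-sided) shrinking towards the z-value 1.96 as df grows (scipy assumed):

from scipy.stats import norm, t

for df in (5, 10, 20, 100, 1000):
    print(df, round(t.ppf(0.975, df), 3))   # 2.571, 2.228, 2.086, 1.984, 1.962
print("z:", round(norm.ppf(0.975), 3))      # 1.96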
Assumptions t-distribution:
• Data is measured on interval or ratio scale
• Observations follow the normal distribution
• Based on independent observations.
The more observations (df), the steeper t gets. Especially with a small group (< 20), the t-distribution is really
different from the z-distribution.
Rule of inferential statistics: we can only conclude something at a given confidence, never with 100% certainty. We
decide the confidence.
Type 1 and Type 2 error
Type 1 error: when in reality the null hypothesis is true, but we reject it. Incorrectly concluding something is going
on while it is not.
Type 2 error: something is going on, but we did not see it based on the sample. Beta depends on the effect size, the
number of observations, and alpha (the acceptance level for Type 1 errors).
Problem: the more critical we are about not having false positives (Type 1, alpha), the larger the chance that we miss
something (Type 2, beta), because we then demand more compelling evidence.
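A small simulation sketch of this trade-off (made-up normal data, one-sample t-tests; scipy assumed): a stricter alpha gives fewer Type 1 errors but more Type 2 errors:

import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(3)
for alpha in (0.10, 0.05, 0.01):
    type1 = type2 = 0
    for _ in range(2000):
        null = rng.normal(0.0, 1, 20)    # H0 true: population mean is 0
        real = rng.normal(0.5, 1, 20)    # H0 false: population mean is 0.5
        if ttest_1samp(null, 0).pvalue < alpha:
            type1 += 1                   # false positive
        if ttest_1samp(real, 0).pvalue >= alpha:
            type2 += 1                   # missed a real effect
    print(alpha, type1 / 2000, type2 / 2000)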
In sum:
α (alpha) = critical p-value: if the outcome falls among the most extreme α (e.g. 5%) of samples under the null
hypothesis, we conclude it is probably not part of the null-hypothesis population.
Test statistic = calculated value (z or t). We compare it to a reference point: the critical value (for t, found with the df).
Confidence interval = range in which a specific value is likely to be, with a given confidence. Complement of alpha: 1 − α
Rejection region = outcomes for the test statistic where we conclude H0 is not true (reject H0, support Ha). This lies
outside the e.g. 95% curve: the rejection region consists of the test-statistic outcomes that fall beyond the level of
significance/alpha. So with α = 0.10 and a two-sided hypothesis, the rejection region is 0.10 in total, with 0.05 on the
left side and 0.05 on the right side. One-sided: the full 0.10 on that one side.
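A sketch of the rejection-region bounds on the z-distribution for alpha = 0.10 (scipy assumed):

from scipy.stats import norm

alpha = 0.10
print(norm.ppf(alpha / 2), norm.ppf(1 - alpha / 2))   # two-sided: -1.645, +1.645
print(norm.ppf(1 - alpha))                            # one-sided (right): 1.282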
Rejecting and accepting H0:
Outcome probability > alpha: we accept H0; Ha has not been shown.
Outcome probability ≤ alpha (or equal to): we reject H0; Ha has been shown.
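A sketch of this decision rule with a one-sample t-test on made-up scores (scipy assumed):

from scipy.stats import ttest_1samp

scores = [5.1, 4.8, 5.6, 5.3, 4.9, 5.7, 5.2, 5.5]   # hypothetical sample
result = ttest_1samp(scores, popmean=5.0)            # H0: mu = 5.0
alpha = 0.05
print(result.statistic, result.pvalue)
print("reject H0" if result.pvalue <= alpha else "accept H0, Ha not shown")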
Statistical test-procedure: