STATISTICS 1
Week 1 - Lecture 1 & 2
Statistics is a guessing game.
You never know the parameter/ the truth about the population, you only hope that you are close.
Population = The group that you wish to describe (The entire set of elements)
Sample = The group for which you have data (A subset of elements from the population,
taken with the intention of making inferences about the population)
Why take a Sample?
› Describing the whole population is:
• Too expensive
• Impossible
• Sampling might be destructive
• Impractical
• Unnecessary
Parameter = Numerical property of the population (based on the entire population/ the truth)
Statistic = Numerical property of a sample (based on a statistic)
Sampling error
› A difference between the value of a parameter and the statistic computed to estimate that
parameter
› Result of:
• Variability
• Sampling Bias
• Nonsampling Error
Reducing Sampling Error
› Variability (this Lecture)
- Increase n
› Sampling Bias (this Lecture)
- Design of sampling procedure
› Nonsampling Error
- Validity, Accuracy, Precision of variables
- Prevent coding errors
- Prevent interpretation errors
- Also: good labelling, metadata
➔ You do have control over variability, sampling bias and nonsampling error, you want to
minimalize them.
Variability = The phenomenon whereby repeated sampling from the same population results in
different values for the statistic.
Example; ask 5 students age in course group. Ask again with different 5 students. The difference in
average age. How different?
= variability (size and diversity important). Statistically you want it to be as low as possible, increase
confidence in result. Solution is increase sample size.
1
,Sampling distribution = Describes how the statistic varies when sampling is repeated.
- In other words: describes (extent of) variability
- This is the basis for inference
Central Limit Theorem
Even if a variable X is not normally distributed in the population …
› … we may assume that …
Under certain conditions, such as a large number of cases and a fixed standard deviation σ
› ... the Sampling Distribution of the mean is approximately normal with standard error:
Sampling Bias = Result of procedures which favour the inclusion, in your sample, of elements from
the population with certain characteristics. (make sure you have the right people in your sample)
› Sources of Sampling Bias: (a combination of) the
- population
- researcher
- research design
- research topic
- respondent
› May result in:
- incomplete coverage: relevant elements not in sampling frame
- nonresponse: refusal or missing data
➔ Increasing the sample size increases the problem.
Population, reductant to participate, don’t trust science.
Researcher, are we capable to see population?
Difference between probability and non-probability sample: who is taking the decisions.
2
,Probability samples: driven by chance + reduced sampling bias.
Non-probability samples: researcher is in charge + risk of bias.
Judgemental: handpicked who you research, suitability.
Volunteer: hey I wanna be in your research.
Convenience: laziness, only ask people who are there/queuing> easy and nowhere else to go.
Cluster (random): assumption that you have groups in your population that are similar. Then it
doesn’t really matter who you pick.
Stratified: opposite of cluster, different groups. Maybe different approaches per group.
Systematic (random): population already ordered, example; student numbers. Every 5th person etc.
Simple random: ideal case, perfect list same probability. Clear population + list + randomly selected.
Independent: small population, trick. Independent, keep probability the same to being selected. Take
them out, ask questions, put them back in the group.
Quota: Targets, find me 100 people of this kind, without intend of representative. Just about getting
the numbers. Not representative.
Simple random and convenience difference; most convenient way disregarding the population you
would like to cover. Simple random different approach, work hard to cover population and choose
from that. If lucky; convenience can be representative.
Example Public Transport Bureau = stratification; different groups of commuters. Clustered design in
stratified group possible. Not systematic, cause you leave out all the people without passes.
➔ Exam: which groups do you want to research/ define population and sample, are they
different? Work your way up which strategy you would choose, cover each group.
+ Definitions from the book. Don’t remember formulas. Pick right formula and apply.
Geographic sampling:
- Traverse samples; lines
- Quadrat samples; squares
- Point samples; dots
You want it to be random.
Processing of data
› How to deal with nonresponse
Distinguish:
• Choice of respondent
- Can still be regarded as a value
- “no opinion” still informs about the respondents opinion
- “don’t know” still informs about the reason of nonresponse
• Other causes
- “no answer” does not inform about the position of the respondent
Types of data
Qualitative (Non-numerical values)
› Categories
Quantitative (Numerical values (counts, measurements)
› Discrete; Range of possible values is limited (how many cars do you have, no commas)
› Continuous; Intermittent values are also possible (height, can be specific. Also averages, inhabitants
have an average of .5 cars; variable is number of cars per household, not specifically about cars or
inhabitants anymore.)
3
, Measurement levels
› Nominal
- Categorical, no ranking
› Ordinal
- Categorical, ranked (low-high, bad-good etc.)
- Degrees of a certain phenomenon
- Width of intervals unknown
› Ratio (& Interval) = scale in SPSS
- Width of intervals known (= equidistance)
- We can compute differences
Interval and ratio difference; ratio has a natural/absolute/true zero point.
Example; Celsius = interval (below zero no absence of temperature) and Kelvin = ratio.
Example grey colours: ordinal.
Example countries: nominal.
Example German political parties: nominal. Variable more specific; number of seats/ degree of
conservativeness makes it different.
Example satisfaction: ordinal. Opinion, width unknown.
Binary variables (a.k.a.: Dummy, or Boolean) (rules out the measurement levels = nominal)
› Two possible values: True or not true, yes or no, 1 or 0, agree or disagree.
› Special case of a nominal variable: Mean = proportion of “1”. > Possibility to calculate useful
average!
Choose suitable variables and measurement levels.
Exploratory Data Analysis
› Study data in order to describe key properties
- What do you see?
› For each variable
- Diagrams and / or tables
- Numerical summaries of distributions
› No single best way of doing EDA
- BUT: the starting point of any decent quantitative analysis!
Distributions (> quality control, does the variable do what it is supposed to do)
› Shape
› Center
› Spread
4