Summary of all the lectures, computer practicals and seminars of Statistics 1 in the year 2022/2023. This includes personal notes as well as examples and figures from the PowerPoint presentations.
Statistics is a guessing game.
You never know the parameter (the truth about the population); you can only hope that your estimate is close.
Population = The group that you wish to describe (The entire set of elements)
Sample = The group for which you have data (A subset of elements from the population,
taken with the intention of making inferences about the population)
Why take a Sample?
› Describing the whole population is often:
• Too expensive
• Impossible
• Destructive (measuring may destroy the element)
• Impractical
• Unnecessary
Parameter = Numerical property of the population (based on the entire population/ the truth)
Statistic = Numerical property of a sample (based on the sample only, used as an estimate of the parameter)
Sampling error
› A difference between the value of a parameter and the statistic computed to estimate that
parameter
› Result of:
• Variability
• Sampling Bias
• Nonsampling Error
➔ You do have control over variability, sampling bias and nonsampling error; you want to
minimize them.
Variability = The phenomenon whereby repeated sampling from the same population results in
different values for the statistic.
Example: ask 5 students in a course group for their age and compute the average. Ask again with 5
different students and compute the average again. How different are the two averages?
= variability (sample size and diversity of the population are important). Statistically you want it to be
as low as possible, because that increases confidence in the result. The solution is to increase the sample size.
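The age example can be tried out in a small simulation. This is only a sketch: the population of 200 student ages is invented, and `sample_means` is a hypothetical helper name. It repeats the sampling many times and compares how much the sample mean varies for 5 versus 50 students.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population: ages of 200 students in a course group.
population_ages = [random.randint(18, 30) for _ in range(200)]


def sample_means(population, sample_size, repeats=1000):
    """Draw `repeats` random samples and return the mean of each one."""
    return [
        statistics.mean(random.sample(population, sample_size))
        for _ in range(repeats)
    ]


# The standard deviation of the sample means measures the variability.
spread_small = statistics.stdev(sample_means(population_ages, 5))
spread_large = statistics.stdev(sample_means(population_ages, 50))

print(f"Spread of the sample mean, n = 5:  {spread_small:.2f}")
print(f"Spread of the sample mean, n = 50: {spread_large:.2f}")
```

The spread for n = 50 should come out clearly smaller than for n = 5, which is the "increase sample size" solution in action.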
Sampling distribution = Describes how the statistic varies when sampling is repeated.
- In other words: describes (extent of) variability
- This is the basis for inference
Central Limit Theorem
Even if a variable X is not normally distributed in the population …
› … we may assume that, under certain conditions (such as a large number of cases n and a fixed standard deviation σ) …
› … the sampling distribution of the mean is approximately normal with standard error σ / √n.
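The Central Limit Theorem can be checked with a simulation. This is a sketch under assumptions: the standard error is taken to be the usual σ / √n, and the skewed population is an exponential distribution with rate 1 (so σ = 1), chosen only for illustration.

```python
import math
import random
import statistics

random.seed(2)  # fixed seed so the sketch is reproducible

sigma = 1.0     # standard deviation of an exponential distribution with rate 1
n = 100         # number of cases per sample
repeats = 2000  # how often the sampling is repeated

# Each entry is the mean of one sample of n values from a skewed population.
means = [
    statistics.mean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(repeats)
]

observed_se = statistics.stdev(means)
theoretical_se = sigma / math.sqrt(n)

print(f"Mean of the sample means: {statistics.mean(means):.3f}")
print(f"Observed standard error:  {observed_se:.3f}")
print(f"sigma / sqrt(n):          {theoretical_se:.3f}")
```

The observed spread of the sample means should land close to σ / √n, even though the population itself is far from normal.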
Sampling Bias = Result of procedures which favour the inclusion, in your sample, of elements from
the population with certain characteristics. (make sure you have the right people in your sample)
› Sources of Sampling Bias: (a combination of) the
- population
- researcher
- research design
- research topic
- respondent
› May result in:
- incomplete coverage: relevant elements not in sampling frame
- nonresponse: refusal or missing data
➔ Increasing the sample size does not solve this problem; a larger biased sample only makes the biased result look more certain.
Population: reluctant to participate, do not trust science.
Researcher: are we able to see (reach) the whole population?
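Incomplete coverage can be illustrated with a sketch. Everything here is invented for illustration: a population of 10,000 travel times, of which 4,000 people with longer travel times are missing from the sampling frame. The sample is drawn at random, but only from the frame.

```python
import random
import statistics

random.seed(3)  # fixed seed so the sketch is reproducible

# Hypothetical population: travel times in minutes for two groups.
covered = [random.gauss(20, 5) for _ in range(6000)]      # in the sampling frame
not_covered = [random.gauss(40, 5) for _ in range(4000)]  # missing from the frame
population = covered + not_covered
true_mean = statistics.mean(population)

# Sample only from the incomplete frame, with a small and a large sample.
biased_means = {}
for n in (50, 5000):
    biased_sample = random.sample(covered, n)
    biased_means[n] = statistics.mean(biased_sample)
    print(f"n = {n}: sample mean {biased_means[n]:.1f}, population mean {true_mean:.1f}")
```

Both sample means stay well below the population mean. The larger sample reduces variability, but the bias from the incomplete frame remains.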
Difference between probability and non-probability samples: who makes the selection decisions (chance or the researcher).
Probability samples: driven by chance + reduced sampling bias.
Non-probability samples: researcher is in charge + risk of bias.
Judgemental: the researcher handpicks who is researched, based on suitability.
Volunteer: people offer themselves ("hey, I want to be in your research").
Convenience: laziness; only ask people who happen to be there (e.g. people queuing), because it is easy and they have nowhere else to go.
Cluster (random): assumes that the population consists of groups that are similar to each other. Then it
doesn’t really matter which groups you pick.
Stratified: opposite of cluster; the groups differ from each other. Maybe a different approach per group.
Systematic (random): population is already ordered, for example by student number. Select every 5th person etc.
Simple random: ideal case; a perfect list in which everyone has the same probability of selection. Clear population + list + random selection.
Independent: trick for a small population. To keep the probability of being selected the same for every draw, take
elements out, ask the questions, and put them back in the group (sampling with replacement).
Quota: targets ("find me 100 people of this kind"), without the intent of being representative. Just about getting
the numbers. Not representative.
Simple random versus convenience: convenience takes the most convenient way, disregarding the population you
would like to cover. Simple random takes a different approach: work hard to cover the population and choose
randomly from that. If you are lucky, a convenience sample can be representative.
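A few of the probability designs can be sketched in code. The sampling frame below is hypothetical (100 student numbers, each with an invented study year used as stratum), and the sample sizes are arbitrary choices for illustration.

```python
import random

random.seed(4)  # fixed seed so the sketch is reproducible

# Hypothetical sampling frame: 100 student numbers with a study year (stratum).
frame = [{"student": i, "year": 1 + i % 3} for i in range(1, 101)]

# Simple random: every element on the list has the same probability of selection.
simple_random = random.sample(frame, 10)

# Systematic (random): random starting point, then every k-th element of the ordered list.
k = 10
start = random.randrange(k)
systematic = frame[start::k]

# Stratified: split the frame into groups first, then sample randomly within each group.
stratified = []
for year in (1, 2, 3):
    stratum = [s for s in frame if s["year"] == year]
    stratified.extend(random.sample(stratum, 3))

# Independent draws: sampling with replacement, so the same element can be picked twice.
with_replacement = random.choices(frame, k=10)

print([s["student"] for s in simple_random])
print([s["student"] for s in systematic])
print([(s["student"], s["year"]) for s in stratified])
print([s["student"] for s in with_replacement])
```

The non-probability designs (judgemental, volunteer, convenience, quota) are not sketched here, because in those the selection is made by people rather than by a random number generator.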
Example Public Transport Bureau = stratification; different groups of commuters. A clustered design within a
stratified group is possible. Not systematic, because you would leave out all the people without passes.
➔ Exam: which groups do you want to research? Define population and sample; are they
different? Work your way up to the strategy you would choose and cover each group.
+ Definitions from the book. No need to memorize formulas; pick the right formula and apply it.
Geographic sampling:
- Traverse samples; lines
- Quadrat samples; squares
- Point samples; dots
You want it to be random.
Processing of data
› How to deal with nonresponse
Distinguish:
• Choice of respondent
- Can still be regarded as a value
- “no opinion” still informs about the respondent’s opinion
- “don’t know” still informs about the reason for nonresponse
• Other causes
- “no answer” does not inform about the position of the respondent
Types of data
Qualitative (Non-numerical values)
› Categories
Quantitative (Numerical values: counts, measurements)
› Discrete; Range of possible values is limited (how many cars do you have; no decimals)
› Continuous; Intermediate values are also possible (height, can be specific. Also averages: inhabitants
have an average of 0.5 cars; the variable is then the number of cars per household, not specifically about cars or
inhabitants anymore.)
Measurement levels
› Nominal
- Categorical, no ranking
› Ordinal
- Categorical, ranked (low-high, bad-good etc.)
- Degrees of a certain phenomenon
- Width of intervals unknown
› Ratio (& Interval) = scale in SPSS
- Width of intervals known (= equidistance)
- We can compute differences
Interval versus ratio: ratio has a natural/absolute/true zero point.
Example; Celsius = interval (0 °C is not the absence of temperature) and Kelvin = ratio.
Example grey colours: ordinal.
Example countries: nominal.
Example German political parties: nominal. A more specific variable, such as number of seats or degree of
conservativeness, changes the measurement level.
Example satisfaction: ordinal. Opinion, width unknown.
Binary variables (a.k.a. dummy or Boolean); measurement level = nominal
› Two possible values: True or not true, yes or no, 1 or 0, agree or disagree.
› Special case of a nominal variable: with 0/1 coding, the mean = proportion of “1”. > Possibility to calculate a useful
average!
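A quick check of the "mean = proportion" property, using made-up answers to a yes/no question coded as 1 and 0:

```python
# Hypothetical answers to a yes/no question, coded 1 = agree, 0 = disagree.
answers = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]

# The mean of a 0/1 variable is the share of cases with value 1.
proportion_agree = sum(answers) / len(answers)

print(proportion_agree)  # 0.7, i.e. 70% agree
```

For a nominal variable with more than two categories (for example countries coded 1 to 5) the same calculation would give a meaningless number, which is why the binary case is special.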
Choose suitable variables and measurement levels.
Exploratory Data Analysis
› Study data in order to describe key properties
- What do you see?
› For each variable
- Diagrams and / or tables
- Numerical summaries of distributions
› No single best way of doing EDA
- BUT: the starting point of any decent quantitative analysis!
Distributions (> quality control, does the variable do what it is supposed to do)
› Shape
› Center
› Spread
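Numerical summaries of center and spread, plus a rough hint about shape, can be computed with Python's standard `statistics` module. The commuting distances below are invented for illustration; this is one possible sketch, not the procedure used in the computer practicals (those use SPSS).

```python
import statistics

# Hypothetical variable: commuting distance in km for 12 respondents.
distances = [2, 3, 3, 4, 5, 5, 6, 7, 8, 10, 15, 40]

# Center
center_mean = statistics.mean(distances)
center_median = statistics.median(distances)

# Spread
spread_sd = statistics.stdev(distances)
q1, q2, q3 = statistics.quantiles(distances, n=4)
iqr = q3 - q1

# Shape (rough hint only): a mean well above the median suggests a right-skewed distribution.
right_skew_hint = center_mean > center_median

print(f"mean = {center_mean}, median = {center_median}")
print(f"standard deviation = {spread_sd:.2f}, IQR = {iqr:.2f}")
print(f"min = {min(distances)}, max = {max(distances)}, right-skew hint: {right_skew_hint}")
```

The single value of 40 pulls the mean well above the median. This is also the kind of value the quality-control question is about: is it a real long commute or a data entry error?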