Statistics 1
Tutorial 1 – population, sample, variables,
frequency tables
Smoking during pregnancy can cause problems for mother and child, including adverse outcomes such as low birth weight.
Research question (RQ): the question that we want to answer.
Example: Do children whose mothers smoked during pregnancy often have a lower birth weight?
Population: every member of a group (persons, objects, etc.) for which we would like to collect information.
Example: all pregnant women in the Netherlands in 2017
Sample: the part of the population that we study and collect information from. Studying the whole population is
often too expensive or time-consuming, so we draw a sample.
Example: selection of pregnant women in the Netherlands in 2017
Random selection procedure: used so that the sample is representative of the whole population
Units: the elements of a sample from which we collect the information.
Example: pregnant women
Variable: a measured property of an element of the sample, e.g. height, weight, hair colour.
Examples: weight at birth of baby, education level of pregnant women
Quantitative variable: (continuous/discrete)
- Height, weight at birth, yield (continuous)
- Number of children in household, number of diseased plants in a field, number of cigarettes smoked each day
by pregnant women (discrete). Discrete values are whole counts; you can’t have halves.
Qualitative variable: (nominal/ordinal)
- Hair colour, bachelor program, province, place of residence (nominal). Categories that you can record, but
cannot put in order or rank.
- Grade of eggs, highest level of education completed, annual salary in classes (ordinal). Categories that you can rank.
Exercises:
1.1: Height (continuous), weight (continuous), eye colour (nominal), sex (nominal), hair colour (nominal), number of siblings
(discrete), head circumference (continuous). Babies are the units.
1.2: A: Units: full-grown cows of a certain breed. Quantitative variables: weight
B: Units: newborn twins. Quantitative: height, weight. Qualitative: sex
Drawing a sample from a population
We want to draw conclusions about the population, so the sample should be representative of the population.
Sampling bias: certain parts of the population are overrepresented compared to other parts. Example:
polls for the US election in which Obama was competing for the presidency. Pollsters called people and asked who they would vote for,
and they got it wrong: they contacted people through home phones, which 23% didn’t have. The research only reached people with a
landline, who were on average older, wealthier and more conservative, and so more likely to vote for the other party.
Therefore: sampling bias through the landline.
Recommended sampling methods:
1. Simple Random Sampling (SRS): units are drawn at random from the population. Every unit in the population has
the same probability of ending up in your sample. Example: drawing 4 of 20 business cards from a box, as in a lottery.
Sampling bias is avoided by giving every unit an equal probability of being selected.
Ground rule: every possible sample of a given size should have the same chance of being the sample you draw.
Exam question: 1) not a random sample, as the apples in different crates do not all have the same
probability of being picked. 2) still not a simple random sample. So neither is a simple random sample.
Every possible sample of a certain size must have the same chance of being drawn, and every
object/subject in the population must have the same chance of ending up in the sample.
Not the case? Then it is not an SRS.
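The business-card example can be sketched in Python. This is only a minimal illustration: the card labels 1 to 20, the seed and the variable names are made up for the example. `random.sample` draws without replacement, so every set of 4 cards is equally likely to come out.

```python
import random

# Hypothetical population: 20 business cards, labelled 1..20.
cards = list(range(1, 21))

# Fixed seed only so the example is reproducible; leave it out for a real draw.
rng = random.Random(2017)

# Simple random sample: 4 cards drawn without replacement.
srs = rng.sample(cards, k=4)
print(srs)
```

Each card can appear at most once, and no card is favoured over another, which is what the ground rule asks for.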
Other things that could go wrong in sampling:
1. Undercoverage (undersampling): certain groups are excluded from the sample, e.g. all women who did not give
birth in a hospital, because the sample was drawn from a hospital list of women who gave birth there.
2. Non-response: selected units do not participate, or cannot be contacted.
3. Voluntary participation (in a survey): might result in particularly positive or negative answers. Example: a survey
handed out in a restaurant; people who are very satisfied or very dissatisfied are more likely to participate than average people.
4. Response bias: e.g. social desirability bias (self-reported personal traits, questions about income, mental health, alcohol)
Observational research
Observational: observe the unit/process without influencing it (looking, feeling, etc.)
Example: studying the consequences of smoking during pregnancy is an observational study; you can’t run an experiment on that.
Strictly speaking, you can’t draw hard cause-effect conclusions from it. The apparent effect of smoking can also be due to
confounding (external factors).
Experimental research
Experimental: apply a treatment to the unit in order to observe a reaction.
Example: randomization over 20 experimental plots: you randomly assign one wheat variety to 10 of the plots and another
variety to the other 10. Then you determine what the difference is.
A cause-effect relationship can only be concluded from an experimental study: here you change only what you
want to investigate, while the other factors stay the same. This gives you the opportunity to conclude a causal effect.
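The random assignment in the wheat example could look like the sketch below. The plot numbers, the names "variety A" and "variety B" and the seed are assumptions for illustration; the notes only say that 10 of the 20 plots get one variety and 10 get the other.

```python
import random

# Hypothetical plot numbers 1..20.
plots = list(range(1, 21))

# Fixed seed only for reproducibility of the example.
rng = random.Random(1)

# Shuffle the plots, then split: chance alone decides which plot gets which variety.
shuffled = plots[:]
rng.shuffle(shuffled)
variety_a = sorted(shuffled[:10])
variety_b = sorted(shuffled[10:])

print("Variety A plots:", variety_a)
print("Variety B plots:", variety_b)
```

Because the split is random, other factors (soil, shade, moisture) are on average spread evenly over both groups, which is what allows a causal conclusion.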
Exercises:
1.3: A: households with welfare support in a particular city
B: 400 households from that city with welfare support
C: Welfare support, number of children, living in the city
D: Observational study
Frequencies:
Frequency is how often something occurs, e.g. the number of women in each category in the pregnancy example.
Frequencies are used for qualitative variables, and are also applicable to discrete variables with a limited number of outcomes.
For comparisons, it is better to use the relative frequency: the frequency divided by the total number of
observations. This gives a fraction, e.g. 172/945 = 0.18, so 18% of the women had primary education.
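The relative-frequency calculation can be written out as a small sketch. Only the numbers 172 and 945 come from the example above; lumping the remaining 773 women into one "other levels" category is an assumption made so that the table adds up.

```python
# Frequency table: 172 of 945 women had primary education (from the notes).
# The second category is simply the remainder, grouped for illustration.
frequencies = {
    "primary education": 172,
    "other levels (combined)": 945 - 172,
}

total = sum(frequencies.values())

# Relative frequency = frequency / total number of observations.
relative = {level: count / total for level, count in frequencies.items()}

for level, fraction in relative.items():
    print(f"{level}: {frequencies[level]} of {total} = {fraction:.2f} ({fraction:.0%})")
```

The relative frequencies always add up to 1 (100%), which makes groups of different sizes comparable.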
Tutorial 2 – numerical summary of data &
probability
PART A: Numerical data
Eating too much salt is not healthy. To avoid eating too much salt, we need to know how much of it is in
our food. Therefore: investigating salt in bread.
Population: all loaves of bread sold in the Netherlands (on one particular day)
Units: bread loaves
Variable: amount of salt (g/100g) in bread
Sampling design: 1) draw a simple random sample from all supermarkets and bakeries. 2) Draw one loaf of bread at random from each
of the selected supermarkets/bakeries. This is a two-stage cluster sample, not an SRS.
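The two stages can be sketched as follows. The shops, the loaf labels, the number of shops drawn and the seed are all hypothetical; the sketch only shows the mechanics of the design.

```python
import random

rng = random.Random(0)  # fixed seed only for reproducibility

# Hypothetical population: 6 shops, each selling 3 loaves.
shops = {
    f"shop_{s}": [f"shop_{s}_loaf_{i}" for i in range(1, 4)]
    for s in range(1, 7)
}

# Stage 1: simple random sample of shops.
selected_shops = rng.sample(sorted(shops), k=3)

# Stage 2: one loaf drawn at random within each selected shop.
loaves = [rng.choice(shops[shop]) for shop in selected_shops]
print(loaves)
```

Two loaves from the same shop can never end up in the sample together, so not every possible set of loaves has the same chance of being drawn. That is why this design is not an SRS of loaves.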
Note: for the exam, know whether a given sample is an SRS or not. You don’t need to remember the
specific study designs.
Central tendencies (data): use the mean or median:
Mean: the average: add up all the numbers and divide by the number of values. A bar over the symbol (e.g. x̄) indicates the mean.
Median: the middle value when you order all observations from small to large. 50% of observations are smaller,
50% of observations are larger. Also called the 50th percentile.
With an even number of observations, you take the two middle values and calculate their average, e.g. 5.5.
Symmetric distribution: the difference between the mean and the median is small.
Asymmetric distribution: the difference between the mean and the median is large.
This is due to the effect of outliers: the median is not sensitive to outliers, while the mean is very sensitive to them.
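A small sketch with Python’s standard `statistics` module shows the effect of an outlier on the mean and the median. The three data sets are made up for illustration and are not from the course material.

```python
import statistics

# Made-up data sets for illustration.
symmetric = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 50]   # same data, but the largest value is an outlier
even_count = [4, 5, 6, 7]         # even number of observations

mean_sym = statistics.mean(symmetric)
median_sym = statistics.median(symmetric)

mean_out = statistics.mean(with_outlier)
median_out = statistics.median(with_outlier)

# With an even number of values, the median is the average of the two middle ones.
median_even = statistics.median(even_count)

print("symmetric:   ", mean_sym, median_sym)
print("with outlier:", mean_out, median_out)
print("even count:  ", median_even)
```

The outlier pulls the mean far away from the bulk of the data, while the median stays where it was.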
Measures of variability:
How are your observations distributed around the mean? Indication of the spread of the data through
measures: standard deviation & range.
Standard deviation: the square root (√) of the variance. Study this from the book.
Variance: take the difference between each observation and the mean and square it; otherwise the
positive and negative differences would cancel each other out. The squared differences are added up
and divided by n-1.
To get the standard deviation, take the square root of the variance. To get the variance from the SD,
square the standard deviation.
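The steps above can be followed by hand in a short sketch. The data values are made up; the standard-library functions `statistics.variance` and `statistics.stdev` use the same n-1 divisor and serve as a check.

```python
import math
import statistics

# Made-up observations for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)
mean = sum(data) / n

# Plain differences cancel out: their sum is zero.
plain_diff_sum = sum(x - mean for x in data)

# Squared differences do not cancel; add them up and divide by n - 1.
squared_diffs = [(x - mean) ** 2 for x in data]
variance = sum(squared_diffs) / (n - 1)

# Standard deviation is the square root of the variance.
sd = math.sqrt(variance)

print("variance:", variance, "check:", statistics.variance(data))
print("sd:      ", sd, "check:", statistics.stdev(data))
```

Squaring `sd` gives back the variance, which is the conversion described in the last sentence above.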
Interquartile range IQR = Q3 – Q1
Put all observations in order from low to high.
Q1 = 1st quartile = 25th percentile = lower quartile
Q2 = 2nd quartile = 50th percentile = the median
Q3 = 3rd quartile = 75th percentile = upper quartile
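As a rough sketch, the quartiles and the IQR can be computed with `statistics.quantiles` from the standard library. The data are made up, and note that several conventions for computing quartiles exist: this function’s default ("exclusive") method may give slightly different values for Q1 and Q3 than the rule used in the book.

```python
import statistics

# Made-up observations; sorting mirrors "put all observations in order".
data = sorted([7, 2, 10, 4, 9, 4, 12, 5, 8])

# n=4 splits the data into four parts and returns the three cut points.
q1, q2, q3 = statistics.quantiles(data, n=4)

# Interquartile range: the spread of the middle 50% of the observations.
iqr = q3 - q1

print("Q1:", q1, "Q2 (median):", q2, "Q3:", q3)
print("IQR:", iqr)
```

Like the median, the IQR ignores the extreme values, so it is much less sensitive to outliers than the range or the standard deviation.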