Tutorial 1
MAT-15303
Population, sample,
variables, frequency table
Research questions (examples):
Do children whose mothers smoked during pregnancy often have a lower birth weight?
Do women with less education smoke more often during pregnancy than women with more education?
Population (example):
All pregnant women in the Netherlands in 2017
Sample (example)
Selection of pregnant women in the Netherlands in 2017
Definitions of population and sample
Population: every member of a group (persons, objects, etc.) about which we would
like to collect information
Sample: part of the population that we will study and collect information for
Why sample?
It is too expensive or time-consuming to study all pregnant women, so we
draw a sample.
Units and variables
Units:
The elements of a sample about which we are gathering information
In general: humans, animals, plants, fields…
Here: pregnant women
Variable:
Measured property of an element of the sample
In general: height, weight, hair colour, yield…
Here: highest level of education completed by the pregnant woman
Here: weight at birth of a baby
Variables
Quantitative variable (continuous/discrete)
Height, weight at birth, yield (c)
Number of children in a household, number of diseased plants in a field,
number of cigarettes each day for a pregnant woman (d)
Qualitative variable (nominal/ordinal)
Hair colour, bachelor program, province, place of residence (n)
Grade of eggs (AA/A/B), highest level of education completed, annual salary
(in ordered categories) (o)
Drawing a sample from a population
How do we draw a sample in order to be able to base conclusions for the
whole population on the results from this sample?
(Sampling) bias: certain parts of the population might be overrepresented as
compared to other parts
Good/recommended way for sampling
Simple Random Sampling (SRS)
In SRS, units are drawn at random from a population. Every sample (of a
certain size) has an equal chance of being selected (and every unit from the
population has the same chance of being selected into the sample).
Example: blindly draw 4 business cards from a box of 20
Example: from all women who smoked during pregnancy we draw an SRS (e.g.
by assigning a number to each woman and randomly drawing 100 of these
numbers)
SRS avoids sampling bias by selecting elements from a population at random (and
with equal chances to be selected).
SRS is a relatively simple/easy way to draw a sample.
SRS: every sample of size x has the same chance of being selected
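The two SRS examples above can be sketched in a few lines of Python. This is only an illustration: the helper name draw_srs and the frame of 500 numbered women are assumptions (the notes do not give a population size). The standard-library function random.sample draws without replacement so that every subset of the requested size is equally likely.

```python
import random

def draw_srs(units, n, seed=None):
    """Draw a simple random sample of n units, without replacement."""
    rng = random.Random(seed)      # seed only makes the draw repeatable
    return rng.sample(units, n)

# Example 1: blindly draw 4 business cards from a box of 20.
cards = list(range(1, 21))
drawn_cards = draw_srs(cards, 4, seed=1)

# Example 2: number all women and draw 100 numbers at random.
# The frame size of 500 is a hypothetical choice for this sketch.
women = list(range(1, 501))
sample = draw_srs(women, 100, seed=1)

print(sorted(drawn_cards))
print(len(sample), "women selected, all distinct:", len(set(sample)) == 100)
```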
Undersampling
Certain groups are excluded from the sample (e.g. all women that did not give birth in
the hospital)
Non-response
Not participating, or not successfully contacted
Voluntary participation (in a survey)
Might result in particularly positive or negative answers
Response bias
Social desirability bias (self-reported personal traits, questions about income)
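The effect of excluding a group can be made concrete with made-up numbers. The sketch below assumes a hypothetical population of 150 women split into a hospital group and a home group; all counts are invented purely for the arithmetic and are not data from the course.

```python
# Synthetic illustration only: 1 = smoked during pregnancy, 0 = did not.
# All counts below are invented for this example.
hospital_births = [1] * 30 + [0] * 70   # 100 women
home_births = [1] * 5 + [0] * 45        # 50 women
population = hospital_births + home_births

def proportion(values):
    """Share of ones in a list of 0/1 values."""
    return sum(values) / len(values)

true_share = proportion(population)          # based on all 150 women
biased_share = proportion(hospital_births)   # home births are excluded

print(f"whole population: {true_share:.3f}")
print(f"hospital births only: {biased_share:.3f}")
```

Even if every hospital birth were included, the result would still differ from the population value, because the excluded group differs from the included one.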
Observational vs experimental research
Observational study: observe the unit/process without influencing it
A study of the consequences of smoking during pregnancy is an observational
study
Experimental study: apply a treatment to the unit in order to observe a reaction
Assume we have 20 (similar) experimental plots in a field. After
randomization, 10 plots are assigned to wheat variety A; on the remaining 10
plots we sow wheat variety B. Determine the difference in yield between
variety A and B.
A cause-effect relationship can only be concluded from an experimental study.
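The randomization step in the wheat example could look as follows. This is a minimal sketch: the function name randomize_plots and the plot numbering 1 to 20 are assumptions made for illustration.

```python
import random

def randomize_plots(n_plots=20, seed=None):
    """Randomly split plots 1..n_plots into two equally sized groups."""
    rng = random.Random(seed)
    plots = list(range(1, n_plots + 1))
    rng.shuffle(plots)                 # random order of all plots
    half = n_plots // 2
    return {
        "A": sorted(plots[:half]),     # first half gets variety A
        "B": sorted(plots[half:]),     # the rest gets variety B
    }

assignment = randomize_plots(seed=42)
print("Variety A on plots:", assignment["A"])
print("Variety B on plots:", assignment["B"])
```

Because chance alone decides which plot gets which variety, systematic differences between plots (soil, shade) cannot line up with the treatment, except by coincidence.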
Qualitative variables - table - (relative) frequency
Also applicable to discrete variables with a limited number of outcomes
Frequency table for the highest level of education completed for pregnant women.
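A frequency table with relative frequencies can be computed as sketched below. The ten observations and the category labels (low/middle/high) are hypothetical; they only show the mechanics and are not the course data.

```python
from collections import Counter

def frequency_table(values, order):
    """Return (category, frequency, relative frequency) per category."""
    counts = Counter(values)
    n = len(values)
    return [(cat, counts[cat], counts[cat] / n) for cat in order]

# Hypothetical ordered categories and observations.
levels = ["low", "middle", "high"]
education = ["low", "high", "middle", "middle", "high",
             "middle", "low", "middle", "high", "middle"]

table = frequency_table(education, levels)
for category, freq, rel_freq in table:
    print(f"{category:<8}{freq:>4}{rel_freq:>8.2f}")
```

The relative frequencies are the frequencies divided by the sample size n, so they always add up to 1.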
Tutorial 2
MAT-15303
Numerical summary of data:
measures of centre and dispersion, probability,
law of large numbers, consistency
Units, variables, study design
Assume we are investigating salt in bread
Population: all loaves of bread, sold in the Netherlands (on one particular day)
Units: loaves of bread
Variable: amount of salt (g/100 g) in bread
Sampling design: draw a simple random sample from all supermarkets and bakeries,
then draw one loaf of bread at random from each of the selected
supermarkets/bakeries (two-stage cluster sample, not an SRS!)
Sampling strategies are often much more complex than SRS; the two-stage
cluster sample above is an example. (This is not relevant for the exam, but
you have to be able to judge whether a sample is an SRS or not.)
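The two-stage design described above can be sketched as follows. The shop names, loaf labels and the function two_stage_sample are all hypothetical; only the two steps (random shops first, then one random loaf per shop) come from the notes.

```python
import random

def two_stage_sample(shops, n_shops, seed=None):
    """Stage 1: SRS of shops. Stage 2: one random loaf per chosen shop."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(shops), n_shops)
    return {shop: rng.choice(shops[shop]) for shop in chosen}

# Hypothetical sampling frame: shop -> loaves on sale that day.
shops = {
    "supermarket_1": ["white", "wholemeal", "multigrain", "rye"],
    "supermarket_2": ["white", "wholemeal"],
    "bakery_1": ["sourdough", "wholemeal", "white"],
    "bakery_2": ["spelt"],
}

picked = two_stage_sample(shops, n_shops=2, seed=3)
for shop, loaf in picked.items():
    print(shop, "->", loaf)
```

This is not an SRS of loaves: two loaves from the same shop can never end up in the sample together, so not every sample of a given size is equally likely.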
Measures for centre:
Mean: $\bar{y} = \frac{y_1 + y_2 + \dots + y_n}{n} = \frac{1}{n}\sum_{i=1}^{n} y_i$
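As a quick check of the mean formula, a minimal sketch with five hypothetical salt measurements (g/100 g); the numbers are invented for illustration.

```python
def mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)

# Hypothetical salt content (g/100 g) of five loaves.
salt = [1.0, 1.25, 0.75, 1.5, 1.0]

y_bar = mean(salt)
print("n =", len(salt), " sum =", sum(salt), " mean =", y_bar)
```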
Median M:
Order the data from the smallest to the largest
The median is the midpoint: the value for which at most 50% of the observations
are smaller and at most 50% of the observations are larger (also called the 50th
percentile)
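The median can be computed by sorting and taking the middle value. For an even number of observations the sketch below averages the two middle values; that is a common convention and an assumption here, since the notes do not state which value to pick in that case.

```python
def median(values):
    """Middle value of the ordered data (average of the two middle
    values when the number of observations is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_case = median([5, 1, 3])        # ordered: 1, 3, 5
even_case = median([4, 1, 3, 2])    # ordered: 1, 2, 3, 4
print(odd_case, even_case)
```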
Measures of variability
Standard deviation = s = √variance
Variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y})^2$
Measure for variability:
Range = max - min
Sx = sample sd: $s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}}$
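The variance, standard deviation and range can be checked by hand on a tiny data set. The sketch below uses the hypothetical observations 1, 2, 3, 4, 5 and follows the sample formulas with divisor n - 1.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    y_bar = sum(values) / n
    return sum((y - y_bar) ** 2 for y in values) / (n - 1)

def sample_sd(values):
    """Standard deviation = square root of the variance."""
    return math.sqrt(sample_variance(values))

data = [1, 2, 3, 4, 5]              # hypothetical observations
variance = sample_variance(data)    # squared deviations 4+1+0+1+4 = 10, 10/4
sd = sample_sd(data)
data_range = max(data) - min(data)

print("variance:", variance)
print("sd:", round(sd, 4))
print("range:", data_range)
```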
Interquartile range IQR = Q3 – Q1
Q1 = 1st quartile = 25th percentile = lower quartile
Q3 = 3rd quartile = 75th percentile = upper quartile
The interquartile range is not sensitive to outliers, in contrast to the variance
(and therefore also in contrast to the standard deviation)
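The difference in sensitivity to outliers can be seen by changing one observation. The data below are hypothetical. Quartiles are computed with statistics.quantiles and method="inclusive"; several quartile conventions exist, and the one used in the course may differ, so treat the exact quartile values as an assumption.

```python
import statistics

def iqr(values):
    """Interquartile range Q3 - Q1 (inclusive quartile convention)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 90]   # largest value replaced

for name, data in [("clean", clean), ("with outlier", with_outlier)]:
    print(f"{name:<13} IQR = {iqr(data):.1f}   "
          f"sd = {statistics.stdev(data):.2f}   "
          f"range = {max(data) - min(data)}")
```

The IQR stays the same in both data sets, while the standard deviation and the range increase sharply once the outlier is present.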
Percentiles
The pth percentile of a set of n ordered observations (from smallest to largest) is the
value where at most p% of the observations are smaller than it and at most (100-p)%
of the observations are larger.
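The percentile definition above can be checked directly by counting. The sketch below is one possible reading of the definition, with hypothetical data and helper names; it lists which observed values qualify as the pth percentile. Integer arithmetic is used to avoid rounding problems.

```python
def satisfies_definition(data, p, value):
    """True if at most p% of the observations are smaller than value
    and at most (100 - p)% are larger."""
    n = len(data)
    smaller = sum(1 for y in data if y < value)
    larger = sum(1 for y in data if y > value)
    return 100 * smaller <= p * n and 100 * larger <= (100 - p) * n

def percentile_candidates(data, p):
    """Observed values that meet the definition of the pth percentile."""
    return [v for v in sorted(set(data)) if satisfies_definition(data, p, v)]

data = [2, 4, 4, 5, 7, 9, 10, 12]   # hypothetical, n = 8

print("25th percentile candidates:", percentile_candidates(data, 25))
print("50th percentile candidates:", percentile_candidates(data, 50))
print("Does 6 qualify as the median?", satisfies_definition(data, 50, 6))
```

With an even number of observations more than one value can meet the definition, which is why a convention (such as taking the midpoint of the two middle values for the median) is needed to get a single number.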