Basic Statistics
Week 1:
Supplementary Subject Matter
Par 1.
A population consists of units (people, goats, etc.)
Par 2.
Experimental design:
- Random sample: Select a number of units of the population at random.
- Experimental study: Data are collected under controlled circumstances. You change something and
look at what the response is.
- Observational study: Two different groups are given the same thing and you look at the consequences.
You cannot establish a cause-and-effect relationship. You cannot change anything about the situation
yourself, only observe.
- Confounder = a factor that could also influence the outcome.
- Design: a 'rule', a recipe, for the way in which the units and data are to be collected → The more
factors play a role, the more complicated the design becomes.
Par 3.
Variables: measured property of an element of the sample
- Random variables: determined by chance
- Qualitative variables:
  - Nominal variables: cannot be ranked. The numbers have no meaning.
  - Ordinal variables: can be ordered. The numbers have meaning.
- Quantitative variables (a count or a measurement):
  - Discrete variables: always whole numbers. Never halves or anything after the decimal point.
  - Continuous variables: can also take non-integer values.
Statistical analysis of the outcomes of variables
→ estimate a relative frequency (percentage)
→ how accurate is this estimated value?
→ whether the relative population frequency or population mean exceeds some critical value.
Tutorial 1:
Doing research:
Step 1: Research question
Step 2: Determine which population you need for this research
Step 3: Take a sample of this population
- Sampling bias: Certain parts of the population might be overrepresented compared to other
parts.
SRS (simple random sampling): everyone has an equal chance of ending up in the sample. So a
pre-selection must never be made.
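A minimal sketch of simple random sampling, assuming a made-up population of 100 labelled units. The helper name `simple_random_sample` and the seed are illustrative choices, not part of the course material.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct units; every unit has the same chance of being selected."""
    rng = random.Random(seed)          # seeded only to make the example repeatable
    return rng.sample(population, n)   # sampling without replacement, no pre-selection

# Hypothetical population of 100 units
population = [f"unit_{i}" for i in range(1, 101)]
sample = simple_random_sample(population, 10, seed=1)
print(sample)
```

Because the draw is made from the full list, no group is filtered out beforehand, which is the point of SRS.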
Other things that could go wrong with sampling:
1. Undersampling: Certain groups are excluded from the sampling
2. Non-response: Not participating, or not successfully contacted
3. Voluntary participation: might result in particularly positive or negative answers. If people can
decide for themselves whether to take part in, for example, a questionnaire, it is often filled in only
by people who are extremely positive or extremely negative.
4. Response bias: Social desirability bias (self-reported personal traits, questions about income)
Tutorial 2:
Mean = Ȳ = the average
- Sensitive to outliers (extremely high or low values)
Median = M: write all values in order from low to high → the middle value is the median
→ With an even number of values, take the average of the 2 values in the middle.
- Not sensitive to outliers
Symmetric distribution: mean and median are (almost) equal
Asymmetric distribution: mean and median differ considerably → contains outliers
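A small illustration of why the mean is sensitive to outliers and the median is not. The numbers are made up for this sketch.

```python
from statistics import mean, median

values = [2, 3, 3, 4, 5]        # made-up data, roughly symmetric
with_outlier = values + [60]    # the same data plus one extreme value

# Without the outlier, mean and median are close together
print(mean(values), median(values))

# With the outlier, the mean is pulled upwards while the median barely moves
# (even number of values: the median is the average of the two middle values)
print(mean(with_outlier), median(with_outlier))
```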
Measures of variability:
- Range= maximum – minimum
- Standard deviation = s = square root of the variance
Variance = s² = (1/(n – 1)) × the sum of (Yi – Ȳ)², where Ȳ is the sample mean
- Interquartile range (IQR)= Q3-Q1
Q1 = first quartile = 25th percentile = lower quartile
Q3 = third quartile = 75th percentile = upper quartile
Is not sensitive to outliers
*See notebook for a diagram with minimum, Q1, median, Q3, maximum*
The pth percentile of a set of n ordered observations (from smallest to largest) is the value where at
most p% of the observations are smaller than it and at most (100-p)% of the observations are larger.
*See notebook for an example of this*
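A worked sketch of the range, variance and standard deviation, assuming the usual sample variance (sum of squared deviations from the mean, divided by n – 1). The data are made up.

```python
from math import sqrt

def sample_variance(ys):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(ys)
    y_bar = sum(ys) / n
    return sum((y - y_bar) ** 2 for y in ys) / (n - 1)

data = [4, 8, 6, 2, 10]               # made-up observations, mean = 6
data_range = max(data) - min(data)    # 10 - 2 = 8
s2 = sample_variance(data)            # (4 + 4 + 0 + 16 + 16) / 4 = 10
s = sqrt(s2)                          # standard deviation = square root of the variance

print(data_range, s2, round(s, 3))
```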
Five number summary
1. The sample minimum (smallest observation)
2. The first quartile = Q1
3. The median (or Q2)
4. The third quartile = Q3
5. The sample maximum (largest observation)
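The five number summary and the IQR can be sketched as below. Note that several conventions for computing quartiles exist; this sketch assumes the "inclusive" method of Python's `statistics.quantiles`, which may differ slightly from the method used in the course.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    # n=4 splits the data into quarters; the three cut points are Q1, Q2 and Q3
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]    # made-up observations
summary = five_number_summary(data)
iqr = summary[3] - summary[1]         # IQR = Q3 - Q1

print(summary, iqr)
```

These five values are also what is needed to draw a boxplot.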
Accuracy = how close an estimate is to the true value
Law of large numbers: Relative frequencies stabilize if an experiment is repeated very often.
Probability: Relative frequency ‘in the long run’.
Statistical notation:
n= sample size; number of persons in the sample
y= number of persons consuming too much salt
p= probability that a randomly chosen person consumes too much salt per day.
Estimator for p=y/n → consistent estimator
The larger the sample size, the closer y/n gets to the unknown value of p
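A simulation sketch of the law of large numbers for the estimator y/n. The true value p = 0.3 is a made-up assumption (in practice p is unknown), and the helper name `estimate_p` is hypothetical.

```python
import random

def estimate_p(n, p, rng):
    """Simulate n persons and return the relative frequency y/n."""
    # Each simulated person consumes too much salt with probability p
    y = sum(1 for _ in range(n) if rng.random() < p)
    return y / n

rng = random.Random(42)   # seeded only to make the example repeatable
true_p = 0.3              # hypothetical value, chosen for this illustration

# The estimates tend to settle closer to true_p as n grows
for n in (10, 100, 1_000, 100_000):
    print(n, estimate_p(n, true_p, rng))
```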
Random phenomenon: a phenomenon determined by chance
Random variable: a variable whose numeric result originates from a random phenomenon
1. Discrete random variables: whole numbers
A probability model for a discrete random variable consists of:
- Set of all possible outcomes S
- List of probabilities P for all possible outcomes.
2. Continuous random variables: non-integer values are also possible
Definition of probability by Laplace: If a random phenomenon has k possible outcomes that are all
equally likely:
P(event A) = number of outcomes in A / total number of possible outcomes
Example: rolling a die. Event A = the outcome is an even number.
P(even number) = 3/6 = 0.5
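The die example can be reproduced with a short sketch. The function name `laplace_probability` is a hypothetical helper, and the rule only applies when all outcomes are equally likely.

```python
from fractions import Fraction

def laplace_probability(event, sample_space):
    """Number of outcomes in the event divided by the total number of outcomes."""
    favourable = [outcome for outcome in sample_space if outcome in event]
    return Fraction(len(favourable), len(sample_space))

die = {1, 2, 3, 4, 5, 6}   # all outcomes of one roll, assumed equally likely
even = {2, 4, 6}           # event A: the outcome is an even number

p_even = laplace_probability(even, die)
print(p_even, float(p_even))   # 3/6 reduces to 1/2
```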
Statistical events
Aᶜ = complement of A = all outcomes that do not occur in A
A u B = union = outcomes that occur either only in A or only in B or in A & B simultaneously
Disjoint events / mutually exclusive events = events A & B that have no outcomes in common
A ^ B = intersection = consists of all outcomes that are part of A as well as of B
*For clear examples of this, see notebook!*
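Events can be treated as sets of outcomes, so Python's set operations give a quick way to try these definitions. The events below are made up around a single roll of a die.

```python
S = {1, 2, 3, 4, 5, 6}   # all possible outcomes of one roll
A = {2, 4, 6}            # event: even number
B = {1, 2, 3}            # event: at most three
C = {1, 3, 5}            # event: odd number

complement_A = S - A     # outcomes that are not in A
union_AB = A | B         # outcomes in A, in B, or in both
intersection_AB = A & B  # outcomes that are in A as well as in B
disjoint_AC = A.isdisjoint(C)   # True when A and C share no outcomes

print(complement_A, union_AB, intersection_AB, disjoint_AC)
```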
Probability properties
0 ≤ P(A) ≤ 1 for any event A
P(S) = 1, where S = all possible outcomes.
Complement rule
P(Aᶜ) = 1 – P(A) for each event A, where Aᶜ is the complement of A
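A quick check of these properties on one roll of a die, assuming equally likely outcomes. The event A (rolling a 5 or 6) is a made-up example.

```python
from fractions import Fraction

S = frozenset(range(1, 7))   # all possible outcomes of one roll

def prob(event):
    """Probability of an event under equally likely outcomes."""
    return Fraction(len(event & S), len(S))

A = {5, 6}       # made-up event: rolling a 5 or a 6
not_A = S - A    # complement of A

print(prob(S))               # the whole outcome set has probability 1
print(prob(A), prob(not_A))  # the two probabilities add up to 1
```

The complement rule is handy when the probability of "not A" is easier to count than the probability of A itself.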
Boxplot: Determine the median, min, max, Q3 and Q1
Determine the IQR
Box = IQR (from Q1 to Q3). Line inside the box = median.
Week 2:
Tutorial 3:
Work of Gregor Mendel: Gregor Mendel, who is known as the "father of modern genetics", and his
colleagues at the monastery studied the variation in pea plants. He pollinated the plants by hand and
saw, for instance, that if you cross a plant that produces yellow peas with a plant that produces
green peas, you get a whole generation that produces yellow peas. If you cross these plants with
each other, green peas show up again. This points to dominant and recessive traits.
Tree diagram
Laws of probability theory
- Multiplication law for independent events:
P(A ^ B) = P(A) x P(B)
- A and B are independent events if the fact that A occurs does not affect the probability
of B occurring
- The multiplication law for independent events can also be used to define that events A and
B are independent.
- Addition law for disjoint events
The probabilities of all the separate ways to get, for example, a heterozygous offspring can be added
up; the sum is the probability that the offspring is heterozygous.
P(A u B) = P(A) + P(B), if A and B are disjoint events
Example:
A = Male parent passing on Y, female parent passing on y.
B = Male parent passing on y, female parent passing on Y.
P(A u B) = P(A) + P(B) = 0.25 + 0.25 = 0.50
- General addition law
Considering sets of events that are not disjoint.
P(A u B) = P(A) + P(B) – P(A ^ B)
If A and B are disjoint, A ^ B is impossible, so P(A ^ B) = 0 and the addition law for disjoint events follows.
Example:
A = male parent passing on Y
B = female parent passing on Y
P(A u B) = P(A) + P(B) – P(A ^ B) = 0.5 + 0.5 – 0.25 = 0.75
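The three laws can be checked by listing every outcome of the cross. This is a sketch under the assumption that both parents are heterozygous (Yy) and each passes on Y or y with probability 1/2, so the four (male, female) combinations are equally likely. The helper `prob` is hypothetical.

```python
from fractions import Fraction
from itertools import product

# Assumption: four equally likely (male allele, female allele) combinations
outcomes = set(product("Yy", repeat=2))

def prob(event):
    """Probability of an event (a set of outcomes) under equally likely outcomes."""
    return Fraction(len(event), len(outcomes))

male_Y = {o for o in outcomes if o[0] == "Y"}     # male parent passes on Y
female_Y = {o for o in outcomes if o[1] == "Y"}   # female parent passes on Y

# Multiplication law for independent events: 1/4 = 1/2 x 1/2
print(prob(male_Y & female_Y) == prob(male_Y) * prob(female_Y))

# Addition law for disjoint events: the two heterozygous combinations
A = {("Y", "y")}
B = {("y", "Y")}
print(prob(A | B) == prob(A) + prob(B))           # both sides equal 1/2

# General addition law: the overlap (Y, Y) must not be counted twice
print(prob(male_Y | female_Y))                    # 1/2 + 1/2 - 1/4 = 3/4
```

Working with exact fractions avoids rounding, so the equalities can be compared directly.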