Experimental Design and Data Analysis Course VU
Tim de Boer
February-March 2021
1 Lecture 1
Contents: recap of statistical concepts.
• The normal density curve is given by:
fµ,σ(x) = 1/√(2πσ²) · exp(−(x − µ)²/(2σ²))
where µ determines the position of the peak on the x-axis and σ determines the width of the density curve.
• The α-quantile is the number qα such that P(X ≤ qα) = α; the upper α-quantile is the other side,
P(X > qα) = α. For α = 0.5, qα is the median; for α = 0.25, qα is the value that splits the data into
25% below and 75% above qα. The function qnorm() finds the boundary value A in P(X < A), given
the probability P. For example, suppose you want to find the 85th percentile of a normal distribution
whose mean is 70 and whose standard deviation is 3. Then you ask for:
qnorm(0.85,mean=70,sd=3)
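As a quick sanity check, the same call decomposes into the mean plus the standard normal quantile times the sd (a sketch; the numbers are the ones from the example above):

```r
# 85th percentile of N(70, 3^2): mean + z_0.85 * sd, where z_0.85 is about 1.036
q = qnorm(0.85, mean = 70, sd = 3)
round(q, 2)  # approximately 73.11
```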
• A QQ-plot can reveal whether data follows a certain distribution P. It plots the theoretical quantiles
of the normal distribution on the x-axis versus the quantiles obtained from the sample on the y-axis.
An approximately straight line means the distribution fits (the data could have been sampled from that
population). Use the qqnorm plot together with a histogram to see if the data is normally distributed,
and also plot a boxplot to get an idea of differences and spread:
par(mfrow=c(1,2)); qqnorm(data); hist(data); boxplot(hours~environment)
• Central Limit Theorem: repeatedly draw samples from an (unknown) distribution and compute the
mean of each sample. The distribution of these sample means is approximately normal, and it becomes
more and more normal the longer you keep sampling and the higher the sample size. When a sample
of size n is taken from the distribution N(µ, σ²), the sample mean has distribution N(µ, σ²/n): this is
another way of stating the Central Limit Theorem, and it shows that the sample mean varies less than
individual observations.
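A minimal simulation sketch of this (the exponential distribution and the sample size are arbitrary choices): sample means of a skewed distribution concentrate around the true mean with spread roughly σ/√n.

```r
# Sketch: sample means of a skewed distribution look approximately normal (CLT)
set.seed(1)
n = 50                                            # sample size (arbitrary choice)
means = replicate(2000, mean(rexp(n, rate = 1)))  # true mean 1, true sd 1
mean(means)   # close to 1
sd(means)     # close to 1/sqrt(n), about 0.141
hist(means)   # histogram looks approximately normal
```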
• In a real dataset, the full population standard deviation σ is unknown. We replace σ with the sample
standard deviation s, which gives the statistic
T = (X̄ − µ) / (s/√n)
This does not have a N(0, 1) distribution, due to the extra uncertainty from estimating σ; instead, T
has a t-distribution with n − 1 degrees of freedom.
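As a sketch (the sample and the hypothesized mean below are made up), the statistic computed by hand matches what R's t.test reports:

```r
# Computing T by hand and comparing with t.test (data values are assumptions)
set.seed(42)
n = 25
x = rnorm(n, mean = 5, sd = 2)     # hypothetical sample
mu0 = 5                            # hypothesized mean
t_manual = (mean(x) - mu0) / (sd(x) / sqrt(n))
t_builtin = unname(t.test(x, mu = mu0)$statistic)
# t_manual equals t_builtin up to floating-point error
```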
• A point estimate for an unknown parameter (for example the mean) is a function of only the observed
data, seen as a random variable. We denote it with a hat, e.g. µ̂.
• The confidence interval of level 1 − α, e.g. 95%, is a random interval based only on the observed data
that contains the true value of the parameter with probability 95%. If σ is unknown (which is true in
almost all cases), the t-confidence interval [X̄ − t, X̄ + t] is used: how confident are we that the true
value, e.g. a proportion, lies within about 2 standard errors of its estimate p̂. If we want to calculate a
95% confidence interval for a normally distributed population with known σ, we have to use the 97.5th
percentile:
CIrange = X̄ ± qnorm(0.975) · σ/√n
And for a sample of the population we use the t-distribution:
CIrange = X̄ ± qt(0.975, n − 1) · s/√n
with s the sample standard deviation instead of the population standard deviation σ.
In R for normal distributed population:
mu = mean(birthweights); sd = sd(birthweights); size = length(birthweights)
error = qnorm(0.975)*sd / sqrt(size) # or with qt if we have sample
lowerbound = mu - error; upperbound = mu + error
• Strong outcome: H0 rejected, so we conclude H1. Weak outcome: H0 not rejected. Type 1 error:
rejecting H0 while it is true; type 2 error: not rejecting H0 while it is false.
• Power depends on the amount of data: power = 1 − P(type 2 error), so it is the probability of correctly
rejecting H0 (seeing an effect which really is an effect). To estimate the power of our test by simulation,
we repeat the test B = 1000 times: each time we generate samples x and y from distributions with
chosen parameters, do a t-test, and record the p-value; the power is the fraction of p-values below our
threshold of 0.05, computed as the mean over all tests. For this example, the null hypothesis we are
testing is H0 : µ = ν.
B = 1000; nu = 175; mu = 180; m = n = 30; sd = 5; p = numeric(B)
for (b in 1:B) {
x = rnorm(n, mu, sd); y = rnorm(m, nu, sd)
p[b] = t.test(x, y, var.equal=TRUE)[[3]] # 3rd element is the p-value for our H_0
}
power = mean(p < 0.05)
• There are three ways to reject H0: the t-value is larger (in absolute value) than the critical quantile,
the p-value is lower than 0.05, or the hypothesized mean is not in the confidence interval (i.e. not
within roughly 2 standard errors of the sample mean).
• Since we don’t know σ, we generally use the t-distribution with t0.025,n−1 (2.5%, n − 1 degrees of
freedom); this makes the CI bigger (more conservative), since t > z = 1.96.
• For the two-sample t-test we subtract the sample means and divide by the pooled standard error of the
two samples (the pooled variance combines both sample variances with size1 + size2 − 2 degrees of
freedom), which gives our T. It is unreliable for sample sizes below 20. In R it is simple: t.test(x, y),
creating x and y with x = rnorm(size, mean, sd) (note that rnorm takes the standard deviation, not
the variance).
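A minimal sketch of such a two-sample test (the means, sd, and sizes below are arbitrary assumptions):

```r
# Two-sample t-test with equal variances; rnorm takes the sd, not the variance
set.seed(7)
x = rnorm(30, mean = 175, sd = 5)
y = rnorm(30, mean = 180, sd = 5)
res = t.test(x, y, var.equal = TRUE)
res$parameter   # degrees of freedom: 30 + 30 - 2 = 58
res$p.value     # small here, since the true means differ by 5
```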
• For a one-sample test (is the data mean equal to / smaller / bigger than a certain mean?) we can use
the t-test or the sign test. For normal data the t-test has bigger power (closer to 1), since the t-test
makes a stronger assumption (the data must be normal) and thus performs better than the sign test
on normal data, while the sign test does not assume a normal distribution. We can do a one-sided
t-test as follows, in this case to check if the mean is bigger than 2800:
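The notes break off here; a minimal sketch of such a one-sided test, using a made-up birthweights vector, could look like:

```r
# Hypothetical data; H0: mu = 2800 vs H1: mu > 2800 (one-sided)
birthweights = c(2900, 3100, 2750, 3300, 2850, 3000, 3200, 2950)  # assumed values
res = t.test(birthweights, mu = 2800, alternative = "greater")
res$p.value   # reject H0 at level 0.05 if this is below 0.05
```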