Summary

Experimental Design and Data Analysis (X_405078): Complete summary with Code implementation in R


Course
Experimental Design and Data Analysis (X_405078)

Institution
Vrije Universiteit Amsterdam (VU)

A full overview of the course content, with implementations in R.


Uploaded on April 12, 2021
Number of pages 15
Written in 2020/2021
Type Summary

statistics
experimental design
introduction to r
r

Education
MSc Artificial Intelligence

timdeboer


Experimental Design and Data Analysis Course VU
Tim de Boer
February-March 2021

1 Lecture 1
Contents: recap of statistical concepts.

• The normal density curve is given by:
$$f_{\mu,\sigma}(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2}\,\frac{(x-\mu)^2}{\sigma^2}\right)$$
where µ determines the position of the peak on the x-axis and σ determines the width of the density curve.
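
As a quick check of the formula, the density can be coded by hand and compared with R's built-in dnorm(). This is only a sketch: the function name normal_density and the values µ = 70, σ = 3 are chosen here for illustration.

```r
# Hand-coded normal density, following the formula above
normal_density <- function(x, mu, sigma) {
  1 / sqrt(2 * pi * sigma^2) * exp(-0.5 * (x - mu)^2 / sigma^2)
}

x <- seq(60, 80, by = 0.5)                    # illustrative grid of x values
by_hand  <- normal_density(x, mu = 70, sigma = 3)
built_in <- dnorm(x, mean = 70, sd = 3)

# The two should agree up to floating-point error
max_diff <- max(abs(by_hand - built_in))
print(max_diff)
```

Note that dnorm() is parameterised by the standard deviation σ, not the variance σ².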

• The α-quantile is the number qα such that P(X ≤ qα) = α. The upper α-quantile is the other side: the
number qα such that P(X > qα) = α. The quantile qα is the median if we choose α = 0.5. If α = 0.25, qα is the
value that splits the distribution into 25% below and 75% above qα. The function qnorm() finds the
boundary value A in P(X < A) = P, given the probability P. For example, suppose you want to find the
85th percentile of a normal distribution whose mean is 70 and whose standard deviation is 3. Then
you ask for:

qnorm(0.85,mean=70,sd=3)
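
To make the link between quantiles and probabilities concrete, the sketch below checks that pnorm() undoes qnorm() and shows the upper quantile via the lower.tail argument. The variable names are illustrative only.

```r
# 85th percentile of N(70, 3^2), as in the example above
q85 <- qnorm(0.85, mean = 70, sd = 3)

# pnorm() is the inverse operation: it returns P(X <= q85)
p_back <- pnorm(q85, mean = 70, sd = 3)

# Upper quantile: the value that is exceeded with probability 0.85
q_upper <- qnorm(0.85, mean = 70, sd = 3, lower.tail = FALSE)

# The 0.5-quantile is the median, which equals the mean for a normal distribution
q_median <- qnorm(0.5, mean = 70, sd = 3)

print(c(q85 = q85, p_back = p_back, q_upper = q_upper, q_median = q_median))
```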

• A QQ-plot can reveal whether data follow a certain distribution P. It plots the theoretical quantiles
of that distribution (here the normal distribution) on the x-axis against the ordered sample values
(sample quantiles) on the y-axis. An approximately straight line suggests that the sample comes from
the assumed distribution. Use the qqnorm plot together with a histogram to see if the data is normally
distributed. Also plot a boxplot to get an idea about differences between groups and about spread.

par(mfrow=c(1,3)); qqnorm(data); hist(data); boxplot(hours~environment) # 3 panels for 3 plots
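
To see what a "good" and a "bad" QQ-plot look like, the sketch below uses simulated data (not the course data): one normal sample and one clearly skewed sample. The correlation between theoretical and sample quantiles is added here only as a rough numeric summary of straightness; it is not part of the course material.

```r
set.seed(1)  # fixed seed so the illustration is reproducible
normal_data <- rnorm(200, mean = 70, sd = 3)  # simulated, normally distributed
skewed_data <- rexp(200, rate = 1)            # simulated, clearly not normal

par(mfrow = c(2, 2))
qq_normal <- qqnorm(normal_data, main = "QQ-plot: normal data"); qqline(normal_data)
hist(normal_data, main = "Histogram: normal data")
qq_skewed <- qqnorm(skewed_data, main = "QQ-plot: skewed data"); qqline(skewed_data)
hist(skewed_data, main = "Histogram: skewed data")

# qqnorm() returns the plotted points; a straight line gives a correlation near 1
cor_normal <- cor(qq_normal$x, qq_normal$y)
cor_skewed <- cor(qq_skewed$x, qq_skewed$y)
print(c(normal = cor_normal, skewed = cor_skewed))
```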

• Central Limit Theorem: if we repeatedly draw samples from an (unknown) distribution and calculate
the mean of each sample, the distribution of these sample means is approximately normal, even when
the underlying distribution is not. The larger the sample size, the closer the distribution of the
sample mean is to a normal distribution. When a sample of size n is taken from the distribution
N(µ, σ²), the sample mean has distribution N(µ, σ²/n). In other words, the sample mean varies less
than the individual observations.
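
A small simulation can illustrate this. The sketch below assumes an exponential distribution with rate 1 (mean 1, standard deviation 1) as the "unknown" distribution; the number of repetitions B and the sample sizes are arbitrary illustrative choices.

```r
set.seed(42)
B <- 2000                      # number of repeated samples (illustrative choice)
sample_sizes <- c(2, 10, 50)

par(mfrow = c(1, 3))
sd_of_means <- numeric(length(sample_sizes))
for (i in seq_along(sample_sizes)) {
  n <- sample_sizes[i]
  # B sample means, each computed from n draws of Exp(1)
  means <- replicate(B, mean(rexp(n, rate = 1)))
  sd_of_means[i] <- sd(means)
  hist(means, main = paste("n =", n), xlab = "sample mean")
}

# The spread of the sample mean should be close to sigma / sqrt(n) = 1 / sqrt(n)
print(rbind(simulated = sd_of_means, theoretical = 1 / sqrt(sample_sizes)))
```

The histograms should look more symmetric and bell-shaped as n grows, while the spread shrinks roughly like 1/√n.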

• In a real dataset, the population standard deviation σ is unknown. We replace σ with the sample
standard deviation s, which gives the T-statistic as the counterpart of the standardised sample mean:
$$T = \frac{\bar{X} - \mu}{s/\sqrt{n}}$$
Because s is itself estimated from the data, T does not have the N(0,1) distribution. Instead, T has
the t-distribution with n − 1 degrees of freedom.
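
The statistic can be computed by hand and compared with what t.test() reports. This is a sketch on simulated data; the sample size 15 and the hypothesised mean mu0 = 180 are illustrative assumptions.

```r
set.seed(7)
x <- rnorm(15, mean = 180, sd = 5)  # simulated sample; values are illustrative
mu0 <- 180                          # hypothesised mean
n <- length(x)

# T = (sample mean - mu0) / (s / sqrt(n))
t_by_hand <- (mean(x) - mu0) / (sd(x) / sqrt(n))
t_builtin <- unname(t.test(x, mu = mu0)$statistic)

# For small n the t-quantile is clearly larger than the normal quantile
q_t <- qt(0.975, df = n - 1)
q_z <- qnorm(0.975)
print(c(t_by_hand = t_by_hand, t_builtin = t_builtin, q_t = q_t, q_z = q_z))
```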


• A point estimate for an unknown parameter (for example the mean) is a function of only the observed
data, seen as a random variable. Point estimates are denoted with a hat, e.g. µ̂.
• A confidence interval of level 1 − α, e.g. 95%, is a random interval based only on the observed data that
contains the true value of the parameter with probability 1 − α (here 95%). If σ is unknown (which is true in
almost all cases), the t-confidence interval becomes [X̄ − t · s/√n, X̄ + t · s/√n], with t the upper α/2
quantile of the t-distribution with n − 1 degrees of freedom. The same idea applies to a proportion: how
confident are we that the true proportion lies within about 2 standard errors of the sample proportion p̂?
For a 95% confidence interval we need the 97.5th percentile, because 2.5% is cut off in each tail. For a
normally distributed population with known σ:
$$\text{CI} = \bar{X} \pm \texttt{qnorm(0.975)} \cdot \frac{\sigma}{\sqrt{n}}$$
When σ is unknown and estimated from the sample, we use the t-distribution:
$$\text{CI} = \bar{X} \pm \texttt{qt(0.975, n - 1)} \cdot \frac{s}{\sqrt{n}}$$
with s the standard deviation of the sample, used instead of σ, the standard deviation of the whole population.
In R, for a normally distributed population:

mu = mean(birthweights); s = sd(birthweights); n = length(birthweights) # s, not sd: avoid masking sd()
error = qnorm(0.975)*s / sqrt(n) # use qt(0.975, df=n-1) instead when sigma is estimated by s
lowerbound = mu - error; upperbound = mu + error
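
The t-based interval can be checked against the one reported by t.test(). The sketch below uses a simulated stand-in for the data (the vector weights and its parameters are assumptions, not the course data).

```r
set.seed(3)
# Simulated stand-in for the birthweights data (not the course data)
weights <- rnorm(50, mean = 2900, sd = 450)

n <- length(weights)
xbar <- mean(weights)
s <- sd(weights)

# t-based 95% interval: xbar +/- t_{0.975, n-1} * s / sqrt(n)
margin_t <- qt(0.975, df = n - 1) * s / sqrt(n)
ci_t <- c(xbar - margin_t, xbar + margin_t)

# Normal-based interval for comparison (slightly narrower)
margin_z <- qnorm(0.975) * s / sqrt(n)
ci_z <- c(xbar - margin_z, xbar + margin_z)

# t.test() reports the same t-based interval
ci_builtin <- as.numeric(t.test(weights, conf.level = 0.95)$conf.int)
print(rbind(ci_t, ci_z, ci_builtin))
```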

• Strong outcome: H0 is rejected in favour of H1. Weak outcome: H0 is not rejected (which does not show
that H0 is true). Type 1 error: rejecting H0 while it is true; type 2 error: not rejecting H0 while it is false.
• Power = 1 − P(type 2 error): the probability of correctly rejecting H0 (seeing an effect which is really
an effect). Power depends on the amount of data. If we want to know the power of our test, we repeat the
test 1000 times: each time we generate samples x and y from distributions with chosen parameters, do a
t-test, and record the p-value. The power is estimated by how often the p-value is below our threshold
of 0.05, which we can calculate as the mean of the indicator p < 0.05 over all tests. For this example,
the null hypothesis we are testing is H0 : nu = mu.

B = 1000; nu = 175; mu = 180; m = n = 30; sd = 5; p = numeric(B)
for (b in 1:B) {
x = rnorm(n,mu,sd); y = rnorm(m,nu,sd)
p[b] = t.test(x,y,var.equal=TRUE)[[3]] # 3rd element is the p-value for our H_0
}
power = mean(p<0.05) # fraction of simulated tests that reject H_0
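
As a cross-check, the simulated power can be compared with the analytic value from power.t.test() in the stats package. This is a sketch that repeats the same design (30 observations per group, means 180 and 175, standard deviation 5); the name sigma is used here instead of sd to avoid masking the sd() function.

```r
set.seed(11)
B <- 1000; n <- 30; mu <- 180; nu <- 175; sigma <- 5

# Simulated power, same design as the loop above
p <- replicate(B, t.test(rnorm(n, mu, sigma), rnorm(n, nu, sigma),
                         var.equal = TRUE)$p.value)
power_sim <- mean(p < 0.05)

# Analytic power for a two-sample t-test with n observations per group
power_exact <- power.t.test(n = n, delta = mu - nu, sd = sigma,
                            sig.level = 0.05, type = "two.sample")$power

print(c(simulated = power_sim, analytic = power_exact))
```

Lowering n or the difference mu − nu in this sketch shows how power drops with less data or a smaller effect.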

• There are three equivalent ways to reject H0 : the absolute t-value is bigger than the critical quantile,
the p-value is lower than 0.05, or the hypothesised mean is not in the confidence interval (so the mean we
want to test is not within the sample mean plus or minus roughly 2 standard errors).
• Since we don't know σ, we generally use the t-distribution with t0.025,n−1 (upper 2.5%, n − 1
degrees of freedom); this makes the CI wider (more conservative) since t > z, which was 1.96.
• For the two-sample t-test, T is the difference of the sample means divided by the standard error of
that difference; with equal variances this standard error is based on the pooled variance, which has
Size1 + Size2 − 2 degrees of freedom. Unreliable for sizes below 20. In R it is simple: t.test(x, y),
where x and y can be simulated with x = rnorm(size, mean, sd); note that the third argument of rnorm()
is the standard deviation, not the variance.
• For a one-sample test (is the population mean equal to, smaller than, or bigger than a certain value?)
we can use the t-test or the sign test. For normal data the t-test has bigger power (closer to 1): it
makes a stronger assumption (the data must be normal) and therefore performs better on normal data
than the sign test, which does not assume a normal distribution. We can do a one-sided t-test as
follows, in this case to check if the mean is bigger than 2800:
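
A minimal sketch of such a one-sided test, assuming the data are in a numeric vector. Here birthweights is a simulated stand-in with made-up parameters, not the course data. The sign test for the same question is shown via binom.test(), which is one common way to carry it out in R.

```r
set.seed(5)
# Simulated stand-in for the data (the course data are not reproduced here)
birthweights <- rnorm(100, mean = 2900, sd = 450)

# One-sided t-test of H0: mean <= 2800 against H1: mean > 2800
t_res <- t.test(birthweights, mu = 2800, alternative = "greater")

# Sign test: under H0 (median = 2800) the number of observations above 2800
# is Binomial(n, 0.5), so it can be tested with binom.test()
n_above <- sum(birthweights > 2800)
sign_res <- binom.test(n_above, length(birthweights), p = 0.5,
                       alternative = "greater")

print(c(t_test_p = t_res$p.value, sign_test_p = sign_res$p.value))
```

H0 is rejected at the 5% level when the reported p-value is below 0.05.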
