Summary for the course Advanced Data Analysis


Summary for the course 0HM120: Advanced Data Analysis. Consists of the following: - Summary of all slides - Mandatory reading materials (lecture_notes_1, Haans (2008), Rosnow and Rosenthal (1995), Haans (2018), Spencer et al. (2005), Zhao et al. (2010) and Borenstein et al. (2009)). ...


Preview 4 out of 33  pages

  • October 29, 2019
  • 33
  • 2019/2020
  • Summary

0HM120: Advanced Data Analysis
Descriptive and inferential statistics:
Introduction to descriptive and inferential statistics
Through the use of statistics, we aim to answer questions about an unattainable population
on the basis of a sample.
Dependent variable: the variable that is measured.
Independent variable: the variable that is changed/controlled during the experiment.
Nominal variables classify objects into qualitatively distinct groups.
Binomial (dichotomous) variable: nominal variable with two groups.
Ordinal variables: similar to nominal, but the various groups can be rank ordered with
respect to some underlying characteristic.
Interval variable: quantitative but lacks an absolute zero.
Ratio variable: interval with an absolute zero.

Descriptors for central tendency all describe what the most typical value for a certain
variable X in the population is. Common descriptors are the mean, mode and median.
- Mode: value that occurs most often in the population.
- Mean: arithmetic average of the population ($\mu_x$), calculated by:
$\mu_x = \frac{\sum_{i=1}^{N} x_i}{N}$
- Median: middle value if all values are ordered. If there is an even number of data points,
it is the arithmetic average of the two middle values.
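
As a small illustration of the three descriptors above, the following sketch computes them for a made-up set of scores. The data are hypothetical and only serve to show the calculations.

```python
import statistics

# Hypothetical scores for a small population (made up for illustration).
scores = [2, 3, 3, 5, 7, 10]

mode = statistics.mode(scores)      # value that occurs most often
mean = sum(scores) / len(scores)    # arithmetic average: sum of x_i divided by N
median = statistics.median(scores)  # even N: average of the two middle values

print("mode:", mode, "mean:", mean, "median:", median)
```

Because the number of scores is even, the median here is the average of the two middle values (3 and 5).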

Spread: how much people in the population differ from each other, or from what is typical.
Three different descriptors are:
- Range: difference between the two most extreme values. It provides a numerical
descriptor of the maximum difference between the people in our population.
- Population variance ($\sigma_x^2$) is calculated by:
$\sigma_x^2 = \frac{\sum_{i=1}^{N} (x_i - \mu_x)^2}{N}$
It is the average squared difference between a participant’s value and the mean value.
- Standard deviation of the population:
$\sigma_x = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu_x)^2}{N}}$
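
The spread descriptors can be sketched in a few lines. The values below are hypothetical and are treated as a complete population, so the sum of squared differences is divided by N.

```python
import math

# Hypothetical values, treated as the whole population (made up for illustration).
values = [2, 4, 4, 4, 5, 5, 7, 9]
N = len(values)

value_range = max(values) - min(values)  # difference between the two extremes
mu = sum(values) / N                     # population mean
variance = sum((x - mu) ** 2 for x in values) / N  # average squared difference
sd = math.sqrt(variance)                 # population standard deviation

print("range:", value_range, "variance:", variance, "SD:", sd)
```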

Frequency distribution (population distribution): many variables in psychology approximate
the normal or Gaussian distribution, which is bell-shaped, symmetrical and extends to infinity
in both tails of the distribution. If a variable X is normally distributed in the population, or
approximates a normal distribution, then 68.3% of the population have an x that falls
between the population mean - one SD and the population mean + one SD. 95.5% will fall
between mean - 2SD and mean + 2SD. This applies to every normal distribution.

All variables with a normal distribution can be transformed into a standard normal distribution
or Z-distribution, which has a mean of 0.00 and a SD of 1.00. For this, each datum on
variable X is converted into its respective score on the standard normal distribution (z-score),
by:
$z_i = \frac{x_i - \mu_x}{\sigma_x}$
$\bar{x}$ = the mean of x in the sample (what we know).
$\mu_x$ = the mean of x in the population (what we want to know).
$\hat{\mu}_x$ = estimated mean of x in the population on basis of the sample (best we can do).
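
The percentages quoted for the normal distribution and the z-transformation can be checked with Python's standard library. This is a minimal sketch; the function name `z_score` and the example numbers (mean 100, SD 15) are assumptions made for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)  # the Z-distribution

# Share of a normal population within 1 SD and within 2 SD of the mean.
within_1sd = std_normal.cdf(1) - std_normal.cdf(-1)
within_2sd = std_normal.cdf(2) - std_normal.cdf(-2)

def z_score(x, mu, sigma):
    """Convert a raw score x into its score on the standard normal distribution."""
    return (x - mu) / sigma

print(round(within_1sd, 3), round(within_2sd, 3))
print(z_score(130, mu=100, sigma=15))  # hypothetical score, mean and SD
```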

Central limit theorem (CLT): whatever the distribution of X in the population is, if you take
many large samples (sample sizes of n>40) and calculate the mean of each sample, then
the distribution of these sample means is approximately a normal distribution. The mean of
the sampling distribution of means equals the population mean: $M(\bar{x}) = \mu_x$.
According to CLT the SD of the sampling distribution is the Standard Error (SE), which can
be calculated with: $SE_{\bar{x}} = \frac{\sigma_x}{\sqrt{n}}$.
95% of all possible sample means fall within 1.96 SE of the population mean:
$\mu_x - 1.96 \cdot SE_{\bar{x}} < \bar{x} < \mu_x + 1.96 \cdot SE_{\bar{x}}$.
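
A small simulation can make the CLT statements concrete. The sketch below assumes a skewed (exponential) population with mean 1 and SD 1. The number of samples, the sample size and the random seed are arbitrary choices for illustration.

```python
import math
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

N_SAMPLES = 5000  # number of samples drawn (arbitrary choice)
SAMPLE_SIZE = 50  # n per sample, above the n>40 rule of thumb

# Assumed population: exponential with rate 1, which is skewed, with mean 1 and SD 1.
POP_MEAN, POP_SD = 1.0, 1.0

sample_means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(SAMPLE_SIZE)])
    for _ in range(N_SAMPLES)
]

mean_of_means = statistics.fmean(sample_means)   # should be close to POP_MEAN
sd_of_means = statistics.pstdev(sample_means)    # should be close to the SE
standard_error = POP_SD / math.sqrt(SAMPLE_SIZE)

# Share of sample means within 1.96 SE of the population mean (about 95%).
lower = POP_MEAN - 1.96 * standard_error
upper = POP_MEAN + 1.96 * standard_error
share_inside = sum(lower < m < upper for m in sample_means) / N_SAMPLES

print(mean_of_means, sd_of_means, standard_error, share_inside)
```

Even though the population itself is far from normal, the spread of the simulated sample means should land close to the SE given by the formula.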

Hypothesis: statement about parameters of populations.
Hypothesis testing: testing whether or not we can confidently reject such a statement, called
the null hypothesis (H0), on the basis of the empirical evidence. If we refute, or reject, the
null hypothesis, we do so in favor of an alternative hypothesis (H1).

Type 1 error: false positive, incorrectly rejecting the null hypothesis.
Type 2 error: false negative, not rejecting the null hypothesis when it is false.
Type 3 error: having a good answer, to the wrong question.

P-value: reflects how surprising an observed sample mean is against the value hypothesized
in the null hypothesis. The lower the p-value, the more surprising the observed sample mean
is and the stronger the evidence against the null hypothesis. It is best interpreted as the
probability of finding the observed mean, or one that is more different from the hypothesized
value, under the assumption that H0 is true.
It answers the question: what is the probability of finding a certain observed value, or a more
extreme value, under the assumption that the null hypothesis is true?
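
To illustrate this interpretation, the sketch below computes a two-sided p-value for a single sample mean. It assumes that the population SD is known, so that the sampling distribution of means is normal. All numbers are hypothetical.

```python
import math
from statistics import NormalDist

# Hypothetical values (assumptions for illustration only).
mu_h0 = 100.0          # population mean hypothesized under H0
sigma = 15.0           # population SD, assumed to be known
n = 36                 # sample size
observed_mean = 105.0  # mean observed in the sample

standard_error = sigma / math.sqrt(n)
z = (observed_mean - mu_h0) / standard_error

# Probability of a sample mean at least this far from mu_h0, in either direction,
# under the assumption that H0 is true.
p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))

print("z:", z, "p:", round(p_two_sided, 4))
```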

The population standard deviation can be estimated by:
$\hat{\sigma}_x = s_x = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$
There are two non-desirable consequences of estimating the population SD:
1. SD of the sampling distribution is now based on an estimate as well.
2. The sx is not a particularly good estimate of the population SD. Therefore, the
sampling distribution of means will not exactly be a normal distribution. The smaller
the sample size n is, the less the sampling distribution of means approximates the
normal distribution.
Degrees of freedom: amount of (true and thus non-redundant) information in the data.
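
The estimate of the population SD can be sketched as follows, assuming the usual estimator that divides by the degrees of freedom n-1. The sample is made up for illustration.

```python
import math
import statistics

# Hypothetical sample (made up for illustration).
sample = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(sample)
df = n - 1  # degrees of freedom when estimating a single mean

x_bar = sum(sample) / n
sum_of_squares = sum((x - x_bar) ** 2 for x in sample)

s_x = math.sqrt(sum_of_squares / df)      # estimate of the population SD
sd_div_n = math.sqrt(sum_of_squares / n)  # dividing by n gives a smaller value

print("s_x:", s_x, "SD when dividing by n:", sd_div_n)
print("statistics.stdev:", statistics.stdev(sample))  # also divides by n-1
```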

Student t-distribution: shape depends on the degrees of freedom (df) or
the effective sample size. When estimating the mean of a population the
degrees of freedom are calculated by n-1. If the degrees of freedom are
very large, the Student t-distribution is similar to the normal distribution.
Independent samples t-test:
$t = \frac{\Delta_{\bar{x}_A - \bar{x}_B} - \Delta\mu_{H_0}}{SE_{\Delta \bar{x}_A - \bar{x}_B}}$
The SE is the SD of the sampling distribution of difference scores. It is calculated from the
observed data as follows:
$SE_{\Delta \bar{x}_A - \bar{x}_B} = s_p \sqrt{\frac{1}{n_A} + \frac{1}{n_B}}$
$s_p$ is the estimated pooled SD of the population:
$\hat{\sigma}_p = s_p = \sqrt{\frac{(n_A - 1) s_A^2 + (n_B - 1) s_B^2}{n_A + n_B - 2}}$
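
The t statistic of the independent samples t-test can be computed step by step as in the sketch below. The function names and both groups of scores are hypothetical. The sketch stops at t and the degrees of freedom, because a p-value would additionally require the t-distribution.

```python
import math
import statistics

def pooled_sd(group_a, group_b):
    """Estimated pooled SD of two independent groups."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance, divides by n-1
    var_b = statistics.variance(group_b)
    return math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))

def independent_t(group_a, group_b, delta_h0=0.0):
    """Return (t, df) for the difference between two independent group means."""
    n_a, n_b = len(group_a), len(group_b)
    standard_error = pooled_sd(group_a, group_b) * math.sqrt(1 / n_a + 1 / n_b)
    mean_difference = statistics.fmean(group_a) - statistics.fmean(group_b)
    t = (mean_difference - delta_h0) / standard_error
    return t, n_a + n_b - 2

# Hypothetical scores for two independent groups.
group_a = [5.0, 7.0, 8.0, 6.0, 9.0]
group_b = [4.0, 5.0, 6.0, 3.0, 7.0]

t, df = independent_t(group_a, group_b)
print("t:", t, "df:", df)
```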

Power analysis: deciding on a reasonable sample size for your study.
Effect size: difference between the hypothesized and an expected or estimated value.




Cohen’s d: used when comparing a single group mean against a hypothesized value. A d of 0.2
is small, 0.5 is moderate and 0.8 is large. It can be calculated as:
$\hat{d} = \frac{\bar{x} - \mu_{H_0}}{s_x}$
When two independent groups are compared, it can be calculated with:
$\hat{d} = \frac{\Delta_{\bar{x}_A - \bar{x}_B} - \Delta_{H_0}}{s_p}$
In the case of the independent samples t-test it is:
$\hat{d} = t \sqrt{\frac{1}{n_A} + \frac{1}{n_B}}$
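
The two routes to Cohen's d for independent groups (directly from the pooled SD, or from the t value) should give the same number. The sketch below checks this on hypothetical data, testing against a null difference of zero.

```python
import math
import statistics

# Hypothetical scores for two independent groups.
group_a = [5.0, 7.0, 8.0, 6.0, 9.0]
group_b = [4.0, 5.0, 6.0, 3.0, 7.0]
n_a, n_b = len(group_a), len(group_b)

# Pooled SD of both groups.
s_p = math.sqrt(
    ((n_a - 1) * statistics.variance(group_a)
     + (n_b - 1) * statistics.variance(group_b))
    / (n_a + n_b - 2)
)

mean_difference = statistics.fmean(group_a) - statistics.fmean(group_b)

# Route 1: difference between the means divided by the pooled SD.
d_direct = mean_difference / s_p

# Route 2: from the t value of the independent samples t-test.
t = mean_difference / (s_p * math.sqrt(1 / n_a + 1 / n_b))
d_from_t = t * math.sqrt(1 / n_a + 1 / n_b)

print("d (direct):", d_direct, "d (from t):", d_from_t)
```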
95% confidence interval (CI): the frequentist interpretation is that 95% of all CIs, as
estimated on the basis of all possible (hypothetical) samples, will enclose the population
mean.
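
The frequentist interpretation can be illustrated with a simulation: draw many samples from a known population, compute a 95% CI for each, and count how often the interval encloses the population mean. The sketch assumes a normal population with a known SD (so that 1.96 SE can be used). The population values, sample size and seed are arbitrary.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

POP_MEAN, POP_SD = 100.0, 15.0  # assumed (normally known only in a simulation)
n = 25                          # size of each hypothetical sample
repetitions = 2000              # number of hypothetical samples

standard_error = POP_SD / math.sqrt(n)
covered = 0
for _ in range(repetitions):
    sample = [random.gauss(POP_MEAN, POP_SD) for _ in range(n)]
    sample_mean = statistics.fmean(sample)
    lower = sample_mean - 1.96 * standard_error
    upper = sample_mean + 1.96 * standard_error
    if lower <= POP_MEAN <= upper:
        covered += 1  # this CI encloses the population mean

coverage = covered / repetitions
print("share of CIs enclosing the population mean:", coverage)
```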

Bootstrapping: taking a large number of samples of sample size n from your original sample.
The size n of each bootstrap sample should be equal to the size of the sample you took. For
example, consider a sample of n=8. First take 1000 random samples of n=8 (with
replacement) and calculate the mean of each sample. Then order the 1000 bootstrap means
from smallest to largest and find the 25th and 976th. These are the lower and upper
bound of the confidence interval. This is the percentile-based method for calculating 95%
bootstrap confidence intervals.
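
The percentile-based procedure described above can be sketched as follows. The sample of n=8 and the random seed are hypothetical. The 25th and 976th ordered means are used as bounds, following the text.

```python
import random
import statistics

random.seed(7)  # fixed seed so the sketch is reproducible

# Hypothetical original sample of n=8.
sample = [12, 15, 9, 20, 14, 11, 17, 13]
n = len(sample)

# 1000 bootstrap samples of size n, drawn with replacement, and their means,
# ordered from smallest to largest.
bootstrap_means = sorted(
    statistics.fmean(random.choices(sample, k=n)) for _ in range(1000)
)

lower_bound = bootstrap_means[24]   # the 25th ordered mean (index starts at 0)
upper_bound = bootstrap_means[975]  # the 976th ordered mean

print("95% bootstrap CI:", lower_bound, "to", upper_bound)
```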

Counter-null hypothesis: alternative value for the null hypothesis, namely one that yields the
same p-value as when the observed difference is tested against the null hypothesis that the
difference is zero.
The null hypothesis is rejected when the difference of 0 is not included in the confidence
interval of the observed difference.

Power: long run probability of rejecting the null hypothesis when it is false. It reflects the
sensitivity of your testing procedure. It can be increased by setting a significance level
higher than α=0.05, but then the long run probability of incorrectly rejecting the null
hypothesis will increase as well. The best you can do is increase the sample size, which will
reduce the spread of the sampling distribution, so that the critical value moves closer to the
hypothesized value under H0.
Power analysis: determining the sample size needed to obtain a desired power.
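
How power depends on the sample size and on α can be sketched for a simplified case. The code below assumes a one-sided test of a single mean with a known population SD (a z-test), which is a simplification rather than the t-test procedure. The function names and the target power of 0.80 are assumptions for illustration.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

def power_one_sided_z(effect_size_d, n, alpha=0.05):
    """Power of a one-sided z-test for a true standardized effect size d."""
    z_critical = STD_NORMAL.inv_cdf(1 - alpha)
    # A larger n shrinks the SE, which moves the critical value closer to H0.
    return 1 - STD_NORMAL.cdf(z_critical - effect_size_d * math.sqrt(n))

def required_n(effect_size_d, target_power=0.80, alpha=0.05):
    """Smallest n that reaches the target power (requires effect_size_d > 0)."""
    n = 2
    while power_one_sided_z(effect_size_d, n, alpha) < target_power:
        n += 1
    return n

for n in (10, 20, 40):
    print(n, round(power_one_sided_z(0.5, n), 3))
print("n needed for d=0.5:", required_n(0.5))
```

Raising α in this sketch increases power as well, at the price of more false positives, which is the trade-off described in the text.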

Assumptions t-test:
- Normality: two reasons for assuming normality:
o When samples are small (<40) we cannot appeal to the CLT to justify that the
sampling distribution of means is a t-distribution.
o Because we have to estimate the population SD, the estimated population
mean and the SE will not be independent unless the dependent variable
has a normal distribution.
- Homogeneity of variance only applies to situations in which two groups are
compared. It states the population variances should be equal for both groups.
- Independence of observations: the score of a person is not influenced by the score of
other people.

Haans (2008): What does it mean to be average? The miles per gallon versus gallons
per mile paradox revisited.
Efficiency paradox (Hand, 1994): Two teams investigated the efficiency of cars, one English
and one French. The English team measured the number of miles per gallon, while the
French measured the number of gallons per mile. They reached opposite conclusions.

Many statistical analyses are misdirected as the scientific question of interest is not
adequately translated into a statistical question. When the statistical question does not
match the question of interest, researchers receive the right answer to the wrong question.

Hand considers the efficiency paradox to be the result of the concept of fuel efficiency being
ambiguously defined. He proposed to use the gallons per mile calculation or to focus on the
ordinal relations between the cars, which is possible since the order of the cars is the same
for each scale (if one calculates medians instead of means, the paradox disappears).

However, the efficiency paradox is neither the result of an ambiguously defined efficiency
concept, nor the result of how fuel efficiency is measured. What is confusing is that the two
scales are not linearly related. The m/g scale is linear with respect to mileage and the g/m
scale is linear with respect to the amount of fuel consumed.
Fuel efficiency is expressed as a ratio of distance and volume of fuel, therefore it is a
derived measure (e.g. like speed). The concatenation (combination) of derived measures is
not straightforward. When calculating the arithmetic mean you cannot assume that all
observations weigh the same (e.g. the average speed over the trip somewhere and the trip
home). They need to be weighted in proportion to their contribution.

In the example of the cars, all cars are weighted equally, regardless of their efficiencies.
By doing so, the English engineers assumed that each car had an equal volume of fuel in the
tank. The English engineers asked the following question:
- Take a set of n cars which, when each of the cars is given x gallons of fuel, can
together travel a distance of y miles. What would be the efficiency of an average car,
n of which can replace the original set of cars?
The French assumed that, regardless of fuel efficiency, each car traveled an equal distance.
Their question was:
- Take a set of n cars which, when each of the cars travels y meters, together
consume x gallons of fuel. What would be the efficiency of an average car, n of which
can replace the original set of cars?
To answer the same question as the English, the French should have calculated the
harmonic mean.

If the cars are assumed to have equal amounts of fuel in the tank, then the most efficient car
contributes more to the total distance that the cars can travel, than when the cars are
assumed to drive equal distances. Therefore, the English arithmetic average Type I car is
more efficient than the French arithmetic average Type I car. Although both groups of
engineers calculated the arithmetic mean, they have asked different statistical questions. At
least one of two groups should have calculated the harmonic mean to resolve the paradox.
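
The paradox and its resolution can be reproduced with a few made-up numbers. The efficiencies of the two car types below are hypothetical and are not taken from Hand (1994) or Haans (2008).

```python
import statistics

# Hypothetical efficiencies in miles per gallon for two types of cars.
type_1_mpg = [10.0, 30.0]
type_2_mpg = [18.0, 20.0]

def to_gallons_per_mile(mpg_values):
    """Convert miles per gallon into gallons per mile."""
    return [1 / mpg for mpg in mpg_values]

# English team: arithmetic mean of miles per gallon (higher is better).
english_1 = statistics.fmean(type_1_mpg)
english_2 = statistics.fmean(type_2_mpg)

# French team: arithmetic mean of gallons per mile (lower is better).
french_1 = statistics.fmean(to_gallons_per_mile(type_1_mpg))
french_2 = statistics.fmean(to_gallons_per_mile(type_2_mpg))

print("English: Type I", english_1, "vs Type II", english_2)
print("French:  Type I", french_1, "vs Type II", french_2)

# The harmonic mean on one scale matches the arithmetic mean on the other scale.
french_same_question = 1 / statistics.harmonic_mean(to_gallons_per_mile(type_1_mpg))
english_same_question = statistics.harmonic_mean(type_1_mpg)
print(french_same_question, "mpg equals the English mean of", english_1)
print(english_same_question, "mpg equals the French mean of", 1 / french_1)
```

With these numbers the English team ranks Type I as more efficient, while the French team ranks Type II as more efficient, and the harmonic mean reconciles the two scales.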

Slides
Data analysis is all about asking questions about specific populations, based on empirical
data. We need statistics because only a sample of the population of interest can be
considered in the data collection and statistics are used to make inferences about the
population on the basis of a sample.
Every statistic answers a specific question.

Type 3 error: giving the right answer to the wrong question.

Inferential statistics: answering questions about unknown population parameters. Measuring
X for all people in the population of interest is often impossible. Therefore, we need to make
inferences about population parameters on the basis of a sample.

Assumptions of t-test:
- Normality:
o The sampling distribution of means should be a normal distribution (or a
t-distribution). With large samples (n>30) the Central Limit Theorem applies and
the assumption is met. If n<30 it is only met if the variable of interest has a
normal distribution in the population.


Who am I buying these notes from?

Stuvia is a marketplace, so you are not buying this document from us, but from seller Kp2022. Stuvia facilitates payment to the seller.
