Summary

Summary for the course Advanced Data Analysis

Name: Summary for the course Advanced Data Analysis
SKU: doc_602609
Rating: 3.00 (1 reviews)
Author: Kp2022


Institution
Technische Universiteit Eindhoven (TUE)

Summary for the course 0HM120: Advanced Data Analysis. Consists of the following: - Summary of all slides - Mandatory reading materials (lecture_notes_1, Haans (2008), Rosnow and Rosenthal (1995), Haans (2018), Spencer et al. (2005), Zhao et al. (2010) and Borenstein et al. (2009)). ...


Preview: 4 of the 33 pages


Uploaded on 29 October 2019
Number of pages: 33
Written in 2019/2020
Type: Summary


0HM120: Advanced Data Analysis
Descriptive and inferential statistics:
Introduction to descriptive and inferential statistics
Through the use of statistics, we aim to answer questions about an unattainable population
on the basis of a sample.
Dependent variable: the variable that is measured.
Independent variable: the variable that is changed/controlled during the experiment.
Nominal variables classify objects into qualitatively distinct groups.
Binomial variable: nominal variable with two groups.
Ordinal variables: similar to nominal, but the various groups can be rank ordered with
respect to some underlying characteristic.
Interval variable: quantitative but lacks an absolute zero.
Ratio variable: interval with an absolute zero.

Descriptors for central tendency all describe what the most typical value for a certain
variable X in the population is. Common descriptors are the mean, mode and median.
- Mode: value that occurs most often in the population.
- Mean: arithmetic average of the population ($\mu_x$), calculated by:
$\mu_x = \frac{1}{N}\sum_{i=1}^{N} x_i$
- Median: middle value if all values are ordered. If the number of values is even, it is the
arithmetic average of the two middle values.
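
As a small illustration of the three descriptors above, the sketch below uses Python's standard `statistics` module on a made-up list of scores (the data are hypothetical and only serve the example).

```python
import statistics

# Hypothetical scores, treated here as the whole population.
scores = [2, 3, 3, 5, 7, 10]

mode = statistics.mode(scores)      # value that occurs most often
mean = statistics.fmean(scores)     # sum of all x_i divided by N
median = statistics.median(scores)  # even N: average of the two middle values

print(f"mode={mode}, mean={mean}, median={median}")
```

Because the number of scores is even, the median is the average of the two middle values (3 and 5).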

Spread: how much people in the population differ from each other, or from what is typical.
Three different descriptors are:
- Range: difference between the two most extreme values. It provides a numerical
descriptor of the maximum difference between the people in our population.
- Population variance ($\sigma_x^2$) is calculated by: $\sigma_x^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu_x)^2$.
It is the average squared difference between a participant's value and the mean value.
- Standard deviation of the population: $\sigma_x = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu_x)^2}$.
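
The descriptors of spread can be checked numerically. This is a minimal sketch on hypothetical scores, using the population versions (`pvariance`, `pstdev`) from Python's `statistics` module, which divide by N rather than N-1.

```python
import statistics

# Hypothetical scores, treated here as the whole population.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

value_range = max(scores) - min(scores)       # difference between the extremes
pop_variance = statistics.pvariance(scores)   # average squared deviation from the mean
pop_sd = statistics.pstdev(scores)            # square root of the population variance

print(f"range={value_range}, variance={pop_variance}, SD={pop_sd}")
```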

Frequency distribution (population distribution): many variables in psychology approximate
the normal or Gaussian distribution, which is bell-shaped, symmetrical and extends to infinity
in both tails of the distribution. If a variable X is normally distributed in the population, or
approximates a normal distribution, then 68.3% of the population have an x that falls
between the population mean - one SD and the population mean + one SD. 95.5% will fall
between mean-2SD and mean+2SD. This applies to every normal distribution.

All variables with a normal distribution can be transformed into a standard normal distribution
or Z-distribution, which has a mean of 0.00 and an SD of 1.00. For this, each datum on
variable X is converted into its respective score on the standard normal distribution (z-score),
by: $z_i = \frac{x_i - \mu_x}{\sigma_x}$
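
A short sketch of the z-transformation and of the percentages under the normal curve, using `statistics.NormalDist` from the Python standard library. The population mean of 100 and SD of 15 are hypothetical values chosen for illustration.

```python
from statistics import NormalDist

# Hypothetical population parameters (assumed for this example only).
MU_X = 100.0
SIGMA_X = 15.0


def z_score(x, mu=MU_X, sigma=SIGMA_X):
    """Convert a raw score x into its score on the standard normal distribution."""
    return (x - mu) / sigma


std_normal = NormalDist(mu=0.0, sigma=1.0)

# Proportion of a normal population within 1 and within 2 SDs of the mean.
within_1sd = std_normal.cdf(1) - std_normal.cdf(-1)
within_2sd = std_normal.cdf(2) - std_normal.cdf(-2)

print(z_score(130.0), round(within_1sd, 3), round(within_2sd, 3))
```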
$\bar{x}$ = the mean of x in the sample (what we know).
$\mu_x$ = the mean of x in the population (what we want to know).
$\hat{\mu}_x$ = estimated mean of x in the population on the basis of the sample (best we can do).

Central limit theorem (CLT): whatever the distribution of X in the population is, if you take
many large samples (sample sizes of n>40) and calculate the mean of each sample, then
the distribution of these sample means is approximately a normal distribution. The mean of the
sampling distribution of means equals the population mean: $M(\bar{x}) = \mu_x$.
According to the CLT, the SD of the sampling distribution is the Standard Error (SE), which can
be calculated with: $SE_{\bar{x}} = \frac{\sigma_x}{\sqrt{n}}$.
95% of all possible sample means fall within 1.96 SE of the population mean:
$\mu_x - 1.96 \cdot SE_{\bar{x}} < \bar{x} < \mu_x + 1.96 \cdot SE_{\bar{x}}$.
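
The CLT can be illustrated with a small simulation. The sketch below is not from the course material: it assumes a hypothetical, clearly skewed population (exponential with mean 1 and SD 1), draws many samples of n=50 and compares the SD of the sample means with the standard error.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical non-normal population: exponential with mean 1 and SD 1.
POP_MEAN = 1.0
POP_SD = 1.0
n = 50            # size of each sample
n_samples = 5000  # number of samples drawn

sample_means = []
for _ in range(n_samples):
    sample = [random.expovariate(1.0) for _ in range(n)]
    sample_means.append(statistics.fmean(sample))

mean_of_means = statistics.fmean(sample_means)  # should approach POP_MEAN
sd_of_means = statistics.stdev(sample_means)    # should approach the SE
theoretical_se = POP_SD / math.sqrt(n)

# Share of sample means within 1.96 SE of the population mean (about 95%).
lower = POP_MEAN - 1.96 * theoretical_se
upper = POP_MEAN + 1.96 * theoretical_se
share_inside = sum(lower < m < upper for m in sample_means) / n_samples

print(mean_of_means, sd_of_means, theoretical_se, share_inside)
```

Even though the population is skewed, the sample means cluster around the population mean with a spread close to the SE.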

Hypothesis: statement about parameters of populations.
Hypothesis testing: testing whether or not we can faithfully reject such statements, called
the null hypothesis (H0), against the empirical evidence. If we refute, or reject, the null
hypothesis, we do so in favor of an alternative hypothesis (H1).

Type 1 error: false positive, incorrectly rejecting the null hypothesis.
Type 2 error: false negative, not rejecting the null hypothesis when it is false.
Type 3 error: having a good answer, to the wrong question.

P-value: reflects how surprising an observed sample mean is given the value hypothesized
in the null hypothesis. The lower the p-value, the more surprising the observed sample mean
is and the stronger the evidence against the null hypothesis. It answers the question: what is
the probability of finding the observed value, or one that differs even more from the
hypothesized value, under the assumption that the null hypothesis is true?
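
A minimal sketch of how such a p-value could be computed when the population SD is known, so that the sampling distribution of means is normal. The function name and the numbers are hypothetical; with an estimated SD a t-distribution would be needed instead.

```python
import math
from statistics import NormalDist


def two_sided_p(sample_mean, mu_h0, sigma, n):
    """Two-sided p-value for a sample mean, assuming a known population SD."""
    se = sigma / math.sqrt(n)         # standard error of the mean
    z = (sample_mean - mu_h0) / se    # distance from H0 in SE units
    # Probability of a mean at least this far from mu_h0, in either direction.
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical example: observed mean 103, H0: mu = 100, sigma = 15, n = 100.
p = two_sided_p(103.0, 100.0, 15.0, 100)
print(round(p, 4))
```

Here the observed mean lies 2 SE from the hypothesized value, which gives a p-value just below 0.05.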

The population standard deviation can be estimated by:
$\hat{\sigma}_x = s_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$
There are two undesirable consequences of estimating the population SD:
1. SD of the sampling distribution is now based on an estimate as well.
2. The sx is not a particularly good estimate of the population SD. Therefore, the
sampling distribution of means will not exactly be a normal distribution. The smaller
the sample size n is, the less the sampling distribution of means approximates the
normal distribution.
Degrees of freedom: amount of (true and thus non-redundant) information in the data.

Student t-distribution: shape depends on the degrees of freedom (df) or
the effective sample size. When estimating the mean of a population the
degrees of freedom are calculated by n-1. If the degrees of freedom are
very large, the Student t-distribution is similar to the normal distribution.
Independent samples t-test: $t = \frac{(\bar{x}_A - \bar{x}_B) - \Delta\mu_{H_0}}{SE_{\bar{x}_A - \bar{x}_B}}$,
where $\Delta\mu_{H_0}$ is the difference hypothesized under H0. The SE is the SD of the
sampling distribution of difference scores; it is calculated from the observed data as follows:
$SE_{\bar{x}_A - \bar{x}_B} = S_p \sqrt{\frac{1}{n_A} + \frac{1}{n_B}}$.

$S_p$ is the estimated pooled SD of the population:
$\hat{\sigma}_p = S_p = \sqrt{\frac{(n_A - 1) S_A^2 + (n_B - 1) S_B^2}{n_A + n_B - 2}}$.
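
The pooled-SD t-test can be worked through by hand. The sketch below is an illustration on two small hypothetical groups; the function name `independent_t` is an assumption made for this example, and it returns only the t-value and its ingredients, not a p-value.

```python
import math
import statistics


def independent_t(group_a, group_b, delta_h0=0.0):
    """Independent samples t-test with a pooled SD; returns (t, s_p, se, df)."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance S_A^2 (divides by n-1)
    var_b = statistics.variance(group_b)  # sample variance S_B^2
    df = n_a + n_b - 2
    s_p = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / df)  # pooled SD
    se = s_p * math.sqrt(1 / n_a + 1 / n_b)  # SE of the difference in means
    diff = statistics.fmean(group_a) - statistics.fmean(group_b)
    t = (diff - delta_h0) / se
    return t, s_p, se, df


# Hypothetical data for two independent groups.
group_a = [5, 6, 7, 8, 9]
group_b = [3, 4, 5, 6, 7]
t_value, s_p, se, df = independent_t(group_a, group_b)
print(t_value, s_p, se, df)
```

The resulting t-value would then be compared against a Student t-distribution with n_A + n_B - 2 degrees of freedom.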

Power analysis: deciding on a reasonable sample size for your study.
Effect size: difference between the hypothesized and an expected or estimated value.

Cohen's d: standardized effect size. A d of 0.2 is small, 0.5 is moderate and 0.8 is large.
When comparing a single group mean against a hypothesized value, it can be calculated as:
$\hat{d} = \frac{\bar{x} - \mu_{H_0}}{s_x}$.
When two independent groups are compared, it can be calculated with:
$\hat{d} = \frac{(\bar{x}_A - \bar{x}_B) - \Delta\mu_{H_0}}{s_p}$.
In the case of the independent samples t-test it is: $\hat{d} = t \sqrt{\frac{1}{n_A} + \frac{1}{n_B}}$.
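
The three ways of obtaining Cohen's d can be compared in a short sketch. The data and function names are hypothetical; the example only shows that d computed from the pooled SD and d computed from the t-value agree.

```python
import math
import statistics


def d_one_sample(sample, mu_h0):
    """Cohen's d for a single group mean against a hypothesized value."""
    return (statistics.fmean(sample) - mu_h0) / statistics.stdev(sample)


def d_independent(group_a, group_b, delta_h0=0.0):
    """Cohen's d for two independent groups, using the pooled SD."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * statistics.variance(group_a)
                  + (n_b - 1) * statistics.variance(group_b)) / (n_a + n_b - 2)
    diff = statistics.fmean(group_a) - statistics.fmean(group_b)
    return (diff - delta_h0) / math.sqrt(pooled_var)


def d_from_t(t, n_a, n_b):
    """Cohen's d recovered from an independent samples t-value."""
    return t * math.sqrt(1 / n_a + 1 / n_b)


# Hypothetical data; the t-value for these two groups is 2.0.
group_a = [5, 6, 7, 8, 9]
group_b = [3, 4, 5, 6, 7]
d_groups = d_independent(group_a, group_b)
d_t = d_from_t(2.0, len(group_a), len(group_b))
d_single = d_one_sample(group_a, mu_h0=6.0)
print(d_groups, d_t, d_single)
```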
95% confidence interval (CI): the frequentist interpretation is that 95% of all CIs, as estimated on
the basis of all possible (hypothetical) samples, will enclose the population mean within their
intervals.

Bootstrapping: taking a large number of samples of sample size n from your original sample.
The size n of each bootstrap sample should be equal to the size of the sample you took. For
example, consider a sample of n=8: first take 1000 random samples of n=8 (with
replacement) and calculate the mean of each sample. Then order the 1000 bootstrap means
from smallest to largest and find the 25th and 976th. These are the lower and upper
bound of the confidence interval. This is the percentile-based method for calculating 95%
bootstrap confidence intervals.
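
The percentile method described above can be sketched in a few lines. The sample values, the seed and the function name are hypothetical; the positions of the bounds (25th and 976th of 1000 ordered means) follow the text.

```python
import random
import statistics


def bootstrap_ci_95(sample, n_boot=1000, seed=0):
    """Percentile-based 95% bootstrap CI for the mean of `sample`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    boot_means = []
    for _ in range(n_boot):
        resample = rng.choices(sample, k=n)  # draw n values with replacement
        boot_means.append(statistics.fmean(resample))
    boot_means.sort()
    k = int(0.025 * n_boot)  # 25 when n_boot is 1000
    # 25th and 976th ordered means (1-based), i.e. indices 24 and 975.
    return boot_means[k - 1], boot_means[n_boot - k]


# Hypothetical original sample of n=8.
sample = [4, 8, 15, 16, 23, 42, 7, 11]
lower, upper = bootstrap_ci_95(sample)
print(lower, upper)
```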

Counter-null hypothesis: alternative value for the null-hypothesis. One that yields the same
p-value as when the observed difference is tested against the null-hypothesis that the
difference is zero.
The null hypothesis is rejected when the difference of 0 is not included in the confidence
interval of the observed difference.

Power: long-run probability of rejecting the null hypothesis when it is false. It reflects the
sensitivity of your testing procedure. It can be increased by setting the significance level higher
than α=0.05, but then the long-run probability of incorrectly rejecting the null hypothesis will
increase as well. The best you can do is increase the sample size, which reduces the spread of the
sampling distribution, so that the critical value moves closer to the hypothesized value under H0.
Power analysis: determining the sample size needed to obtain a desired power.
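
To make the relation between power, α and sample size concrete, here is a rough sketch for a two-sided one-sample test. It assumes a known population SD (a z-test rather than a t-test), so the numbers are approximations; the function names and the effect size d=0.5 are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_z_test(d, n, alpha=0.05):
    """Approximate power of a two-sided one-sample z-test for effect size d."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)  # critical value under H0
    shift = d * math.sqrt(n)                    # true effect in SE units
    # Probability that the test statistic lands beyond either critical value.
    return (1 - STD_NORMAL.cdf(z_crit - shift)) + STD_NORMAL.cdf(-z_crit - shift)


def required_n(d, power=0.80, alpha=0.05):
    """Smallest n for which the approximate power reaches the desired level."""
    n = 2
    while power_z_test(d, n, alpha) < power:
        n += 1
    return n


print(power_z_test(0.5, 20), power_z_test(0.5, 40), required_n(0.5))
```

With d=0 the function returns α itself, which is the long-run rate of incorrectly rejecting a true null hypothesis.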

Assumptions t-test:
- Normality: two reasons for assuming normality:
o When samples are small (<40) we cannot appeal to the CLT and rely on
the sampling distribution of means being a t-distribution.
o Because we have to estimate the population SD, the estimated population
mean and the SE will not be independent unless the dependent variable
has a normal distribution.
- Homogeneity of variance only applies to situations in which two groups are
compared. It states the population variances should be equal for both groups.
- Independence of observations: the score of a person is not influenced by the score of
other people.

Haans (2008): What does it mean to be average? The miles per gallon versus gallons
per mile paradox revisited.
Efficiency paradox (Hand, 1994): Two teams investigated the efficiency of cars, one English
and one French. The English team measured the number of miles per gallon, while the
French measured the number of gallons per mile. They reached opposite conclusions.

Many statistical analyses are misdirected as the scientific question of interest is not
adequately translated into a statistical question. When the statistical question does not
match the question of interest, researchers receive the right answer to the wrong question.

Hand considers the efficiency paradox to be the result of the concept of fuel efficiency being
ambiguously defined. He proposed to use the gallons per mile calculation or to focus on the
ordinal relations between the cars, which is possible since the order of the cars is the same
for each scale (if one calculates medians instead of the mean, the paradox disappears).

However, the efficiency paradox is neither the result of an ambiguously defined efficiency
concept, nor the result of how fuel efficiency is measured. What is confusing is that the two
scales are not linearly related. The m/g scale is linear with respect to mileage and the g/m
scale is linear with respect to the amount of fuel consumed.
Fuel efficiency is expressed as a ratio of distance to volume of fuel, therefore it is a
derived measure (e.g. like speed). The concatenation (combining) of derived measures is not
straightforward. When calculating the arithmetic mean, you cannot assume that all values weigh
the same (e.g. the average speed over the trip somewhere and the trip home). They need to
be weighted in proportion to their contribution.

In the example of the cars, all cars are weighted equally, regardless of their efficiencies.
By doing so, the English engineers implicitly assumed that each car had an equal volume of fuel
in the tank. They asked the following question:
- Take a set of n cars which, when each of the cars is given x gallons of fuel, can
together travel a distance of y miles. What would be the efficiency of an average car,
n of which can replace the original set of cars?
The French assumed that, regardless of fuel efficiency, each car traveled an equal distance.
Their question was:
- Take a set of n cars which, when each of the cars travels y meters, together
consume x gallons of fuel. What would be the efficiency of an average car, n of which
can replace the original set of cars?
To answer the same question as the English, the French should have calculated the
harmonic mean.

If the cars are assumed to have equal amounts of fuel in the tank, then the most efficient car
contributes more to the total distance that the cars can travel, than when the cars are
assumed to drive equal distances. Therefore, the English arithmetic average Type I car is
more efficient than the French arithmetic average Type I car. Although both groups of
engineers calculated the arithmetic mean, they have asked different statistical questions. At
least one of two groups should have calculated the harmonic mean to resolve the paradox.
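
The paradox and its resolution can be reproduced with made-up numbers. The efficiencies below are hypothetical and not taken from Hand (1994) or Haans (2008); they are chosen so that the two arithmetic means point in opposite directions.

```python
import statistics

# Hypothetical efficiencies in miles per gallon (mpg) for two car types.
type_1_mpg = [10.0, 40.0]
type_2_mpg = [20.0, 25.0]


def to_gpm(mpg_values):
    """Convert miles per gallon into gallons per mile."""
    return [1.0 / v for v in mpg_values]


# English team: arithmetic mean of mpg (higher is more efficient).
english_1 = statistics.fmean(type_1_mpg)
english_2 = statistics.fmean(type_2_mpg)

# French team: arithmetic mean of gallons per mile (lower is more efficient).
french_1 = statistics.fmean(to_gpm(type_1_mpg))
french_2 = statistics.fmean(to_gpm(type_2_mpg))

# The harmonic mean of mpg answers the French question on the English scale.
harmonic_1 = statistics.harmonic_mean(type_1_mpg)
harmonic_2 = statistics.harmonic_mean(type_2_mpg)

print(english_1, english_2)    # Type 1 looks more efficient in mpg
print(french_1, french_2)      # Type 1 looks less efficient in gallons per mile
print(harmonic_1, harmonic_2)  # harmonic mpg agrees with the French ordering
```

The harmonic mean of the mpg values is exactly the reciprocal of the arithmetic mean of the gallons-per-mile values, which is why switching to it makes both teams answer the same question.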

Slides
Data analysis is all about asking questions about specific populations, based on empirical
data. We need statistics because only a sample of the population of interest can be
considered in the data collection and statistics are used to make inferences about the
population on the basis of a sample.
Every statistic answers a specific question.

Type 3 error: giving the right answer to the wrong question.

Inferential statistics: answering questions about unknown population parameters. Measuring
X for all people in the population of interest is often impossible. Therefore, we need to make
inferences about population parameters on the basis of a sample.

Assumptions of t-test:
- Normality:
o The sampling distribution of means should be a normal distribution (or a
t-distribution). With large samples (n>30) the Central Limit Theorem applies and
the assumption is met. If n<30 it is only met if the variable of interest has a
normal distribution in the population.
