Summary statistics
EXAM 2/11/2020 15:00-17:00
All weekly steps should be enough to prepare for the exam
You will not have to use RStudio during the exam. You do have to be able to recognize
functions from the lab sessions
It will be an online exam that you take at home.
There will be both open-ended and MC questions
25 MC questions
6 open-ended questions, not very long answers like the lab sessions
The webinar exercises are most representative for the exam (and the practice exam, of
course). All calculations you need to do will be discussed in the webinar exercises. So if you can do
these exercises, you are well prepared for the calculations, but the exam also includes
theoretical questions
Week 1 - introduction
With statistics you can review and analyse the results of experiments
Statistical analyses are used to understand the data, for:
- descriptive statistics: summarizing/describing the characteristics of a sample
• Describe (sample) data without drawing conclusions
• Measures of central tendency: show which values are typical, f.e. Mean, Median, Mode
• Measures of variation/dispersion (spread): show how variable the data are, f.e. range, IQR,
variance, standard deviation
f.e.: the mean speech rate of the sample of a hundred Belgian Dutch speakers
- inferential statistics: relating variables to each other and evaluating the relationships between
variables (generalising the outcome of a sample to a population).
• Using the characteristics of a sample to draw conclusions about the entire population
f.e.: the speech rate of Dutch speaking people from the Netherlands is significantly higher than
the speech rate of Dutch speaking Belgians, based on a sample of 200 people.
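The descriptive measures named above can be computed as in the following minimal sketch. It uses Python's standard `statistics` module rather than the R functions from the lab sessions, and the speech rates are made-up values for illustration only. Note that quartiles (and so the IQR) can differ slightly between programs, because several calculation methods exist.

```python
import statistics

# Hypothetical speech rates (syllables per second) of a small sample.
speech_rates = [4.1, 4.5, 4.5, 4.8, 5.0, 5.2, 5.6, 6.3]

# Measures of central tendency: which values are typical?
mean = statistics.mean(speech_rates)
median = statistics.median(speech_rates)
mode = statistics.mode(speech_rates)

# Measures of variation: how variable are the data?
value_range = max(speech_rates) - min(speech_rates)
q1, q2, q3 = statistics.quantiles(speech_rates, n=4)  # quartiles (Python's default method)
iqr = q3 - q1
variance = statistics.variance(speech_rates)  # sample variance (divides by n - 1)
sd = statistics.stdev(speech_rates)           # sample standard deviation

print(f"mean={mean:.2f} median={median:.2f} mode={mode}")
print(f"range={value_range:.2f} IQR={iqr:.2f} variance={variance:.2f} sd={sd:.2f}")
```

These numbers only describe the sample; they do not yet say anything about the population.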
statistical tests to relate a sample to a population:
• Comparing two groups to each other, or one group to a fixed value
• Associating 2 variables
• The internal consistency of questions in a questionnaire
For both kinds of statistics, the data has to be variable. This means that the cases we compare have
different values.
For example: if you want to know whether the amount of beer a country produces depends on the amount
of beer consumed in that country, it does not make sense to investigate this if the amount of beer
consumed is exactly the same in every country.
Population= a group representing all objects of interest
For example: for an investigation focussing on the speech rate of Dutch speakers in the Netherlands
versus Belgium, the population is all Dutch-speaking people in the Netherlands and Belgium.
Parameters = the values obtained from a population
f.e.: the mean speech rate of all Dutch speakers from Belgium
Sample= a subset of the population that is used to represent it, without having to investigate the entire population
f.e.: the speech rate of one hundred Dutch speakers from Belgium
NL ‘steekproef’
statistics =
1 the method to analyse data
2 the measurements that are obtained from a sample
f.e.: the mean speech rate of the sample of a hundred Belgian Dutch speakers
Important: sample has to be representative for the population
sampling error= the difference between the sample statistic and the population parameter.
The smaller the sampling error, the more representative the sample.
Random sampling= the best way to draw a representative sample, because everyone in the
population has an equal chance of getting selected.
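A minimal sketch of random sampling and sampling error, assuming a simulated population (the mean of 5.0, the sd of 0.8 and the sample sizes are invented for illustration). In a real study the population parameter is usually unknown, so the sampling error cannot be computed directly like this. The sketch also shows something the notes do not state explicitly: larger random samples tend to have a smaller sampling error.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population: speech rates of 10,000 speakers.
population = [random.gauss(5.0, 0.8) for _ in range(10_000)]
parameter = statistics.mean(population)  # population parameter

def sampling_error(n):
    """Draw a simple random sample of size n; return sample statistic minus parameter."""
    sample = random.sample(population, n)  # every unit has an equal chance of selection
    return statistics.mean(sample) - parameter

# Average absolute sampling error over 200 random samples of each size.
avg_small = statistics.mean(abs(sampling_error(10)) for _ in range(200))
avg_large = statistics.mean(abs(sampling_error(500)) for _ in range(200))

print(f"average |sampling error| with n=10:  {avg_small:.3f}")
print(f"average |sampling error| with n=500: {avg_large:.3f}")
```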
Representative sampling= using a sample that represents certain characteristics of a population,
such as ethnic group or sex.
Downside: you could overlook certain variables
Convenience sampling: the least reliable way of sampling, but the most frequently used. It means
you use data that is easily accessible, but therefore also less random.
We always need 2 types of hypothesis for statistical reasoning:
1. Research hypothesis/alternative hypothesis (Ha): ‘educated guess’
there is a relationship between two measured phenomena
Directional (expecting one value to be bigger than the other, or f.e. ‘if X increases, Y decreases’)
or non-directional (just expecting a difference between the two variables; X ≠ Y)
2. Null hypothesis (H0): there is no relationship between two measured phenomena/variables
If a significant difference is found→reject H0 and accept Ha
If no significant difference is found→retain H0
Distribution= NL ‘verdeling’. How values of a variable are distributed.
You can visualize the distribution in a graph, f.e. with the x-axis representing the value, and the y-axis
the frequency/number of occurrences.
normal distribution (normale verdeling):
• Bell-shaped
• Symmetric
• Area under the curve is 100%
• About 68% of the observations lie within one standard deviation (sd) of the mean
• We can use sd's to describe how far a value lies from the mean
• The mean, mode and median are exactly the same in a normal distribution.
• Standard normal distr.: mean=0, sd=1.
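The properties in this list can be checked numerically. This is a sketch using `NormalDist` from Python's standard library (not the R functions from the lab sessions); the helper name `share_within` is made up for this example.

```python
from statistics import NormalDist

# Standard normal distribution: mean = 0, sd = 1.
standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Share of observations that lie within k sd's of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

within_1sd = share_within(1)  # roughly 68%
within_2sd = share_within(2)  # roughly 95%

# Mean, median and mode coincide in a normal distribution.
same_centre = standard_normal.mean == standard_normal.median == standard_normal.mode

print(f"within 1 sd: {within_1sd:.3f}, within 2 sd: {within_2sd:.3f}")
print(f"mean = median = mode: {same_centre}")
```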
p-value (probability value): the probability of obtaining a result at least as extreme as the one observed, in case H0 is true.
So: how big is the chance that you record this value ‘by chance’?
If the p-value is smaller than the alpha level (p<α), H0 can be rejected
The p-value says something about the chance of finding this particular result in random samples from
the population.
The p-value is not itself the chance of a type one error: that chance is set by the significance level α (see below).
One-sided to two-sided test: p*2
Two-sided to one-sided test: p/2 (provided the effect is in the direction predicted by Ha)
So: a one-sided test is more likely to give a significant result
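A small sketch of the relation between one-sided and two-sided p-values. It assumes a test statistic that follows the standard normal distribution under H0 (a z-test); the value z = -1.8 and the function name `p_values` are invented for illustration. Other tests, such as the t-test, work the same way but use a different distribution.

```python
from statistics import NormalDist

def p_values(z):
    """Return (left-sided, right-sided, two-sided) p-values for a z statistic."""
    std = NormalDist()         # distribution of z if H0 is true
    p_left = std.cdf(z)        # Ha: the value is smaller than under H0
    p_right = 1 - std.cdf(z)   # Ha: the value is larger than under H0
    p_two = 2 * min(p_left, p_right)  # Ha: the value is different (either direction)
    return p_left, p_right, p_two

alpha = 0.05
p_left, p_right, p_two = p_values(-1.8)

print(f"one-sided (left):  p={p_left:.4f}  significant: {p_left < alpha}")
print(f"one-sided (right): p={p_right:.4f}  significant: {p_right < alpha}")
print(f"two-sided:         p={p_two:.4f}  significant: {p_two < alpha}")
```

With this made-up z-value the left-sided test is significant at α = 0.05 while the two-sided test is not. The right-sided p-value is large, which shows that halving the two-sided p-value only works when the effect lies in the predicted direction.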
significance level= boundary for the chance that you reject H0, even if it is true (type I error)
represented by the α-value (default 0.05)
the smaller the significance level, the smaller the chance of a type one error and the bigger the
chance of a type two error
type one error= ‘false positive’, rejecting a true H0
type two error= ‘false negative’, retaining H0 even though Ha is true (failing to reject a false H0)
type two errors are often caused by low power, f.e. because n, the sample size, is very small (not many
individual cases were tested)
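The two error types can be illustrated with a simulation. This is a sketch under stated assumptions: a two-sided z-test with a known sd of 1, simulated normally distributed data, and invented values for the true mean and the sample sizes. The function names are hypothetical.

```python
import random
from statistics import NormalDist, mean

random.seed(42)  # fixed seed so the sketch is reproducible
alpha = 0.05
std = NormalDist()

def z_test_p(sample, mu0, sigma):
    """Two-sided p-value of a z-test of H0: population mean = mu0."""
    z = (mean(sample) - mu0) / (sigma / len(sample) ** 0.5)
    return 2 * (1 - std.cdf(abs(z)))

def rejection_rate(true_mu, n, runs=2000):
    """Share of simulated samples in which H0 (mu = 0) is rejected."""
    rejections = 0
    for _ in range(runs):
        sample = [random.gauss(true_mu, 1.0) for _ in range(n)]
        if z_test_p(sample, mu0=0.0, sigma=1.0) < alpha:
            rejections += 1
    return rejections / runs

# H0 is true: every rejection is a type one error, so the rate is close to alpha.
type_one_rate = rejection_rate(true_mu=0.0, n=20)

# H0 is false: the rejection rate is the power; 1 - power is the type two error rate.
power_small_n = rejection_rate(true_mu=0.5, n=10)
power_large_n = rejection_rate(true_mu=0.5, n=50)

print(f"type one error rate: {type_one_rate:.3f}")
print(f"type two error rate with n=10: {1 - power_small_n:.3f}")
print(f"type two error rate with n=50: {1 - power_large_n:.3f}")
```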
Effect size: how big the effect is, in other words: how strong the relationship/association between two
variables is
Used to quantify the difference between two variables
n increases→Effect size stays the same
n increases→p-value: lower (p-values depend on sample size and effect size)
n increases→t-value: higher in absolute value (see week 4; because the standard error gets smaller, so the same difference is divided by a smaller number)
The larger the sample, the sooner you will get a significant result
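The effect of n can be made concrete with a one-sample t-test, where t = (sample mean - H0 value) / (sd / √n). The sketch below uses Cohen's d as the effect size, which is one common measure that these notes do not name, so treat it as an assumption. The mean of 5.3, the H0 value of 5.0 and the sd of 0.9 are invented, and the sample mean and sd are assumed to stay the same while n grows.

```python
import math

def one_sample_t(sample_mean, mu0, sd, n):
    """t-value of a one-sample t-test: difference divided by the standard error."""
    standard_error = sd / math.sqrt(n)
    return (sample_mean - mu0) / standard_error

def cohens_d(sample_mean, mu0, sd):
    """Effect size (Cohen's d): the difference expressed in standard deviations."""
    return (sample_mean - mu0) / sd

# The effect size does not depend on n ...
d = cohens_d(5.3, 5.0, 0.9)

# ... but the t-value grows with the square root of n.
t_values = {n: one_sample_t(5.3, 5.0, 0.9, n) for n in (10, 40, 160)}

for n, t in t_values.items():
    print(f"n={n:>3}  d={d:.2f}  t={t:.2f}")
```

A larger t-value corresponds to a smaller p-value, which is why the same effect becomes significant sooner in a larger sample.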
one-tailed or two-tailed test:
One-tailed is used for a directional hypothesis.
f.e.: Ha = X<Y
you want to look at the left end/tail of the curve, because that is where X is smaller than Y
the end of the curve investigated is the shaded part
p-value of a one-sided test is half the size of the p-value of a two-sided test (when the effect is in the predicted direction)
Variable = a characteristic of a test object/the individuals that you study that does not have a fixed
value. The value of a variable can be measured.
f.e.: variables of a set of words are word length and word class/gender
-univariate
-bivariate
-multivariate
Units: persons or objects you are studying, who have certain characteristics
Variables: the characteristics, they can have different values
Values: the values the variables can have
Response= dependent variable: the variable whose value changes as a result of some other variable of
interest
Explanatory = independent variable: the variable that influences the outcome (and determines the
response value)
Example: you are investigating if word frequency influences the response time (time it takes people
to recognize the word). The explanatory variable is the word frequency, the response variable is the
response time.
Measurement levels:
The measurement level of a variable determines the possibilities for statistical analysis (from low to
high):
• Nominal
unordered categories
frequency table is possible
f.e. country of birth, favorite band
binary variable: variable with two levels, f.e. male/female
• Ordinal
ordered (ranked) scale, amount of difference between categories is unclear: intervals between
the scale points are not exactly the same
f.e. Likert-scale (strongly agree – agree – neither agree nor disagree – disagree – strongly
disagree / please rate on a scale from 1 to 5…) or year of birth in groups (After 1950, Between
1941 and 1950, Between 1931 and 1940, Between 1921 and 1930, In 1920 or earlier) or
education levels (MBO, HBO, WO etc.)
• Interval
Numerical with meaningful difference (so the intervals between the scale points are the
same/’known distances’) but no true 0
f.e. temperature in degrees Celsius or year of birth in numbers
do not allow for multiplication
• Ratio
numerical with meaningful difference and true 0 (value of 0 has a clear meaning)
f.e.: the number of occurrences of a certain word in a text or travel time to work
Allow for multiplication: if in text A there are ten occurrences of a word and in text B 100, you can
say that the word occurs 10 times more often in text B
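Why interval variables do not allow for multiplication while ratio variables do can be shown with a short calculation. This is a sketch with invented temperatures; the word counts are the ones from the example above.

```python
def celsius_to_fahrenheit(c):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return c * 9 / 5 + 32

# Interval scale: no true 0, so ratios depend on the unit that is used.
cold_c, warm_c = 10.0, 20.0
ratio_celsius = warm_c / cold_c  # looks like 'twice as warm'
ratio_fahrenheit = celsius_to_fahrenheit(warm_c) / celsius_to_fahrenheit(cold_c)

# Differences are meaningful: equal steps in Celsius are equal steps in Fahrenheit.
step_1 = celsius_to_fahrenheit(20.0) - celsius_to_fahrenheit(10.0)
step_2 = celsius_to_fahrenheit(30.0) - celsius_to_fahrenheit(20.0)

# Ratio scale: true 0 (no occurrences at all), so the ratio is meaningful.
count_text_a, count_text_b = 10, 100
ratio_counts = count_text_b / count_text_a

print(f"ratio in Celsius: {ratio_celsius:.2f}, same ratio in Fahrenheit: {ratio_fahrenheit:.2f}")
print(f"equal differences: {step_1:.1f} and {step_2:.1f}")
print(f"the word occurs {ratio_counts:.0f} times more often in text B")
```

The same two temperatures give a different ratio in Celsius and in Fahrenheit, so "twice as warm" has no fixed meaning, whereas the ratio of the word counts does.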
Quantitative/numerical = interval and ratio
Qualitative/categorical = nominal and ordinal
Parametric test= based on assumptions about the distribution of a quantitative variable (f.e. that it is normally distributed)
Non-parametric test= does not assume a specific distribution of the data.
quantitative-scaled variables can be divided in:
-continuous
The value can be divided, f.e. height in cm (something can be 1 or 2, but also 1.75 cm high)