SUMMARY STATISTICS
Exam 1
LECTURE 2
-Descriptive statistics
Statistics: why & when?
A lot of data needs to be processed, depending on the type of question.
• Climate data
• Experimental data
• Survey data
• Etc.
= Obtaining data.
Statistics is less commonly used in qualitative research because open interviews, for
example, are less structured. If these interviews were more structured, you'd lose data,
which is why statistics is not applied here.
But both are important (qualitative + quantitative)!
Statistics is a toolkit with a lot of tools, for example because there
are different types of data. Within the course, you start to build this
toolkit. Thereafter: which tool do I have to use? And how do I use it,
what are the results, which conclusions follow, etc.?
An example: measuring differences in wind
Are winds stronger at the coast than in the interior?
Difficulties/issues
• How to measure? At what height, which
instrument/scale?
• Variability of data, different places, moments,
times of day
We limit ourselves in what we measure
In this case… →
In every study, thus, the sample sets boundaries.
You can try to detect differences. This, however,
does not give an answer in numbers, just conclusions
regarding similarities/differences (descriptive
statistics).
For now, descriptive statistics.
Measurement 1: Beaufort scale, 0-12.
Was based on sailing conditions.
Higher score = stronger wind, so ordinal.
Measurement 2: wind velocity in m/s or
km/h. Scale 0-infinity.
Similar intervals → interval scale! Absolute
zero is meaningful → ratio scale!
SPSS: a "scale" variable covers interval + ratio
Measurement 3
The order of the scores is not in line with the strength of the wind:
zero, namely, is stronger. This indicates a
nominal scale!
Data matrix
If you have a lot of data, it needs to be stored in a
matrix.
Column: a variable (characteristic); row: a case or observation (its scores on the variables)
Data storage in itself does not tell much; it is the basis for statistical analysis (in SPSS)
For example, presenting the data in a frequency table or
plotting them in a graph could help. Then you can draw conclusions.
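As a rough illustration of going from a data matrix to a frequency table, here is a minimal Python sketch. The rows, the variable names (station, month, beaufort) and the scores are invented for the example; they are not the lecture data, and the lecture itself uses SPSS for this step.

```python
from collections import Counter

# Hypothetical data matrix: each row is one observation (case),
# each key is one variable (column). Values are made up.
data_matrix = [
    {"station": "Den Helder", "month": "July", "beaufort": 3},
    {"station": "Den Helder", "month": "July", "beaufort": 4},
    {"station": "Den Helder", "month": "March", "beaufort": 3},
    {"station": "De Bilt", "month": "July", "beaufort": 2},
    {"station": "De Bilt", "month": "March", "beaufort": 3},
    {"station": "Den Helder", "month": "March", "beaufort": 5},
]


def frequency_table(rows, variable):
    """Count how often each score of `variable` occurs, sorted by score."""
    counts = Counter(row[variable] for row in rows)
    return dict(sorted(counts.items()))


table = frequency_table(data_matrix, "beaufort")
for score, frequency in table.items():
    print(f"Beaufort {score}: {frequency}")
```

The same function can be applied per month or per station by first filtering the rows, which is what comparing several graphs amounts to.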
Bar chart: the lowest scores occur most often in July → less wind. However,
a bar chart is not very useful here because wind cannot really be divided
into separate classes 0, 1, 2… So, polygons!
Most wind → March (graph furthest to the
right). Most constant wind →
which of the graphs has the highest
peak frequency? July!
Objection: an ordinal scale has no equal intervals. The graph,
however, suggests that it does. A m/s scale can prevent this.
Graph is skewed to the right: long tail
on the right side. The question,
however, was focused on the difference
between De Bilt and Den Helder, so you
have to compare different graphs.
More wind in Den Helder (more to the right). How large is the
difference?
• Cumulative distribution
• Difference between centers, relative to the spread of the distributions
Cumulative distribution: compare the frequency per
score with the cumulative frequency (graph). A large frequency
added → steep increase in the line.
Difference between the lines: ∆ = the maximum distance between them.
In this case, ∆cp = 35.5 (85.5 − 50). ∆ > 30 is deemed
large.
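The maximum difference between two cumulative lines can be sketched as follows. The frequencies below are hypothetical (they do not reproduce the 35.5 from the lecture), and the function names are chosen for this example only.

```python
from itertools import accumulate


def cumulative_percentages(frequencies):
    """Turn frequencies per score into cumulative percentages."""
    total = sum(frequencies)
    return [100 * running / total for running in accumulate(frequencies)]


def max_cumulative_difference(freq_a, freq_b):
    """Largest vertical distance between two cumulative percentage lines."""
    cum_a = cumulative_percentages(freq_a)
    cum_b = cumulative_percentages(freq_b)
    return max(abs(a - b) for a, b in zip(cum_a, cum_b))


# Hypothetical frequencies for Beaufort scores 0..6 (made-up numbers).
inland = [5, 20, 35, 25, 10, 5, 0]
coast = [0, 5, 15, 30, 30, 15, 5]

delta_cp = max_cumulative_difference(inland, coast)
print(delta_cp)                              # maximum difference in percentage points
print("large" if delta_cp > 30 else "not large")  # rule of thumb from the notes
```

Because only cumulative percentages are compared, this measure relies on the order of the scores alone, not on equal distances between them.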
Centers: the two means of the frequency distributions. Difference =
1.113. What does this tell us? We first have to look at the
intermezzo!
Intermezzo
Some very important tools
• Mean (sum of scores / number of scores)
• Deviation (difference between a score and the mean score, so score – x̄); the basis of dispersion
• Variance (sum of squared deviations (SS) divided by the degrees of freedom (n – 1))
• Standard deviation (square root of the variance: s = √(SS / (n – 1)) = √s²)
Mean deviation = not helpful in this case because it is always zero (positive and negative deviations cancel out)
Absolute = ignore the sign that is in front of it (so ignoring the –)
Or squared, which also leaves only positive numbers
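The tools of the intermezzo can be worked through on a small example. This is a minimal sketch with made-up scores; the function name describe and the dictionary keys are assumptions for this illustration.

```python
import math


def describe(scores):
    """Compute mean, deviations, variance (n - 1) and standard deviation."""
    n = len(scores)
    mean = sum(scores) / n
    deviations = [x - mean for x in scores]          # score - mean
    sum_of_squares = sum(d ** 2 for d in deviations)  # SS
    variance = sum_of_squares / (n - 1)               # s squared
    return {
        "mean": mean,
        # The plain mean deviation is zero: plus and minus cancel out.
        "mean_deviation": sum(deviations) / n,
        # Ignoring the sign gives a usable (absolute) alternative.
        "mean_absolute_deviation": sum(abs(d) for d in deviations) / n,
        "variance": variance,
        "sd": math.sqrt(variance),
    }


scores = [2, 3, 3, 4, 5, 7]  # hypothetical scores, not the lecture data
result = describe(scores)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Squaring, rather than taking absolute values, is the route that leads to the variance and the standard deviation.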
Back to example:
Mean Den Helder = 3.606, s² = 1.878, standard deviation = 1.371
The difference in means (1.113) did not say much on its own. Distribution! Standard deviation!
Mean st. dev = 1.180
Effect size D = 1.113 / 1.180 = 0.94 (effect of difference). If D > 0.8 = strong effect.
D = Difference means / mean standard deviation
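The calculation of D can be checked with a few lines of Python. The two input numbers come from the notes; the function name and the threshold constant are naming choices for this sketch.

```python
STRONG_EFFECT_THRESHOLD = 0.8  # rule of thumb from the notes: D > 0.8 = strong


def effect_size_d(mean_difference, mean_sd):
    """Effect size D = difference in means / mean standard deviation."""
    return mean_difference / mean_sd


d = effect_size_d(1.113, 1.180)
is_strong = d > STRONG_EFFECT_THRESHOLD
print(round(d, 2), "strong" if is_strong else "not strong")  # 0.94 strong
```

Dividing by the mean standard deviation expresses the difference in units of spread, which is why 1.113 only becomes interpretable after this step.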
Are the metrics mean, sd, and effect size D
appropriate?
No:
• Ordinal scale! (differences between scores = meaningless)
• Skewed to the right (not symmetrical). Outliers bias mean scores.
Yes:
• Both distributions are almost normal
Median/quartiles
Alternatives for ordinal measures & skewed distributions
Median = middle score (read off at 50% in the cumulative graph)
Quartiles = dispersion (interquartile range)
First 25% = first quartile
50% = median
Constructing a boxplot: long tail on the right
Box = middle 50%, so…
25% - 50% ; dashed line = median ; 50% - 75%
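The numbers behind a boxplot can be computed with Python's standard statistics module. This is a sketch on made-up, right-skewed scores; quartile conventions differ between software packages, so the "inclusive" method used here is one choice among several and need not match SPSS exactly.

```python
import statistics

# Hypothetical right-skewed scores (one high outlier), not the lecture data.
scores = [1, 2, 2, 3, 3, 3, 4, 5, 9]

median = statistics.median(scores)
# quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
q1, q2, q3 = statistics.quantiles(scores, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range: width of the box (middle 50%)

print("Q1:", q1)
print("median:", median)
print("Q3:", q3)
print("IQR:", iqr)
print("mean:", round(statistics.mean(scores), 2))  # pulled up by the outlier
```

In this example the mean lies above the median because of the single high score, which is the situation where median and quartiles are the safer summary.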
Use of descriptive statistics in research
• Data cleaning
• Data preparation
• Providing insight into dataset
First part of the results section = descriptive statistics of the dataset