Descriptive and inferential statistics
0.1. Introduction
Statistics is about how to handle data (collecting/organizing, processing
and interpreting).
Descriptive:
1. What do the data look like?
2. Summarizing the data obtained.
Inferential:
1. What do the data say about the whole population (generalizing)?
2. Making statements and predictions about the whole population based on
the data.
0.2. Statistical concepts
What do quantitative data look like, and how do we talk about them?
Variables: properties of cases that vary across the different cases.
Case: the thing you are interested in and want to say something about.
Levels of measurement:
Nominal and ordinal are categorical; interval and ratio are quantitative.
Quantitative variables can be:
- Discrete (a set of separate numbers)
- Continuous (an infinite region of values)
1.1. Describing data
Data matrix: a matrix that structures the observations on the variables of
our cases (needed for all statistical analysis).
A data matrix is often huge -> summarize the data.
Frequency table: shows how the values of a variable are distributed over
the cases (the frequency of each value).
Sometimes ordinal categories are needed to make frequency tables more
useful -> you lose some information in exchange for a better overview.
For example:
Nominal                        Ordinal
Weight (kg)   Freq   %         Weight (kg)    Freq   %
...
65.3          2      0.5       Less than 60   8      2
65.4          1      0.25      60 – 69.9      69     17.25
65.5          3      0.50      70 – 79.9      273    68.25
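The step from raw values to a frequency table, and from there to ordinal categories, can be sketched in Python. This is only an illustration: the weights, the helper names `frequency_table` and `weight_category`, and the "80 or more" category are assumptions, not part of the notes.

```python
from collections import Counter

# Hypothetical weights in kg (illustrative data, not the data from the notes).
weights = [58.2, 65.3, 65.3, 65.4, 71.0, 72.5, 78.9, 80.1]


def frequency_table(values):
    """Return {value: (count, percent)} with the values in ascending order."""
    counts = Counter(values)
    n = len(values)
    return {v: (c, 100 * c / n) for v, c in sorted(counts.items())}


def weight_category(kg):
    """Map a weight to an ordinal category (the cut-offs are assumptions)."""
    if kg < 60:
        return "Less than 60"
    if kg < 70:
        return "60 - 69.9"
    if kg < 80:
        return "70 - 79.9"
    return "80 or more"


# Frequency table of the raw values: detailed, but long for real data.
raw_table = frequency_table(weights)

# Frequency table of the categories: less detail, better overview.
grouped_counts = Counter(weight_category(w) for w in weights)

for category, count in grouped_counts.items():
    print(f"{category:<14}{count:>3}{100 * count / len(weights):>8.2f}")
```

Grouping makes the table shorter, but the exact weights can no longer be recovered from it, which is the loss of information mentioned above.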
Frequency tables can be used to make informative graphs. For example:
Nominal/ordinal:
- Pie chart
- Bar graph
A high number of categories can make a pie chart hard to understand.
- Dotplot: a dot for every case at its value of the variable.
Useful for small samples; with a big sample it becomes messy and hard to
understand.
- Histogram (interval/ratio): bars that portray the frequencies of the
possible values of a variable.
Difference from a bar graph: the bars touch, which represents the
underlying continuous scale.
Shapes:
- Bell shaped (unimodal: one peak)
- Skewed to the right
- Skewed to the left
- Two peaks -> bimodal
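How a histogram counts cases per interval can be sketched as follows. The data, the bin settings and the helper name `histogram_counts` are assumptions chosen for illustration; a plotting library would normally do this step for you.

```python
def histogram_counts(values, start, width, bins):
    """Count values per equal-width interval [start + i*width, start + (i+1)*width)."""
    counts = [0] * bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < bins:  # values outside the covered range are ignored
            counts[index] += 1
    return counts


# Hypothetical weights in kg.
weights = [58.2, 61.0, 65.3, 66.8, 71.0, 72.5, 73.1, 78.9, 80.1]
counts = histogram_counts(weights, start=50, width=10, bins=4)

# Text version of the histogram: one row per interval, one '#' per case.
for i, count in enumerate(counts):
    low = 50 + 10 * i
    print(f"{low}-{low + 10}: " + "#" * count)
```

Because the intervals are adjacent on a continuous scale, the bars of the drawn histogram touch. These example counts have a single peak, so the shape is unimodal.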
1.2. Measures of central tendency
As a summary, it can also be useful to describe the center of the data.
Measures of center:
Mode: the value that occurs most frequently (often used with a
nominal/ordinal variable).
Median: the middle value of the observations when arranged from smallest to
largest (with no single middle value -> median = average of the two middle
values).
Mean: the sum of all values divided by the number of observations.
With a nominal variable -> the mode is used as the measure of center.
When outliers are present in the sample, the median is often a better
description of the center of the distribution, because the mean is pulled
toward the extreme values.
Otherwise use the mean.
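A small sketch of the three measures, using Python's standard `statistics` module. The numbers are made up for illustration; the second list adds one extreme value to show how the mean and median react.

```python
import statistics

# Hypothetical observations without an outlier.
values = [2, 3, 3, 4, 5, 7]
# The same observations plus one extreme value.
with_outlier = values + [100]

mode_value = statistics.mode(values)      # most frequent value
median_value = statistics.median(values)  # average of the two middle values (even n)
mean_value = statistics.mean(values)      # sum of values / number of observations

median_outlier = statistics.median(with_outlier)
mean_outlier = statistics.mean(with_outlier)

print("without outlier:", mode_value, median_value, mean_value)
print("with outlier:   ", median_outlier, mean_outlier)
```

In this example the median moves only from 3.5 to 4 when the outlier is added, while the mean jumps from 4 to more than 17.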
1.3. Measures of variation
To describe a distribution, we need more than the measures of central
tendency (two distributions can have the same mode, median and mean
while having different amounts of variation).
Variability = dispersion
Range: highest value - lowest value (the difference).
+ easy to understand
+ simple to compute
- not a good impression of variability: only takes the extreme values into
account
Interquartile range (IQR): divide the distribution into four equal parts;
the IQR is the range between Q3 and Q1.
+ better impression of variability
+ leaves out the extreme values, so it is not affected by outliers
+ divides the distribution into four equal parts
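The contrast between the two measures can be sketched with made-up data. Note the assumption: `statistics.quantiles` with `method="inclusive"` uses an interpolation rule for the quartiles, which may give slightly different values than a hand calculation; the helper names are illustrative.

```python
import statistics


def value_range(values):
    """Range: highest value minus lowest value."""
    return max(values) - min(values)


def iqr(values):
    """Interquartile range Q3 - Q1 (quartiles via the 'inclusive' interpolation rule)."""
    q1, _q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


# Hypothetical samples: identical except for the largest value.
regular = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 90]

print("range:", value_range(regular), value_range(with_outlier))
print("IQR:  ", iqr(regular), iqr(with_outlier))
```

The single extreme value changes the range from 8 to 89, while the IQR stays the same for both samples.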
Boxplot
Q2 = median
Q1 = middle value of the observations to the left of (below) the median
Q3 = middle value of the observations to the right of (above) the median
IQR = Q3 - Q1
Outlier = a value below Q1 - 1.5 x IQR or above Q3 + 1.5 x IQR
The whiskers span the range of the data after removing the outliers.
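The boxplot quantities can be computed by hand as in this sketch. It assumes that, for an odd number of observations, the median itself is left out of both halves (conventions differ between textbooks and software); the function names and the sample are illustrative.

```python
import statistics


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves rule.

    Assumption: with an odd number of values, the median is excluded
    from both the lower and the upper half.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 else data[half:]
    return statistics.median(lower), statistics.median(data), statistics.median(upper)


def find_outliers(values):
    """Values below Q1 - 1.5 x IQR or above Q3 + 1.5 x IQR."""
    q1, _q2, q3 = quartiles(values)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]


# Hypothetical sample with one extreme value.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 30]
outliers = find_outliers(sample)

# The whiskers run to the smallest and largest values that are not outliers.
inside = [v for v in sample if v not in outliers]
whiskers = (min(inside), max(inside))

print(quartiles(sample), outliers, whiskers)
```

For this sample the fences lie at -5 and 15, so only the value 30 is flagged and the whiskers run from 1 to 8.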
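Variance: the sum of squares divided by the sample size:
$$\text{variance} = \frac{\text{sum of squares}}{\text{sample size}} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}$$
Here the sum of squares is the sum of the squared deviations of the
observations $x_i$ from the mean $\bar{x}$, and $n$ is the sample size.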
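A worked example of this calculation, with made-up numbers. It divides by the sample size, as stated in the notes; `statistics.pvariance` uses the same divisor, whereas `statistics.variance` divides by n - 1, a convention many textbooks use for samples.

```python
import statistics

# Hypothetical observations.
data = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(data)
mean = sum(data) / n

# Sum of squares: squared deviations from the mean, added up.
sum_of_squares = sum((x - mean) ** 2 for x in data)

# Variance as in the notes: sum of squares divided by the sample size.
variance = sum_of_squares / n

print(mean, sum_of_squares, variance)
print(statistics.pvariance(data))  # same divisor (n)
print(statistics.variance(data))   # divides by n - 1 instead
```

Here the mean is 5, the sum of squares is 32, and the variance is 32 / 8 = 4.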