Statistics
Lecture 1 – Introduction to Statistics – 26-08-2024
We use statistics to…
- Describe/summarize data -> descriptive statistics
o Reduce the data to understandable pieces of information
o E.g.: what proportion of Dutch adults has a driver’s license?
o E.g.: what is the average delay across all train travels in the Netherlands today?
- Draw inferences about populations -> inferential statistics
o In science we often want to draw conclusions about populations
o E.g.: are COVID-19 vaccines safe and effective in the general population?
o Problem: we can often only make observations on a selection of cases from a
population
o Solution: we can use inferential statistics to find out if the sample results can be
generalized to the population
- Study complex multivariate relationships -> statistical modeling
o In research we are often interested in relationships between several variables
o People differ in their scores on those variables. Science aims to explain as much
of these differences as possible
o E.g.: to what extent do years of education predict a healthy lifestyle, controlling
for income differences?
o Statistical modeling can help to uncover such complex relations
• Not further noted in the lectures
Measurement levels
In the social sciences we often collect quantitative data using questionnaires
Four types of data, known as measurement levels
o Nominal
o Ordinal
o Interval
o Ratio
- They differ in how refined or exact the measurement is.
o Nominal is the lowest and ratio is the highest level
o Measuring at a lower level is often easier but less informative
o The higher the level, the more information you can gather, so the aim is to
measure at the highest level possible
Nominal level
- Express different unordered categories or groups
- Nominal variables classify cases into two or more categories. Categories must be
exhaustive (all possibilities should be covered) and mutually exclusive (i.e. every case
fits into one category and one category only)
- E.g. marital status (married, single, relationship, not specified otherwise)
Ordinal level
- Numbers express different ordered categories (less/more)
- Clear order with each step upwards, but still categories
- Ordinal variables express more or less of a quantity, but the difference between pairs of
categories is not necessarily the same in quantity
- There should be a logical order (e.g. not logical: never, occasionally, daily, often)
- E.g. smoking intensity (never, at least 1 per month, at least 1 per day, 5 or more per day)
Interval level
- Numbers express differences in quantity using a common unit with equal intervals
between neighboring data points, but there is no true zero point
- E.g. IQ test score (the difference between an IQ of 70 and 80 is comparable to the difference
between 100 and 110; both span 10 units)
- E.g. temperature (if Monday = 30, Tuesday = 25 and Wednesday = 15 degrees, then we can say
that the temperature drop between Tuesday and Wednesday is twice as large as the drop
between Monday and Tuesday. But a temperature of 0 degrees is not an absence of
temperature, so there is no true zero point)
Ratio level
- Numbers express differences in quantity on a common unit and have a natural zero point
- E.g. length, weight or income (a length, weight or income of 0 can be meaningful)
- Ratio comparisons ("twice as much") are possible at this level; they are not possible at the
interval level (6 degrees is not twice as hot as 3 degrees and someone with an IQ of 120 is
not twice as smart as someone with an IQ of 60).
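The difference between interval-level and ratio-level comparisons can be checked numerically. The sketch below is only an illustration: it assumes the temperatures in the examples are in degrees Celsius (the notes do not state the unit) and uses a hypothetical helper `c_to_f`. A ratio of two differences survives a change of unit, while a ratio of two raw values does not.

```python
def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit (hypothetical helper)."""
    return celsius * 9 / 5 + 32

# Temperatures from the interval-level example; Celsius is an assumption.
monday, tuesday, wednesday = 30, 25, 15

# Ratio of two differences: meaningful at interval level.
drop_ratio_c = (tuesday - wednesday) / (monday - tuesday)
drop_ratio_f = (c_to_f(tuesday) - c_to_f(wednesday)) / (c_to_f(monday) - c_to_f(tuesday))

# Ratio of two raw values: depends on where the arbitrary zero point lies.
value_ratio_c = 6 / 3
value_ratio_f = c_to_f(6) / c_to_f(3)

print(drop_ratio_c, drop_ratio_f)    # both 2.0
print(value_ratio_c, value_ratio_f)  # 2.0 in Celsius, roughly 1.14 in Fahrenheit
```

Because "twice as hot" changes with the unit, it is not a meaningful statement for interval data.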
Categorical/discrete variables = nominal and ordinal measurement level
Scale/continuous variables = interval and ratio measurement level
Measurement levels
- Both interval and ratio-level data are referred to as scale data
o The idea is simple: all variables that are not nominal or ordinal are treated as
scale-level variables. SPSS distinguishes between nominal, ordinal and scale.
Measurement   Different    Ordering   Differences expressed   Natural zero
level         categories              in common unit          point
Nominal       Yes          No         No                      No
Ordinal       Yes          Yes        No                      No
Interval      Yes          Yes        Yes                     No
Ratio         Yes          Yes        Yes                     Yes
Categorical/discrete variables: nominal and ordinal (orange in the original table)
Scale/continuous variables: interval and ratio (blue in the original table)
- Measurement level is a property of the measurement values, it is not an intrinsic
property of the thing you are measuring
o E.g. you cannot say that ‘intelligence’ has interval level. Intelligence can be
measured at different levels depending on the measurement instrument
• Nominal: variable indicating someone’s intelligence type
(musical/mathematical)
• Ordinal: variable indicating the highest completed education (e.g.
primary school)
• Interval: score resulting from an IQ test
• Ratio: skull circumference in centimeters
Measurement level and statistical analysis
Measurement levels determine the kind of statistics and statistical analyses you can use
meaningfully.
- E.g. the mean of a nominal variable is meaningless (e.g. the average eye color). For the
analyses you should always respect the measurement levels of the variables you will use
in statistical analyses.
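A small sketch of why the mean of a nominal variable is meaningless. The data and both numeric codings are made up for illustration; the point is only that the "average eye color" changes with an arbitrary choice of codes, while the mode does not.

```python
from statistics import mean, mode

eye_colors = ["brown", "blue", "brown", "green"]  # made-up illustration data

# Two equally arbitrary numeric codings of the same categories.
coding_a = {"brown": 1, "blue": 2, "green": 3}
coding_b = {"brown": 3, "blue": 1, "green": 2}

mean_a = mean(coding_a[c] for c in eye_colors)
mean_b = mean(coding_b[c] for c in eye_colors)
most_common = mode(eye_colors)

print(mean_a, mean_b)  # 1.75 and 2.25: the result depends on the coding
print(most_common)     # brown: the mode is the same under any coding
```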
Many of the commonly used statistical techniques assume scale data.
- E.g. attitude towards a governmental policy on a scale ranging from 1 (= completely
against) to 7 (= completely in favor)
Data inspection
Every analysis starts with data inspection (getting to know your data): its goal is to get a clear
picture of the data by examining one variable at a time (univariate), or pairs of variables
(bivariate)
- In general, we want to inspect
o Central tendency: what are the most common values?
o Variability: how large are the differences between the subjects? Are there
extreme values in the sample?
o Bivariate association: for each pair of variables, do they
associate/covary/correlate (do low/high values on variable A go together with
low/high values on variable B)?
- To accomplish the goal, we use
o Visual data inspection (graphs)
o Numerical data inspection (statistics)
o Which statistics and graphs
Three common graph types -> know for the exam
- Bar charts = for nominal and ordinal data, could be in counts or percentages
- Histogram = scale data (interval and ratio data), has a common scale
- Scatterplot = scale data, 2+ variables, bivariate data (e.g. height, reading score)
[Figures: bar chart, histogram, scatterplot]
Normal distribution
- Mathematical distribution -> Gauss curve (black line on histogram)
- Symmetrical distribution: looks the same on both sides
- The farther you go from the center toward the tails, the lower the curve gets. It never
reaches zero.
- E.g. IQ score, length, birthweight
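For reference, the Gauss curve can be written in closed form. The symbols below are assumptions introduced here, not notation from the lecture: $\mu$ is the center (mean) of the distribution and $\sigma$ its standard deviation.

```latex
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
```

The squared term makes the curve symmetric around $\mu$, and because the exponential is positive for every $x$, the curve gets ever closer to zero in the tails without reaching it.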
Numerical data inspection
Three common statistical approaches
- Frequency tables = how often do particular scores occur
- Central tendencies = what’s the center of scores in my data
- Variability measures = how much do people differ in my data
Frequency table
- Numerical variable
- Percent = Frequency / Total sample size (N)
- Valid Percent = Frequency / (Total sample size (N) – missings)
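The two percentages can be sketched in a few lines of Python. The responses are made up, `None` stands in for a missing value, and the results are multiplied by 100 so they read as percentages; none of these choices come from the lecture.

```python
from collections import Counter

# Made-up responses; None marks a missing value.
responses = ["yes", "no", "yes", None, "yes", "no", None, "yes"]

n_total = len(responses)                       # total sample size N
valid = [r for r in responses if r is not None]
n_valid = len(valid)                           # N minus missings

counts = Counter(valid)
table = {
    value: {
        "frequency": freq,
        "percent": 100 * freq / n_total,        # relative to all cases
        "valid_percent": 100 * freq / n_valid,  # relative to non-missing cases
    }
    for value, freq in counts.items()
}

for value, row in table.items():
    print(value, row)
```

With 2 of the 8 cases missing, "yes" occurs 4 times: 50% of all cases, but about 66.7% of the valid cases.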
Crosstable
- 2 variables
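A crosstable counts how often each combination of values on two variables occurs. A minimal sketch, using made-up paired observations and hypothetical variable names (smoker, driver's license):

```python
from collections import Counter

# Made-up paired observations: (smoker, has_license).
cases = [("yes", "yes"), ("no", "yes"), ("no", "no"), ("yes", "yes"), ("no", "yes")]

cell_counts = Counter(cases)
row_values = sorted({smoker for smoker, _ in cases})
col_values = sorted({lic for _, lic in cases})

# Nested dict: crosstab[row value][column value] -> number of cases in that cell.
crosstab = {
    r: {c: cell_counts[(r, c)] for c in col_values}
    for r in row_values
}

for r in row_values:
    print(r, crosstab[r])
```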
Central tendencies
- Mode = the score that is observed most frequently
o E.g. 1: [3,4,4,5,5,5] -> mode is 5
o For nominal, ordinal or scale data
- Median = the score that separates the higher half of data from the lower half
o You need to order the scores and then find the median
o E.g. 1: (N is odd): [5, 6, 7, 8, 9] -> median is 7
o E.g. 2: (N is even): [5, 6, 8, 9] -> median is 7
• Arithmetic mean of the two middle values 6 and 8
o For ordinal or scale data that are not normally distributed
- Mean = M = $\frac{\sum X}{N}$ = $\frac{\text{sum of all scores}}{\text{total number of scores}}$
o E.g. 1: [2, 3, 10] -> mean is 15/3 = 5
o For scale data that are (approximately) normally distributed
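The worked examples for mode, median and mean can be reproduced with Python's standard `statistics` module. The helper `median_by_hand` is a hypothetical name added here only to make the "order the scores first" step explicit.

```python
from statistics import mean, median, mode

def median_by_hand(scores):
    """Median via explicit sorting (hypothetical helper for illustration)."""
    ordered = sorted(scores)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd N: the middle score
    return (ordered[mid - 1] + ordered[mid]) / 2  # even N: mean of the two middle scores

mode_value = mode([3, 4, 4, 5, 5, 5])
median_odd = median([5, 6, 7, 8, 9])
median_even = median([5, 6, 8, 9])
mean_value = mean([2, 3, 10])

print(mode_value)                       # 5
print(median_odd, median_even)          # 7 and 7.0
print(mean_value)                       # 5
print(median_by_hand([9, 5, 7, 6, 8]))  # 7, input order does not matter
```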
Normal and skewed distribution
Normal distribution