Summary Statistics
Week 1
Statistics: “The study of how we describe and make inferences from data.” (Sirkin)
Ø An inference is “a conclusion reached on the basis of evidence and reasoning.”
Ø Distinction between descriptive & inferential statistics
Different levels of statistics:
1. Univariate (one variable)
2. Bivariate (two variables)
3. Multivariate (more variables)
Descriptive vs inferential statistics: with descriptive statistics one describes only a specific
sample. Inferential statistics is about what a sample says about the whole population.
Unit of analysis: the what or who that is being studied. Also: the unit that you will be able to
draw conclusions about. Typically, all units are the same type of “thing” in a single data set.
Variable: a measured property of each of the units of analysis.
Levels of measurement
- Nominal: group classifications where no meaningful ranking is possible (e.g., religion,
country)
- Ordinal (ORDinal): There is meaningful ranking/ ordering but the distance between
categories is unknown or not equal.
- Interval: similar to ordinal in that there is a meaningful ranking, but the distances
between values are also equal. However, the zero point is arbitrary: 0 does not mean ‘lack of’ the property
- Ratio: same as interval but zero is meaningful/ absolute zero point.
We always first need to know the level of measurement in order to know which statistical
techniques we may use for the given variables.
“A continuous variable is measured along a continuum, whereas a discrete variable is
measured in whole units or categories.” Continuous variables can take decimal values;
discrete variables cannot.
Measures of central tendency (CT): to (univariately) describe the distribution of variables on
different levels of measurement.
- The mean (interval/ratio): all values are added up and divided by n, which is the
number of observations in the sample: x̄ = Σxᵢ / n
Almost the same formula for the population mean: μ = Σxᵢ / N
Characteristics of the mean:
o Changing any score will change the mean
o Adding or removing a score will change the mean (unless that score is equal
to the mean)
o Adding, subtracting, multiplying, or dividing each score by a given value (the
same ‘constant’ value) causes the mean to change accordingly
o Sum of differences from the mean is zero: Σ(xᵢ − x̄) = 0
o Sum of squared differences from the mean is minimal
o Most useful for describing (more or less) normally distributed variables.
- The median (ordinal, interval/ratio): the median is the middle case when sorting all
cases based on their value. Equal amount of cases above and below the median.
Also: 50th percentile.
o The median is not as sensitive to outliers as the mean.
o Whenever n is an even number, the median is the mean value of the two
middle cases.
o Often used for interval/ratio variables that have skewed distributions.
- The mode (nominal, ordinal, interval/ratio): the mode is the category with the
largest amount of cases.
Measures of CT and distributions: in a normal distribution, the mean, median, and mode are
all the same. In a skewed distribution the curve is shifted to the left or right.
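A minimal sketch of the three measures of central tendency, using Python's standard `statistics` module (the score values below are invented for illustration):

```python
# Mean, median, and mode with Python's standard `statistics` module.
import statistics

scores = [2, 3, 3, 4, 5, 5, 5, 8]  # invented example data

mean = statistics.mean(scores)      # all values added up, divided by n
median = statistics.median(scores)  # middle case; mean of the two middle cases when n is even
mode = statistics.mode(scores)      # category with the largest number of cases

print(mean, median, mode)  # → 4.375 4.5 5
```

Note that n = 8 is even here, so the median (4.5) is the mean of the two middle cases (4 and 5), as described above.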
Week 2 Lecture
Measures of variability: measures of CT alone do not carry enough information to adequately
describe distributions of variables, so we need a second type of measure.
E.g., group 1 has 10 people aged 20 and 10 aged 60; group 2 has 10 people aged 39 and
10 aged 41. The means do not differ (both are 40), but the dispersion/variability does.
The range (ordinal, interval/ratio): distance between the highest and lowest score. It is always
reported together with the maximum & minimum scores and is sensitive to outliers.
The interquartile range (IQR) (ordinal, interval/ratio): based on ‘quartiles’ that split the data
into four groups of cases. The IQR is the distance between Q1 and Q3 and is insensitive
to outliers since it describes the middle half of the data.
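A sketch of both measures on invented data with one outlier; note that quartile conventions differ slightly across textbooks and software (Python's `statistics.quantiles` defaults to the ‘exclusive’ method):

```python
# Range vs interquartile range (IQR) on invented data that include
# one outlier (60): the range reacts to it, the IQR does not.
import statistics

ages = [20, 23, 25, 28, 30, 33, 35, 60]

value_range = max(ages) - min(ages)           # reported together with min & max
q1, q2, q3 = statistics.quantiles(ages, n=4)  # quartiles split the data into four groups
iqr = q3 - q1                                 # distance between Q1 and Q3

print(value_range, iqr)  # → 40 11.0
```

Replacing the 60 with, say, 36 would shrink the range considerably but leave the IQR almost unchanged, since the IQR only depends on the middle half of the cases.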
The variance (interval/ ratio): based on the Sum of Squares, i.e., the squared distance from
the mean. For the calculation of the variance, it matters whether we have the sample data
or the population data (typically we have sample data).
- Variance in a sample is expressed as: s² = Σ(xᵢ − x̄)² / (n − 1)
To calculate the sample variance s² of a given variable:
o For each case, we calculate the distance to the sample mean and square that
distance (removes possible minus sign)
o All those squared distances are then added up and divided by the number of
cases in the sample minus one (n-1)
- Variance in a population is expressed as: σ² = Σ(xᵢ − μ)² / N
(Greek μ = population mean)
To calculate the population variance σ² (sigma square) of a given variable:
o For each case, we calculate the distance to the population mean and square
that distance (removes possible minus sign)
o All those squared distances are then added up and divided by the number of
cases in the population (N)
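The two calculation recipes above can be sketched side by side on invented data; `statistics.variance` and `statistics.pvariance` implement the n − 1 and N versions, respectively:

```python
# Sample variance s² (divide by n - 1) vs population variance σ²
# (divide by N), following the steps above. Data are invented.
import statistics

x = [4, 8, 6, 5, 7]
n = len(x)
mean = sum(x) / n  # 6.0

# For each case: distance to the mean, squared (removes a possible minus sign)
squared_distances = [(xi - mean) ** 2 for xi in x]

s2 = sum(squared_distances) / (n - 1)  # sample variance → 2.5
sigma2 = sum(squared_distances) / n    # population variance → 2.0

# The standard library implements both versions:
assert s2 == statistics.variance(x)       # divides by n - 1
assert sigma2 == statistics.pvariance(x)  # divides by N
```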
Ø How can we interpret the value of the variance? (e.g., 4.67)
• We don’t, but: “everything is meaningful in comparison”
(i.e. when comparing variances across groups, we can make comparative
statements about more/less dispersion around the mean)
• For the purpose of interpretation, we calculate another measure of
variability: the standard deviation
Ø Why are there two different variance formulas for sample data / population data?
• We often use the sample variance as an ‘estimator’ for the population
variance (which is typically unknown)
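Returning to the two-group age example from the start of this week, a short sketch of how the standard deviation (the square root of the variance) expresses dispersion on the variable's own scale:

```python
# Two groups with equal means but very different dispersion.
# The standard deviation puts the spread back on the original
# scale of the variable (here: years of age).
import statistics

group1 = [20] * 10 + [60] * 10  # 10 people aged 20, 10 aged 60
group2 = [39] * 10 + [41] * 10  # 10 people aged 39, 10 aged 41

assert statistics.mean(group1) == statistics.mean(group2) == 40

sd1 = statistics.stdev(group1)  # sample standard deviation, s = √s² (about 20.5)
sd2 = statistics.stdev(group2)  # about 1.0
print(sd1 > sd2)                # group 1 is far more dispersed around the shared mean
```

This is the comparative use mentioned above: the values only become meaningful when compared, and the standard deviation makes that comparison interpretable in the variable's own units.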