4. Missing Data .1
Note: if you need the exact equations, look them up on the slides, because they cannot be
reproduced accurately in Word. They should not matter much anyway, since the exams are open
book and test understanding rather than quoting exact formulas for calculating anything (that’s
what we have computers for, right?)
1. Everyone will have missing data problems
2. Missing data problems are the heart of statistics
Causes of missing data
● There can be all kinds of reasons why you have missing data. E.g.:
● Respondent skipped the item
● Data transmission/coding error
● Drop out in longitudinal research
● Refusal to cooperate
● .. and so on
Consequences of missing data
● If you have less data than planned, statistical power problems might arise
● The data analysis might be biased, for example:
○ Biased effect estimates
○ Loss of representativeness of the sample
○ Confidence intervals and p-values that may no longer be appropriate
Response indicator
Random variable Y with missing data (e.g. body weight)
Random variable X contains complete covariates (e.g. age)
Response indicator
● R = 1 if Y is observed
● R = 0 if Y is missing
● R is always complete!
● Using the response indicator, we may be able to identify the missing data mechanism (see next section)
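As a minimal sketch of the response indicator, the snippet below builds R for a small made-up data set. The values and the variable names (age, weight) are assumptions for illustration only; None is used here to mark a missing value.

```python
# Hypothetical toy data: age (X) is complete, body weight (Y) has gaps.
age = [23, 35, 41, 52, 60]
weight = [70.5, None, 82.0, None, 77.3]   # None marks a missing value

# Response indicator: R = 1 if Y is observed, R = 0 if Y is missing.
R = [0 if y is None else 1 for y in weight]

print(R)                 # [1, 0, 1, 0, 1]
print(sum(R) / len(R))   # proportion of observed values: 0.6
```

Note that R has no gaps of its own, even though Y does: for every case we know whether Y was observed or not.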
Missing data mechanisms
Missing data can be classified into three mechanisms: MCAR, MAR, and MNAR (also written
NMAR). Each has its own consequences and is described below.
MCAR
● Missing Completely at Random
● Probability to be missing is not related to any factor, i.e. it is completely random
● P(R=0|Y,X) = P(R=0) → the chance to be missing depends neither on X nor on Y
● Example: respondent accidentally skipped question.
MAR
● Missing at Random
● Probability to be missing depends only on known (observed) factors
● P(R=0|Y,X) = P(R=0|X) → the chance to be missing depends on a variable that we also
measure in our data (therefore, we can account for it)
● Example: Gender always observed, and men have more missing data than women
MNAR
● Missing Not at Random (also written NMAR, Not Missing at Random)
● Probability to be missing depends on unobserved information: the missing value itself or a
factor that is not included in our data. We therefore cannot account for it, and we do not
know how the data came to be missing
● P(R=0|Y,X) does not simplify
● Example: People with high incomes have more missing data on a variable measuring
income than people with lower incomes.
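The three mechanisms can be made concrete with a small simulation. This is only a sketch under made-up assumptions: the distributions, the cut-offs (40 and 70), and the missingness probabilities are chosen for illustration, and the helper observed_mean is hypothetical. It compares the mean of Y among the observed cases with the true mean of Y.

```python
import random
import statistics

random.seed(1)
n = 20000

# X: complete covariate (e.g. age). Y: variable that will get missing
# values (e.g. body weight); in this toy setup Y depends on X.
x = [random.gauss(40, 10) for _ in range(n)]
y = [50 + 0.5 * xi + random.gauss(0, 5) for xi in x]

def observed_mean(values, p_missing):
    """Mean of the values that stay observed, given each case's P(R=0)."""
    kept = [v for v, p in zip(values, p_missing) if random.random() >= p]
    return statistics.mean(kept)

# MCAR: P(R=0|Y,X) = P(R=0), the same for every case.
p_mcar = [0.3] * n
# MAR: P(R=0|Y,X) = P(R=0|X), depends only on the observed covariate.
p_mar = [0.5 if xi > 40 else 0.1 for xi in x]
# MNAR: P(R=0|Y,X) depends on Y itself, the value that goes missing.
p_mnar = [0.5 if yi > 70 else 0.1 for yi in y]

true_mean = statistics.mean(y)
mean_mcar = observed_mean(y, p_mcar)
mean_mar = observed_mean(y, p_mar)
mean_mnar = observed_mean(y, p_mnar)

print("true mean of Y      :", round(true_mean, 2))
print("observed mean, MCAR :", round(mean_mcar, 2))
print("observed mean, MAR  :", round(mean_mar, 2))
print("observed mean, MNAR :", round(mean_mnar, 2))
```

In this setup the observed mean stays close to the true mean under MCAR and is too low under both MAR and MNAR. The difference between the last two is that under MAR the cause of the missingness (X) is in the data and can be used to correct the estimate, while under MNAR it is not.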
Ignorable vs not ignorable missing data
● MAR (and MCAR, as a special case of it) is treated as ignorable, but NMAR is not ignorable.
● MCAR test: tests H0 that the data are MCAR. However, if the test is significant, it remains
unknown whether the data are NMAR or MAR
○ Usually you treat missing data as MAR, because it requires the least assumptions
and is still testable.
○ You can check whether missingness is related to other variables in the data (by
seeing whether the two depend on each other). But those data points may still be
missing because of other variables that are not in the data set, or the relation
may be confounded by other variables.
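A rough sketch of such a check, using the gender example from above: compare the proportion of missing values between groups of an observed variable. The records and the helper missing_rate are made up for illustration, and this is a descriptive comparison only, not the formal MCAR test.

```python
# Hypothetical toy records: (gender, income); None marks a missing income.
records = [
    ("m", None), ("m", 3100), ("m", None), ("m", 2800),
    ("f", 2900), ("f", 3300), ("f", None), ("f", 3000),
]

def missing_rate(rows, group):
    """Share of missing incomes within one gender group."""
    values = [income for gender, income in rows if gender == group]
    return sum(v is None for v in values) / len(values)

rate_m = missing_rate(records, "m")
rate_f = missing_rate(records, "f")

print("missing rate, men  :", rate_m)   # 0.5
print("missing rate, women:", rate_f)   # 0.25
```

A clear difference between the groups speaks against MCAR. It does not tell you whether the data are MAR or NMAR, because missingness may still depend on something that is not observed.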
Strategies to deal with missing data
There are different ways to deal with missing data: prevention, simple methods, likelihood
methods (EM), and multiple imputation. Each is discussed below.
Prevention
● Prevention is always best. For example, in Qualtrics you can make an item a forced
response, so people HAVE to answer before they are able to continue. That way you do not
get missing data on that item and do not have to deal with it later on.
Simple methods
● Listwise deletion (complete-case analysis): as soon as a case is missing a single data
point, the whole case is excluded from the analysis
○ Advantages: simple (default in SPSS); correct standard errors and significance levels;
works in some special NMAR cases (Little, 1993; Vach, 1994)
○ Disadvantages: wasteful; different analyses on the same data end up with a different n;
OK under MCAR, but biased under MAR and partly under NMAR
● Pairwise deletion (available-case analysis): only the missing values themselves are left
out; every calculation still uses all cases that have the data it needs
○ Advantages: uses all available information
○ Disadvantages: only works under MCAR; computational problems (negative
variances, rank problems)
○ Avoid!
● Mean substitution: each missing data point is replaced with the sample mean of that variable
○ Avoid!
○ Biased under MAR, underestimates the variance, disturbs the distribution
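The simple methods above can be sketched on a tiny made-up data set. Everything here (the values, the variable names, the use of None for a missing value) is an assumption for illustration; it is not how SPSS implements these options.

```python
import statistics

# Hypothetical toy data; None marks a missing value.
rows = [
    {"age": 23, "weight": 60.0},
    {"age": 35, "weight": None},
    {"age": 41, "weight": 80.0},
    {"age": None, "weight": 70.0},
    {"age": 60, "weight": 90.0},
    {"age": 48, "weight": None},
]

# Listwise deletion: keep only cases without any missing value.
complete = [r for r in rows if None not in r.values()]

# Available cases for one variable: every observed weight is used,
# including the case whose age is missing.
weights_obs = [r["weight"] for r in rows if r["weight"] is not None]

# Mean substitution: fill each missing weight with the observed mean.
mean_w = statistics.mean(weights_obs)
weights_imputed = [mean_w if r["weight"] is None else r["weight"] for r in rows]

var_obs = statistics.variance(weights_obs)
var_imp = statistics.variance(weights_imputed)

print("cases after listwise deletion:", len(complete))     # 3 of 6
print("available weights            :", len(weights_obs))  # 4 of 6
print("variance, observed weights   :", round(var_obs, 2))
print("variance, mean-substituted   :", round(var_imp, 2))
```

Listwise deletion keeps 3 of the 6 cases, while 4 weights are available when each variable is looked at on its own (the "same data, different n" problem). The variance after mean substitution is smaller than the variance of the observed weights, because the filled-in values all sit exactly on the mean.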