4. Missing Data
Note: if you need the exact equations, look them up on the slides, because they are hard to
reproduce accurately in Word. They should not matter much anyway, since the exams are open
book and test understanding rather than quoting exact formulas for calculating anything
(that’s what we have computers for, right?)
1. Everyone will have missing data problems
2. Missing data problems are the heart of statistics
Causes of missing data
● There can be all kinds of reasons why you have missing data. E.g.:
● Respondent skipped the item
● Data transmission/coding error
● Drop out in longitudinal research
● Refusal to cooperate
● ... and so on
Consequences of missing data
● If you have less data than planned, statistical power problems might arise
● The data analysis might be biased, for example:
○ Biased effect estimates
○ The sample is no longer representative
○ Confidence intervals and p-values may no longer be appropriate
Response indicator
Random variable Y with missing data (e.g. body weight)
Random variable X contains complete covariates (e.g. age)
Response indicator R:
● R = 1 if Y is observed
● R = 0 if Y is missing
● R is always complete!
● Using the response indicator, we can describe the missing data mechanism (see next)
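As a minimal sketch, the response indicator can be built directly from a variable with missing values. The data below are made up for illustration, and `None` is assumed to mark a missing value; the function name `response_indicator` is hypothetical.

```python
# Hypothetical toy data: body weight (Y, with missing values) and age (X, complete).
weight = [72.5, None, 81.0, None, 65.2, 90.1]
age = [34, 51, 29, 62, 45, 38]

def response_indicator(values):
    """Return R: 1 where a value is observed, 0 where it is missing."""
    return [0 if v is None else 1 for v in values]

R = response_indicator(weight)
n_missing = R.count(0)
prop_missing = n_missing / len(R)

print(R)          # one entry per respondent, never missing itself
print(n_missing)  # number of respondents with Y missing
```

Note that R has a value for every respondent, which is why it is always complete even though Y is not.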
Missing data mechanisms
Missing data mechanisms fall into three categories: MCAR, MAR, and NMAR. Each has its own
consequences and is explained below.
MCAR
● Missing Completely at Random
● Probability to be missing is not related to any factor, aka it is completely random
● P(R=0|Y,X) = P(R=0) → the chance to be missing does not depend on any specific thing
● Example: respondent accidentally skipped question.
MAR
● Missing at Random
● Probability to be missing depends on known factors
● P(R=0|Y,X) = P(R=0|X) → the chance to be missing depends on a variable that we are
also measuring in our data (therefore, we can account for it)
● Example: gender is always observed, and men have more missing data than women
NMAR
● Not Missing at Random (also written MNAR, Missing Not at Random)
● Probability to be missing depends on unobserved information: a factor that is not in our
data, or the missing value of Y itself. Because we cannot take it into account, we do not
know how the data came to be missing
● P(R=0|Y,X) does not simplify
● Example: People with high incomes have more missing data on a variable measuring
income than people with lower incomes.
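The three mechanisms can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the variables, the missingness probabilities, and the helper names (`draw_r`, `complete_case_mean`, `adjusted_mean`) are hypothetical, and the simple regression adjustment is chosen here for illustration, not taken from the course material. It compares the mean of Y among observed cases with a mean that is adjusted using the complete covariate X.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible
n = 20000

# X: complete covariate (think age); Y: outcome (think body weight) that rises with X.
x = [random.gauss(40, 10) for _ in range(n)]
y = [50 + 0.5 * xi + random.gauss(0, 5) for xi in x]
true_mean = statistics.fmean(y)

def draw_r(p_missing):
    """Response indicator: R = 0 with probability p_missing[i], otherwise R = 1."""
    return [0 if random.random() < p else 1 for p in p_missing]

def complete_case_mean(r):
    """Mean of Y using only the cases with R = 1."""
    return statistics.fmean([yi for yi, ri in zip(y, r) if ri == 1])

def adjusted_mean(r):
    """Fit Y = a0 + b*X on the observed cases, then average the predictions over all X."""
    xs = [xi for xi, ri in zip(x, r) if ri == 1]
    ys = [yi for yi, ri in zip(y, r) if ri == 1]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    b = sum((u - mx) * (v - my) for u, v in zip(xs, ys)) / sum((u - mx) ** 2 for u in xs)
    a0 = my - b * mx
    return statistics.fmean([a0 + b * xi for xi in x])

r_mcar = draw_r([0.3] * n)                              # P(R=0|Y,X) = P(R=0)
r_mar = draw_r([0.6 if xi > 40 else 0.1 for xi in x])   # P(R=0|Y,X) = P(R=0|X)
r_nmar = draw_r([0.6 if yi > 70 else 0.1 for yi in y])  # missingness depends on Y itself

results = {}
for name, r in [("MCAR", r_mcar), ("MAR", r_mar), ("NMAR", r_nmar)]:
    results[name] = (complete_case_mean(r), adjusted_mean(r))
    print(name, round(results[name][0], 2), round(results[name][1], 2))
print("true mean", round(true_mean, 2))
```

In this setup the complete-case mean should be close to the true mean only under MCAR. Under MAR it is off, but using X brings the estimate back, which is what "we can account for it" means. Under NMAR the adjusted estimate stays off as well, because the reason for missingness is not captured by anything we observed.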
Ignorable vs not ignorable missing data
● MAR (and MCAR as a special case of it) is ignorable, but NMAR is not ignorable.
● MCAR test: tests H0 that data are MCAR. However, if it is significant, it remains unknown
whether the data are NMAR or MAR
○ Usually you treat missing data as MAR, because it requires fewer assumptions than
MCAR and can still be handled with the observed data.
○ You can check whether missingness is related to other variables in the data set (by
seeing whether R depends on them). However, the data points can still be missing
because of variables that are not in the data set, or the relationship can be
confounded by other variables, so MAR itself cannot be fully verified.
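One simple way to see whether missingness is related to an observed variable is to compare that variable between respondents with R = 1 and R = 0. The sketch below uses made-up data and a hand-written Welch t statistic (the helper name `welch_t` is hypothetical); it is one possible check, not a formal MCAR test.

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t statistic for the difference in means of two samples."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.fmean(a) - statistics.fmean(b)) / se

# Hypothetical data: age (X, complete) and weight (Y, None = missing).
age = [23, 25, 31, 35, 38, 44, 52, 58, 61, 67]
weight = [60.1, 72.3, 68.0, 75.5, None, 80.2, None, None, 77.4, None]

age_observed = [x for x, y in zip(age, weight) if y is not None]  # cases with R = 1
age_missing = [x for x, y in zip(age, weight) if y is None]       # cases with R = 0

t = welch_t(age_missing, age_observed)
print(statistics.fmean(age_missing), statistics.fmean(age_observed), round(t, 2))
```

A clear difference in age between the two groups would speak against MCAR. It cannot tell MAR from NMAR, since missingness may still depend on something that was never measured.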
Strategies to deal with missing data
There are different ways to deal with missing data: prevention, simple methods, likelihood
methods (EM), and multiple imputation. Each is discussed below.
Prevention
● Prevention is always the best. For example, in Qualtrics you can make an item a forced
response, so people HAVE to answer before they are able to continue. That way you make
sure you do not have missing data and do not have to deal with it later on.
Simple methods
● Listwise deletion (complete-case analysis): as soon as someone is missing one data point,
they are excluded from the whole analysis
○ Advantages: simple (default in SPSS); correct standard errors and significance levels;
works in some special NMAR cases (Little, 1993; Vach, 1994)
○ Disadvantages: wasteful; same data but different n; only OK under MCAR, biased under
MAR and partly under NMAR
● Pairwise deletion (available-case analysis): you only leave a case out where its
information is actually missing, and you still use the rest of its data
○ Advantages: uses all available information
○ Disadvantages: only works under MCAR; computational problems (negative
variances, rank problems)
○ Avoid!
● Mean substitution: you replace the missing data points with the mean of the observed values
○ Avoid!
○ Biased under MAR, underestimates the variance, distorts the distribution
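The three simple methods can be sketched on a tiny made-up data set. Everything here is hypothetical (variable names, values, and the helper functions `listwise`, `pairwise`, `mean_substitute`), with `None` assumed to mark a missing value. The point is only to show how n changes between methods and how mean substitution shrinks the variance.

```python
import statistics

# Hypothetical toy data set; None marks a missing value.
rows = [
    {"age": 25, "weight": 70.0, "income": 30.0},
    {"age": 32, "weight": None, "income": 42.0},
    {"age": 41, "weight": 82.0, "income": None},
    {"age": 47, "weight": 77.0, "income": 55.0},
    {"age": 53, "weight": None, "income": 61.0},
    {"age": 60, "weight": 90.0, "income": 48.0},
]

def listwise(rows):
    """Keep only respondents with no missing value on any variable."""
    return [r for r in rows if all(v is not None for v in r.values())]

def pairwise(rows, a, b):
    """Keep respondents observed on both a and b, whatever else is missing."""
    return [(r[a], r[b]) for r in rows if r[a] is not None and r[b] is not None]

def mean_substitute(values):
    """Replace every missing value by the mean of the observed values."""
    m = statistics.fmean([v for v in values if v is not None])
    return [m if v is None else v for v in values]

n_listwise = len(listwise(rows))
n_pairwise = {
    pair: len(pairwise(rows, *pair))
    for pair in [("age", "weight"), ("age", "income"), ("weight", "income")]
}

weight = [r["weight"] for r in rows]
var_observed = statistics.variance([v for v in weight if v is not None])
var_imputed = statistics.variance(mean_substitute(weight))

print(n_listwise, n_pairwise)
print(var_observed > var_imputed)
```

Listwise deletion uses the same small n for everything, while pairwise deletion ends up with a different n for each pair of variables, which is where the computational problems come from. After mean substitution the imputed values sit exactly on the mean, so the spread of the variable is smaller than among the observed values.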