Schafer, J. & Graham, J. – Missing Data: Our view of the State of the Art
Most data analysis procedures were not designed for missing data. Missingness is usually a nuisance, not
the main focus of inquiry. Most researchers resort to editing the data to lend an appearance
of completeness. Unfortunately, this can lead to biased, inefficient, and unreliable answers.
What Is a Missing Value?
Missing values are part of the more general concept of coarsened data, which includes
numbers that have been grouped, aggregated, rounded, censored, or truncated, resulting in
partial loss of information. Missing data are also closely related to latent variables, which are
unobservable quantities (e.g. intelligence) that are imperfectly measured by test or
questionnaire items.
Historical Development
Until the 1970s, missing values were handled primarily by editing. The formulation of the
EM (expectation-maximization) algorithm made it feasible to compute ML (maximum
likelihood) estimates in many missing-data problems. Rather than deleting or filling in incomplete
cases, ML treats the missing data as random variables to be removed from (i.e. integrated out of)
the likelihood function, as if they were never sampled.
Later the idea of MI (multiple imputation) was introduced, in which each missing value is
replaced with m>1 simulated values prior to analysis.
Goals and Criteria
A missing-value treatment cannot be properly evaluated apart from the modeling, estimation,
or testing procedure in which it is embedded. For example, mean substitution (replacing each
missing value of a variable with the average of the observed values) may accurately predict
missing data but distort estimated variances and correlations.
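The effect of mean substitution on variances can be illustrated with a small simulation. This is a minimal sketch, not taken from the article: the normal data, the 30% deletion rate, and all variable names are assumptions chosen for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical complete data: 1000 draws from a normal distribution.
complete = [random.gauss(50, 10) for _ in range(1000)]

# Delete roughly 30% of the values completely at random (None marks missing).
observed = [y if random.random() > 0.3 else None for y in complete]

# Mean substitution: replace each missing value with the observed mean.
present = [y for y in observed if y is not None]
obs_mean = statistics.fmean(present)
filled = [obs_mean if y is None else y for y in observed]

var_observed = statistics.variance(present)
var_filled = statistics.variance(filled)

print(f"variance of observed values:      {var_observed:.1f}")
print(f"variance after mean substitution: {var_filled:.1f}")
```

The filled-in values all sit exactly at the mean, so they add nothing to the sum of squared deviations while increasing the sample size. The variance computed from the filled-in data is therefore smaller than the variance of the observed values, even though the mean itself is unchanged.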
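Let Q be a population quantity to be estimated and ^Q an estimate of Q based on a sample of data. If
the procedure works well, ^Q will be close to Q. We thus want the bias, the difference between the
average of ^Q over repeated samples and Q, to be small, and the variance of ^Q to be small as well.
Bias and variance are often combined into the mean square error, the average of (^Q-Q)² over
repeated samples, which equals the squared bias plus the variance. But this does not yet describe
the measures of uncertainty: standard errors and confidence intervals should also be accurate
(e.g. a 95% interval should cover Q about 95% of the time).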
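How bias, variance, and mean square error relate can be checked numerically. The sketch below is an illustration under assumed settings: the true value Q, the sample size, and the deliberately biased estimator `shrunken_mean` are hypothetical and do not come from the article.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

Q = 10.0               # true population mean (assumed for the simulation)
n, reps = 20, 5000     # sample size and number of repeated samples

def shrunken_mean(sample):
    # A deliberately biased estimator: shrinks the sample mean toward zero.
    return 0.9 * statistics.fmean(sample)

# Draw repeated samples and compute the estimate ^Q on each one.
estimates = [
    shrunken_mean([random.gauss(Q, 3.0) for _ in range(n)])
    for _ in range(reps)
]

bias = statistics.fmean(estimates) - Q
variance = statistics.pvariance(estimates)  # spread of ^Q across samples
mse = statistics.fmean([(q - Q) ** 2 for q in estimates])

print(f"bias = {bias:.3f}, variance = {variance:.3f}, MSE = {mse:.3f}")
print(f"bias^2 + variance = {bias ** 2 + variance:.3f}")
```

Across the simulated samples, the mean square error matches the squared bias plus the variance up to floating-point error, which is why it serves as a single summary of both.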
When missing values occur for reasons beyond our control, we must make assumptions
about the processes that create them. These are usually untestable.
Finally, one should avoid tricks that apparently solve the missing-data problem but actually
redefine the parameters or the population.
Types and Patterns of Nonresponse
Unit nonresponse is when the entire data collection procedure fails (e.g. sampled person is
not at home). Item nonresponse is when partial data are available (e.g. sampled person does
not respond to certain items). In longitudinal studies, participants may be present at some
waves of data collection and absent at others, which is referred to as wave nonresponse.
Attrition/dropout is a special case of wave nonresponse in which one leaves the study
and does not return.
A univariate pattern is when missing values occur on an item Y, but a set of p other items
X1, X2, ..., Xp is completely observed (see figure 1a).
A monotone pattern is when items or item groups (Y1, Y2, ..., Yp) may be ordered in such a
way that if Yj is missing for a unit, then Yj+1, ..., Yp are missing as well (see figure 1b).
An arbitrary pattern is when any set of variables may be missing for any unit (see figure
1c).
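The three patterns can be told apart mechanically from which values are missing. The following is a minimal sketch, assuming a rectangular data set stored as a list of rows with `None` marking a missing value. The function names are hypothetical, and "univariate" is simplified here to mean a single incomplete column.

```python
def missing_sets(rows):
    """For each column, return the set of row indices where it is missing (None)."""
    n_cols = len(rows[0])
    return [
        {i for i, row in enumerate(rows) if row[j] is None}
        for j in range(n_cols)
    ]

def classify_pattern(rows):
    """Label the missingness pattern of a rectangular data set (None = missing)."""
    sets = [s for s in missing_sets(rows) if s]  # keep incomplete columns only
    if not sets:
        return "complete"
    if len(sets) == 1:
        return "univariate"
    # Monotone: the incomplete columns can be ordered so that their missing
    # sets are nested, i.e. a unit missing Yj is also missing every later item.
    sets.sort(key=len)
    if all(a <= b for a, b in zip(sets, sets[1:])):
        return "monotone"
    return "arbitrary"

univariate = [[1, 2, 3], [4, 5, None], [7, 8, None]]
monotone = [[1, 2, 3], [1, 2, None], [1, None, None]]
arbitrary = [[1, None, 3], [None, 2, 3], [1, 2, None]]

print(classify_pattern(univariate))  # univariate
print(classify_pattern(monotone))    # monotone
print(classify_pattern(arbitrary))   # arbitrary
```

Sorting the incomplete columns by their number of missing values is enough to find a monotone ordering when one exists, because nested sets are necessarily ordered by size.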
The Distribution of Missingness
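R is referred to as the missingness: a set of indicator variables recording which values are
observed and which are missing. The form of R depends on the complexity of the pattern. For a
univariate pattern, R is a single binary indicator per unit: R=1 if Y is observed and R=0 if
Y is missing. For an arbitrary pattern, R is a matrix of indicators with the same dimensions as
the data.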
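As an illustration, the missingness R can be derived directly from a data set by coding each value as 1 if it is observed and 0 if it is missing. This is a minimal sketch, assuming the data are stored as a list of rows with `None` marking a missing value. The function name `missingness` and the example values are hypothetical.

```python
def missingness(rows):
    """Return R with R[i][j] = 1 if the value is observed and 0 if it is missing."""
    return [[0 if value is None else 1 for value in row] for row in rows]

data = [
    [5.1, 3.0],
    [4.8, None],
    [None, None],
]
R = missingness(data)
print(R)  # [[1, 1], [1, 0], [0, 0]]
```

With a single incomplete item, R reduces to one indicator per unit. With several incomplete items it has one row per unit and one column per item, as here.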