Notes on lectures ‘Missing Data theory and Causal Effects’
Lecture 1: Introduction to missing values
What is dark data? Data is something observed, but dark data is concealed from us. They are
real, but we don’t see them for one reason or another. Because of that we are at risk of
misunderstanding, of drawing incorrect conclusions, and of making poor decisions.
Examples of dark data: the video of airplanes and the Challenger space shuttle. They illustrate
the big impact of dark data on functioning.
These days we all generate data, for example in the supermarket. Do we really have missing
data?
Yes, because you want to make inferences, such as predicting future scenarios. For these types
of questions, you need different data than the data that happen to be collected. Data that
would be relevant can therefore be missing.
Can we discriminate “no” from missing? If a questionnaire is not well designed, there can be
two reasons why a respondent did not respond to a question: the answer is “no”, or the answer
is missing. You then do not know if something is missing or not.
There are different dark data types.
- DD-Type 1: Data We Know Are Missing. For example, item-missings within a
questionnaire. We know which data are missing: we know who we sent the
questionnaire to and what questions we asked.
- DD-Type 2: Data We Don’t Know are Missing. Could be more dangerous, because we
might think that we know something when we do not. It is hard to say if it is
missing or not.
- DD-Type 3: Choosing Just Some Cases. Mechanism that creates missing/dark data.
- DD-Type 4: Self-Selection. Mechanism that creates missing/dark data.
- DD-Type 5: Missing What Matters. What happens if you change something?
- DD-Type 6: Data Which Might Have Been.
- DD-Type 7: Changes with Time. For example, with the seasons.
- DD-Type 8: Definitions of Data.
- DD-Type 9: Summaries of Data.
- DD-Type 10: Measurement Error and Uncertainty.
- DD-Type 11: Feedback and gaming.
- DD-Type 12: Information Asymmetry. When one party knows more than another.
- DD-Type 13: Intentionally Darkened Data. For example, in an experiment, or in a
questionnaire with modules. Under the control of the researcher.
- DD-Type 14: Fabricated and Synthetic Data. It can be used for the benefit or at the
cost of others.
- DD-Type 15: Extrapolating beyond Your Data.
What are missing values? A missing value is a value that is not observed. The values do exist
in theory, but we are unable to see them. There are many possible reasons. In this course we
will look for reasons that occur when conducting scientific research, for example non-
response.
We have different types of non-response. In the literature two types: Unit non-response and
item non-response. For unit non-response there is no observed response at all for that case.
For item non-response, we have some responses missing for a case (but not all).
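The distinction between unit and item non-response can be sketched in a few lines of Python. This is only an illustration: the questionnaire data are hypothetical, `None` is assumed to mark a missing answer, and `classify` is a made-up helper name.

```python
# Hypothetical questionnaire: one dict per sampled person, None marks a missing answer.
responses = [
    {"age": 34, "weight": 71.5, "income": 2800},    # everything observed
    {"age": 51, "weight": None, "income": 3100},    # some answers missing
    {"age": None, "weight": None, "income": None},  # no answers at all
]


def classify(case):
    """Label one case by how much of it is missing."""
    n_missing = sum(value is None for value in case.values())
    if n_missing == 0:
        return "complete"
    if n_missing == len(case):
        return "unit non-response"
    return "item non-response"


labels = [classify(case) for case in responses]
print(labels)
```

Note that in real data a unit non-respondent often does not appear in the file at all, so this check only works if you know who was in the sample.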
You can classify missing values in three groups.
Unintentional. The decision of the respondent, the missing value should have been observed.
Intentional. Under the control of the investigator, the missing value should not have been
observed.
Deductive missings. Missing values whose true value can be deduced from the observed data.
You can calculate this from other variables, so this is not really a missing value. We wouldn’t
consider it missing, but you may need to account for the relationships in the data.
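A small sketch of a deductive missing, assuming a hypothetical record in which BMI is missing but weight (kg) and height (m) are observed. The variable names and the helper `deduce_bmi` are invented for this illustration; the point is only that the value follows from the other variables.

```python
def deduce_bmi(record):
    """Fill in BMI when it is missing but weight (kg) and height (m) are observed."""
    can_deduce = (
        record["bmi"] is None
        and record["weight"] is not None
        and record["height"] is not None
    )
    if can_deduce:
        # BMI is defined as weight divided by height squared.
        return {**record, "bmi": record["weight"] / record["height"] ** 2}
    return record  # nothing to deduce, or not enough information


record = {"weight": 80.0, "height": 2.0, "bmi": None}
filled = deduce_bmi(record)
print(filled["bmi"])  # 20.0
```

If weight or height is missing as well, the BMI cannot be deduced and stays missing.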
Combining the causes and the types of non-response:
                    Intentional        Unintentional
Unit nonresponse    Sampling           Refusal
                                       Self-selection
Item nonresponse    Branching          Skip question
                    Matrix sampling    Coding error
You get different effects combining the forms:
Sampling. If you sample from a population, we have an intentional unit-nonresponse, because
some people are in the sample and others are not.
Branching. The intentional skipping of questions that do not apply.
Matrix sampling. You don’t ask everything to all your respondents, but you hand out different
modules.
Refusal. I don’t want to fill out the response. This will bias the results in a certain way.
Self-selection. I want to have my voice heard, so I am going to fill out this questionnaire. This
will bias the results in a certain way.
Skip-question. A question that the respondent does not want to answer.
Coding error.
Terminology
Complete data = Observed data + Unobserved data; all the data
Incomplete data = Observed data; just the part that is observed of the complete data
Missing data = Unobserved data; we haven’t seen the missing data
Complete cases = subset of rows in the observed data without missing values; for example
respondents without item-nonresponse. It depends on how many variables you observe, with
more variables there is more chance of missingness.
Complete variables = subset of columns in the observed data without missing values; these
are the variables that do not contain missing values
We use blue (observed) and red (unobserved data; missing data) to indicate the complete data.
But you do not always see the missing data, so the unobserved data.
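To make the terms complete cases and complete variables concrete, here is a minimal sketch on a hypothetical dataset (a list of rows, with `None` assumed to mark a missing value).

```python
# Hypothetical incomplete data: None marks an unobserved value.
rows = [
    {"id": 1, "age": 25, "weight": 60.0},
    {"id": 2, "age": None, "weight": 82.5},
    {"id": 3, "age": 41, "weight": 77.0},
    {"id": 4, "age": 38, "weight": 90.0},
]

# Complete cases: rows without any missing value.
complete_cases = [
    row for row in rows if all(value is not None for value in row.values())
]

# Complete variables: columns without any missing value.
complete_variables = [
    name for name in rows[0] if all(row[name] is not None for row in rows)
]

print([row["id"] for row in complete_cases])  # [1, 3, 4]
print(complete_variables)                     # ['id', 'weight']
```

One missing age removes a whole row from the complete cases, which is why the number of complete cases tends to drop as you measure more variables.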
There can be many reasons why values can be missing.
- Death
- Dropout
- Refusal
- Routing
- Matching: combining different datasets, but not all respondents are the same
- Too far away (e.g. deep space): you cannot see them
- Too small to observe (e.g. particles)
- Bad luck
It is important to know what the reason is for your missing data, because you need it to correct
for it. In general, think about the reasons why something is not observed/missing.
What if the important information is missing? If not all the necessary information is captured,
our inference may be wrong. This can be due to errors with respect to:
- Sampling. Does the sampling match the research goals?
- Coverage. Is the population you can actually reach (the sampling frame) the same as the
target population?
- Non-contact. Unable to reach respondent.
- Incompetence (interviewer/researcher). Researcher that is not competent enough at
measuring the data.
- Refusal. Respondent does not want to answer.
There are many reasons why the data that you do collect may be distorted. That is important
to know about your data. Ideally you know where your data come from and what types
of additional sources of error play on top of this.
We can say in some sense that everything is a missing data problem… Why? There is always
some source of information missing. So, data may be missing, assumptions are usually made,
some form of sampling is used, or a theoretical distribution is used for comparison.
Many things are missing data problems in disguise.
Example sampling: you have a research question about a population, but you are not going
to measure your whole population. Our observed data are in the sample. Our missing data are
in the rest of the population, so we are trying to generalize. The whole population is thus the
complete data.
Example experiment: with a treatment group and a control group that are mutually exclusive, you
do not have information on the outcome that would have occurred for the people that received
the treatment if they had received a placebo. You have missing data on what the result would
have been if you had had the placebo, or vice versa. This is a counterfactual way of thinking
about causality.
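The experiment example can be written out as a table of potential outcomes, in which every participant has exactly one observed and one missing outcome. The participants, the outcome values and the column names `y_treatment` and `y_placebo` are hypothetical.

```python
# Hypothetical trial: each participant is in exactly one group.
participants = [
    {"id": 1, "group": "treatment", "outcome": 7.0},
    {"id": 2, "group": "control", "outcome": 5.0},
    {"id": 3, "group": "treatment", "outcome": 6.5},
    {"id": 4, "group": "control", "outcome": 4.0},
]


def potential_outcomes(person):
    """Split the single observed outcome into two potential outcomes."""
    treated = person["group"] == "treatment"
    return {
        "id": person["id"],
        "y_treatment": person["outcome"] if treated else None,  # missing for controls
        "y_placebo": None if treated else person["outcome"],    # missing for treated
    }


table = [potential_outcomes(person) for person in participants]
for row in table:
    print(row)
```

The individual effect `y_treatment - y_placebo` can never be computed directly from this table, because one of the two values is always missing.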
Example matching: if you have two data sources with different respondents, you can stack
them. But the variables that were measured in only one source are unknown for the other
respondents, so all the relationships between the two datasets have to go via the shared
variables. So, there are missing blocks that occur.
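A sketch of how stacking two sources creates missing blocks. Both sources and their variables are hypothetical: they share `age`, but `income` is only measured in the first and `weight` only in the second.

```python
# Two hypothetical sources with different respondents and one shared variable (age).
source_a = [{"age": 30, "income": 2500}, {"age": 45, "income": 3200}]
source_b = [{"age": 28, "weight": 68.0}, {"age": 52, "weight": 81.0}]

# Collect every variable that occurs in either source.
all_rows = source_a + source_b
all_variables = sorted(set().union(*(row.keys() for row in all_rows)))

# Stack the rows; variables a source never measured become None.
stacked = [{name: row.get(name) for name in all_variables} for row in all_rows]

for row in stacked:
    print(row)
```

In the stacked data, `income` and `weight` are never observed together, so their relationship can only be studied through `age`.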
Once you have a missing data problem, what is the problem? For example, if you do not have
the weight of one respondent, the weight is incomplete. Because we have missing values, our
statistics are not defined. The problem is that we cannot calculate the mean, we do not know
how to take into account the missing value. A missing value is not zero (its true counterpart
may be, but we do not know that). So, we cannot fill in the missing value and therefore we
can only calculate the mean over the observed set, N minus the missings. However, it could be
the case that the weight of the missing person is different from all the other respondents.
In that case the estimate is systematically biased.
If we do not know the mean, we can also not calculate the variance or correlation with any
other variable. In conclusion, many statistics are not defined on incomplete data.
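The weight example can be tried out directly. In this sketch the missing weight is stored as NaN, and the "true" value of 120 kg is a made-up number that we would never see in practice; it is only there to show how far off the observed mean can be.

```python
import math
import statistics

# Hypothetical weights in kg; the last respondent's weight is missing (NaN).
weights = [62.0, 70.0, 75.0, float("nan")]

# The mean over all four values is not defined: NaN propagates.
naive_mean = sum(weights) / len(weights)

# The mean over the observed set (N minus the missings) can be calculated.
observed = [w for w in weights if not math.isnan(w)]
observed_mean = statistics.mean(observed)

# Assumption for illustration only: suppose the missing person weighs 120 kg.
full_mean = statistics.mean(observed + [120.0])

print(naive_mean)     # nan
print(observed_mean)  # 69.0
print(full_mean)      # 81.75
```

If the missing person is different from the others, as assumed here, the observed mean is systematically too low.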
What else is a problem? You have a lower response rate, so when analysing the data, you have
a smaller sample size and therefore lower statistical power. This is a lesser problem than the
bias that we get, but still a problem because we have more uncertainty about the things that we
are estimating. It is harder to find a significant difference when such a difference indeed exists.
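One way to see the extra uncertainty is the standard error of a mean, sd / sqrt(n), which grows when fewer cases are observed. The numbers below (a standard deviation of 10, 100 sampled cases, 60 observed cases) are assumptions for illustration.

```python
import math


def standard_error(sd, n):
    """Standard error of a sample mean with standard deviation sd and n cases."""
    return sd / math.sqrt(n)


sd = 10.0                                # assumed standard deviation
se_full = standard_error(sd, 100)        # everyone responds
se_observed = standard_error(sd, 60)     # only 60 of the 100 respond

print(se_full)                 # 1.0
print(round(se_observed, 2))   # 1.29
```

A larger standard error means wider confidence intervals, so a true difference is harder to detect.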
However, if you have another variable that correlates with the variable that contains the
missing value (for example age, which correlates with weight), you have a possible solution:
you can estimate (within a range) the possible value of the missing value. You have more
information that you can use. But you are still making assumptions about the relationship!!
How can we use that information then? What do the observed data tell us? Look at the
correlation in a scatterplot, then you can use X to say something about Y. But the relationship
itself is also uncertain: the model is not so good that it would be a perfect predictor.
There will always be uncertainty in any model, parts that you cannot explain in the model!
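A minimal sketch of using X to say something about Y: a straight line is fitted by least squares on the cases where both age and weight are observed, and then used to predict the missing weight. The data are hypothetical, and the straight-line relationship is an assumption, not something the data guarantee.

```python
# Hypothetical data: age is complete, one weight is missing (None).
ages = [20, 30, 40, 50, 60]
weights = [60.0, 66.0, None, 76.0, 80.0]

# Use only the cases where both variables are observed.
observed = [(x, y) for x, y in zip(ages, weights) if y is not None]
mean_x = sum(x for x, _ in observed) / len(observed)
mean_y = sum(y for _, y in observed) / len(observed)

# Least-squares slope and intercept of weight on age.
slope = sum((x - mean_x) * (y - mean_y) for x, y in observed) / sum(
    (x - mean_x) ** 2 for x, _ in observed
)
intercept = mean_y - slope * mean_x


def predict(age):
    """Most likely weight for a given age under the fitted line."""
    return intercept + slope * age


predicted_weight = predict(40)
residuals = [y - predict(x) for x, y in observed]

print(predicted_weight)  # the most likely value for the missing weight
print(residuals)         # not all zero: the line is not a perfect predictor
```

The residuals are the parts of the observed weights that the line does not explain, which is exactly the uncertainty that remains around the predicted value.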
What is the most likely value? Use the correlation matrix, but we are not sure… So our next
question is: How certain am I of the most likely value? If you are very certain (high certainty)
then we can impute (fill in) the missing value. But what to do when the certainty is low?
We can express the amount of certainty by a distribution. If you are very certain, the
distribution is very narrow (small variance). It can also be that our model is very bad; then
we have low certainty. If there is a lot of variance around the relationship between X and Y,
there is a very large range in which you can fill in Y. So, how certain we are depends on our
model and the information we have.
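Expressing certainty as a distribution can be sketched with `statistics.NormalDist` from the Python standard library. The predicted value of 75 kg, the two standard deviations and the choice of a normal distribution are all assumptions for this illustration.

```python
from statistics import NormalDist

predicted_weight = 75.0  # hypothetical most likely value from some model


def interval_95(dist):
    """Central 95% range of plausible values under the distribution."""
    return dist.inv_cdf(0.025), dist.inv_cdf(0.975)


# High certainty: narrow distribution. Low certainty: wide distribution.
confident = NormalDist(mu=predicted_weight, sigma=1.0)
uncertain = NormalDist(mu=predicted_weight, sigma=10.0)

low_c, high_c = interval_95(confident)
low_u, high_u = interval_95(uncertain)
print(round(low_c, 1), round(high_c, 1))
print(round(low_u, 1), round(high_u, 1))

# Possible fill-in values under low certainty: draws from the wide distribution.
draws = uncertain.samples(5, seed=1)
print([round(d, 1) for d in draws])
```

Both distributions have the same most likely value, but under low certainty the range of values you could plausibly fill in is ten times as wide.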