Formulating causal questions and principled statistical answers
Els Goetghebeur*1 | Saskia le Cessie2 | Bianca De Stavola3 | Erica Moodie4 | Ingeborg Waernbaum5 | on behalf of the topic group Causal Inference (TG7) of the STRATOS initiative

1 Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Ghent, Belgium
2 Department of Clinical Epidemiology / Biomedical Data Sciences, Leiden University Medical Center, Leiden, The Netherlands
3 Great Ormond Street Institute of Child Health, University College London, London, U.K.
4 Division of Biostatistics, McGill University, Montreal, Canada
5 Department of Statistics, Uppsala University, Uppsala, Sweden

Correspondence
* Els Goetghebeur, Email: els.goetghebeur@ugent.be

Summary
Although review papers on causal inference methods are now available, there is a lack of introductory overviews of what they can deliver and of the guiding criteria for choosing one particular method. This tutorial gives an overview for situations where an exposure of interest is set at a chosen baseline (‘point exposure’) and the target outcome arises at a later time point. We first phrase relevant causal questions and make a case for being specific about the possible exposure levels involved and the populations for which the question is relevant. Using the potential outcomes framework, we describe principled definitions of causal effects and of estimation approaches, classified according to whether they invoke the no unmeasured confounding assumption (including outcome regression and propensity score-based methods) or an instrumental variable with added assumptions. We discuss challenges and potential pitfalls, and illustrate application using a ‘simulation learner’ that mimics the effect of various breastfeeding interventions on a child’s later development. This involves a typical simulation component with generated exposure, covariate, and outcome data that mimic those from an observational or randomised intervention study. The simulation learner further generates various (linked) exposure types with a set of possible values per observation unit, from which observed as well as potential outcome data are generated. It thus provides true values of several causal effects. R code for data generation and analysis is available on www.ofcaus.org, where SAS and Stata code for analysis is also provided.

KEYWORDS:
Causation; Instrumental variable; Inverse probability weighting; Matching; Potential outcomes; Propensity score.
1 INTRODUCTION
The literature on causal inference methods and their applications is expanding at an extraordinary rate. In the field of health
research, this is fuelled by opportunities found in the rise of electronic health records and the revived aims of evidence-based
precision medicine. One wishes to learn from rich data sources how different exposure levels causally affect expected outcomes
in specific population strata so as to inform treatment decisions. While an abundance of machine learning techniques can handle
electronic health records, they too need to integrate fundamental principles of causal inference to address causal questions.1
Neither the mere abundance of data nor the use of a more flexible model paves the road from association to causation.
Experimental studies have the great advantage that treatment assignment is randomised. A simple comparison of outcomes between randomised arms then yields an intention-to-treat effect as a robust causal effect measure. However, non-experimental or observational data remain necessary for several reasons. 1) Randomised controlled trials (RCTs) tend to be conducted in rather selected populations, to reduce costs and for ethical reasons. 2) We may seek to learn about the effect of treatments actually received in these trials, beyond the pragmatic effect of treatment assigned; this calls for an exploration of compliance with the assignment and hence for follow-up exposure data, i.e. non-randomised components of treatment received. 3) In many situations, (treatment) decisions must be taken even in the absence of RCT evidence. 4) A wealth of patient data is being gathered in disease registries and other electronic patient records; these often contain more variables, larger sample sizes, and greater population coverage than are typically available in an RCT setting. These needs and opportunities push scientists to seek causal answers in observational settings with larger and less selective populations, with longer follow-up, and with a broader range of exposures and outcome types (including adverse events).
Statistical causal inference has made great progress over the last quarter century, deriving new estimators for well-defined
estimands using new tools such as directed acyclic graphs (DAGs) and structural models for potential outcomes.2, 3, 4
However, research papers – both theoretical and applied – tend to start from a question that is already formalised, and published conclusions are often described in vague causal terms that lack a clear specification of the target of estimation. Typically, when this target is specified, i.e. when there is a well-defined estimand, a range of techniques can yield (asymptotically) unbiased answers under
a specific set of assumptions. Several overview papers and tutorials have been published in this field. They are mostly focused,
however, on the properties of one particular technique without addressing the topic in its generality. Yet in our experience, much
confusion still exists about what exactly is being estimated, for what purpose, by which technique, and under what plausible
assumptions. Here, we aim to start from the beginning, considering the most commonly defined causal estimands, the assump-
tions needed to interpret them meaningfully for various specifications of the exposure variable and the levels at which we might
intervene to achieve different outcomes. In this way, we offer guidance on understanding what questions can be answered using
various principled estimation approaches while invoking sensibly structured assumptions.
We illustrate concepts and techniques referring to a case study exemplified by simulated data, inspired by the Promotion of
Breastfeeding Intervention Trial (PROBIT),5 a large RCT in which mother-infant pairs across 31 Belarusian maternity hospitals
were randomised to receive either standard care or an offer to follow a breastfeeding encouragement programme. Aims of the
study were to investigate the effect of the programme and breastfeeding on a child’s later development. We use simulated data
to examine weight achieved at age 3 months as the outcome of interest in relation to a set of exposures defined starting from
the intervention and several of its downstream factors. Our simulation goes beyond mimicking the ‘observed world’ by also
simulating for every study participant how different exposure strategies would lead to different potential responses. We call this
the simulation learner PROBITsim and refer to the setting as the Breastfeeding Encouragement Programme (BEP) example.
Source code for implementation is available on www.ofcaus.org.
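To fix ideas, a minimal R sketch of such a data-generating mechanism follows. This is our illustration, not the actual PROBITsim code: all variable names and parameter values are hypothetical. Both potential outcomes are generated for every unit, so the true causal effect is known by construction.

  set.seed(2020)                                   # illustrative sketch only
  n  <- 5000
  x  <- rbinom(n, 1, 0.4)                          # baseline covariate (a confounder)
  a  <- rbinom(n, 1, plogis(-0.5 + 1.2 * x))       # exposure uptake depends on x
  y0 <- rnorm(n, mean = 5800 + 150 * x, sd = 300)  # potential weight (g) if unexposed
  y1 <- y0 + 250                                   # potential weight (g) if exposed
  y  <- ifelse(a == 1, y1, y0)                     # observed outcome: Y = Y(A)
  dat <- data.frame(x, a, y, y0, y1)
  mean(dat$y1 - dat$y0)                            # true average causal effect: 250

An analyst would only ever observe (x, a, y); the columns y0 and y1 are retained here because a simulation learner uses them to benchmark estimators against the truth.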
Our aim here is to give practical statisticians a compact but principled and rigorous operational basis for applied causal
inference for the effect of point (i.e. baseline) exposures in a prospective study. We build up concepts, terminology, and notation to
express the question of interest and define the targeted causal parameter. In Section 2, we lay out the steps to take when conducting this inference, referring to key elements of the data structure and various levels of possible exposure to treatment. Section 3 presents the potential outcomes framework with its underlying assumptions, and formalises causal effects of interest. In Section 4, we describe PROBITsim, our simulation learner. We then derive various estimation approaches under the no unmeasured confounding assumption and under the instrumental variable assumption in Section 5. We explain how the approaches can be implemented for different types of exposures, and apply the methods to the simulation learner in Section 6. We end with an overview that highlights the overlap and specificity of the methods, as well as their performance in the context of PROBITsim and more generally. R code for data generation, R, SAS and Stata code for analysis and reporting, and slides that accompany this material and apply the methods to a second case study can be found on www.ofcaus.org.
2 FROM SCIENTIFIC QUESTIONS TO CAUSAL PARAMETERS
Causal questions ask what would happen to outcome Y, had the exposure A been different from what is observed. To formalise this, we will use the concept of potential outcomes6, 7 that captures the thought process of setting the treatment to values a ∈ 𝒜, the set of possible treatment values, without changing any pre-existing covariates or characteristics of the individual. Let Y(a) be the potential outcome that would occur if the exposure were set to take the value a, with the notation (a) indicating the action of
setting A to a. In what follows, we will refer to A interchangeably as an ‘exposure’ or a ‘treatment’. Since individual-level causal effects can never be observed, we focus on expected causal contrasts in certain populations. In the BEP example there are several linked definitions of treatment; these include ‘offering a BEP’, ‘following a BEP’, ‘starting breastfeeding’ or ‘following breastfeeding for 3 full months’. Each of them may require a decision about switching the treatment on or off. Ideally, this decision is informed by what outcome to expect following either choice.
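As a toy illustration of this notation (our own sketch with hypothetical numbers), the following R snippet encodes potential outcomes for a binary exposure and shows why individual-level effects Y(1) - Y(0) are never observed: each unit reveals only the potential outcome matching the exposure actually received.

  pot <- data.frame(id = 1:4,
                    A  = c(1, 0, 1, 0),
                    Y1 = c(6.1, 5.4, 6.3, 5.9),  # Y(1): weight (kg) if exposed
                    Y0 = c(5.8, 5.2, 5.7, 5.9))  # Y(0): weight (kg) if unexposed
  pot$Y <- ifelse(pot$A == 1, pot$Y1, pot$Y0)    # observed outcome Y = Y(A)
  pot$Y1[pot$A == 0] <- NA                       # the counterfactual column is
  pot$Y0[pot$A == 1] <- NA                       # missing in any real data set
  pot

This missingness is precisely why we focus on expected causal contrasts in certain populations rather than on individual effects.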
Causal contrasts should reflect the research context. Hence, in this example one could be interested in evaluating the effectiveness of the programme in the total population or in certain sub-populations. However, for some sub-populations the intervention may not be suitable, and assessing causal effects in those sub-populations would then not be useful.
Consider the following question: “Does a breastfeeding intervention, such as the one implemented in the PROBIT trial,
increase babies’ weight at three months?” Despite its simplicity, empirical evaluation of this question involves its translation
into meaningful quantities to be estimated. This requires several intermediate steps:
1. Define the treatment and its relevant levels/values corresponding to the scientific question of the study.
2. Define the outcome that corresponds to the scientific questions under study.
3. Define the population(s) of interest.
4. Formalise the potential outcomes, one for each level of the treatment that the study population could have possibly
experienced.
5. Specify the target causal effect in terms of a parameter, i.e. the estimand, as a (summary) contrast between the potential
outcome distributions.
6. State the assumptions validating the causal effect estimation from the available data.
7. Estimate the target causal effect.
8. Evaluate the validity of the assumptions and perform sensitivity analyses as needed.
Explicitly formulating the decision problem one aims to solve, or the hypothetical target trial one would ideally like to conduct,8 may guide the steps outlined above. In the following we expand on steps 1-5, before introducing the simulation learner in Section 4 and discussing steps 6-8 in Section 5.
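To make steps 5, 7, and 8 concrete, the following R lines (reusing the illustrative data frame dat generated in our sketch in Section 1; not part of the formal analysis) contrast the true estimand with a naive estimate that ignores confounding:

  true_ate <- mean(dat$y1 - dat$y0)           # step 5: the estimand, the average
                                              # treatment effect E[Y(1)] - E[Y(0)],
                                              # known here by construction
  naive    <- mean(dat$y[dat$a == 1]) -       # step 7: a naive estimator that
              mean(dat$y[dat$a == 0])         # ignores the confounder x
  c(truth = true_ate, naive = naive)          # step 8: the gap flags confounding bias

Here the naive contrast exceeds the truth because x raises both the chance of exposure and the outcome; Section 5 presents principled estimators that remove such bias under explicitly stated assumptions.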
2.1 Treatments
Opinions in the causal inference literature differ on how broad the definition of a treatment may be. Some say that the treatment should be manipulable, like giving a drug or providing a breastfeeding encouragement programme.9 Here, we take a more liberal position, which would also include, for example, genetic factors or even (biological) sex as treatments. Whichever the philosophy, the levels of the treatments to be compared need a clear definition, as discussed below.10
Treatment definitions are by necessity driven by the context in which the study is conducted and by the available data. The causal target may thus differ depending on whether it concerns, for instance, a policy implementation or a new drug registration, and on whether the data come from an RCT or from administrative sources. In the BEP example we may wish to define the causal effect of a breastfeeding intervention on the babies’ weight at three months. Several alternative specifications of a ‘breastfeeding treatment’ are possible. Below we list a few, which are interconnected and represent different types of treatment decisions:
• A1 : (randomised) treatment prescription: e.g. an encouragement programme was offered to pregnant women.
• A2 : uptake of the intervention: e.g. the woman participated in the programme (when offered).
• A3 : uptake of the target of intervention: e.g. the mother started breastfeeding.
• A4 : completion of the target of intervention: e.g. the mother started breastfeeding and continued for three months.
Each of these treatment definitions Ak, k = 1, ..., 4, refers to a particular breastfeeding event taking place (or not). A public health authority will be more interested in A1 because it can only decide whether or not to offer the BEP; an individual mother’s interest will lie in the effects of A2, A3 and A4, because she decides whether to participate in the programme and whether to start and maintain breastfeeding. For any one of these, several possible causal contrasts may be of interest and estimable; see Section 2.6.
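The logical links between these treatment definitions can be made explicit. The following R fragment (our own hypothetical coding, not the PROBITsim variables) records the four indicators for one mother-infant pair and checks the nesting constraints: completing three months of breastfeeding requires starting it, and programme participation presupposes an offer, whereas a mother may start breastfeeding without having followed the programme.

  pair <- data.frame(A1 = 1,     # BEP offered (randomised)
                     A2 = 1,     # mother participated in the programme
                     A3 = 1,     # breastfeeding was started
                     A4 = 0)     # breastfeeding not continued for 3 full months
  stopifnot(pair$A4 <= pair$A3,  # completing 3 months requires starting
            pair$A2 <= pair$A1)  # programme uptake requires an offer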
It is worth noting that these various definitions are not all clear-cut. For example, while A4 = 1 may be most specific in what
it indicates, A4 = 0 represents a whole range of durations of breastfeeding: from none to ‘almost 3 months’. In the same vein,