Summary articles Crime and Safety
Week 47: Introduction: Criminology and Safety
Methodological Quality Standards for Evaluation Research
(Farrington)
Introduction to the Campbell Approach
The preferred approach of the Campbell Collaboration Crime and Justice
Group is not for a reviewer to attempt to review all evaluation studies on a
particular topic, however poor their methodology, but rather to include
only the best studies in systematic reviews.
- Campbell was clearly one of the leaders of the tradition of field
experiments and quasi-experimentation.
- However, this policy requires the specification of generally accepted,
explicit, and transparent criteria for determining what are the best
studies on a particular topic, which in turn requires the development
of methodological quality standards for evaluation research.
- However, not everyone agrees with the Campbell approach. The
main challenge to it in the United Kingdom has come from Pawson
and Tilley (1997), who have developed “realistic evaluation” as a
competitor.
- They argue that the Campbell tradition of experimental and quasi-
experimental evaluation research has “failed” because of its
emphasis on “what works.” Instead, evaluation research
should primarily be concerned with testing theories, especially about
linkages between contexts, mechanisms, and outcomes.
Goal of this article
This article is an attempt to make progress in developing methodological
quality standards.
- What are the features of an evaluation study with high
methodological quality? In trying to specify these for criminology
and the social and behavioral sciences, the most relevant work,
appropriately enough, is by Donald Campbell and his colleagues.
This article, then, has three main aims:
1. to review criteria of methodological quality in evaluation research,
2. to review methodological quality scales and to decide what type of
scale might be useful in assisting reviewers in making inclusion and
exclusion decisions for systematic reviews, and
3. to consider the validity of Pawson and Tilley’s (1997) challenge to
the Campbell approach.
Methodological Quality Criteria
Methodological quality depends on four criteria: statistical conclusion
validity, internal validity, construct validity, and external validity.
- This validity typology “has always been the central hallmark of
Campbell’s work over the years”
- “Validity” refers to the correctness of inferences about cause and
effect
From the time of John Stuart Mill, the main criteria for establishing a causal
relationship have been that:
(1) the cause precedes the effect,
(2) the cause is related to the effect, and
(3) other plausible alternative explanations of the effect can be excluded.
The main aim of the Campbell validity typology is to identify plausible
alternative explanations (threats to valid causal inference) so that
researchers can anticipate likely criticisms and design evaluation studies
to eliminate them.
- If threats to valid causal inference cannot be ruled out in the design,
they should at least be measured and their importance estimated.
Statistical Conclusion Validity
Statistical conclusion validity = concerned with whether the presumed
cause (the intervention) and the presumed effect (the outcome) are
related.
- Measures of effect size and their associated confidence intervals should
be calculated.
- Statistical significance (the probability of obtaining the observed effect
size if the null hypothesis of no relationship were true) should also be
calculated, but in many ways, it is less important than the effect size.
- This is because a statistically significant result could indicate a large
effect in a small sample or a small effect in a large sample.
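A quick numerical illustration of this point (the group means, sample sizes, and
scipy usage below are our own hypothetical example, not Farrington's):

```python
# Hypothetical numbers: the same t statistic (and a similar p-value) can
# come from a large effect in a small sample or a small effect in a large
# sample, which is why effect size matters more than significance alone.
from scipy.stats import ttest_ind_from_stats

# Large effect (Cohen's d = 1.0) in small samples (n = 15 per group).
large_d_small_n = ttest_ind_from_stats(mean1=11.0, std1=1.0, nobs1=15,
                                       mean2=10.0, std2=1.0, nobs2=15)

# Small effect (Cohen's d = 0.1) in large samples (n = 1,500 per group).
small_d_large_n = ttest_ind_from_stats(mean1=10.1, std1=1.0, nobs1=1500,
                                       mean2=10.0, std2=1.0, nobs2=1500)

print(large_d_small_n)  # t ~ 2.74, p ~ .011
print(small_d_large_n)  # t ~ 2.74, p ~ .006
```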
The main threats to statistical conclusion validity are insufficient
statistical power to detect the effect (e.g., because of small sample size)
and the use of inappropriate statistical techniques (e.g., where the data
violate the underlying assumptions of a statistical test). Other threats to
statistical conclusion validity include the use of many statistical tests (in a
so-called fishing expedition for significant results) and the heterogeneity of
the experimental units (e.g., the people or areas in experimental and
control conditions). The more variability there is in the units, the harder it
will be to detect any effect of the intervention.
Statistical power = refers to the probability of correctly rejecting the
null hypothesis when it is false.
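As an illustration of this definition, the sketch below estimates power by
simulation for a hypothetical two-group trial with a true standardized effect
of d = 0.3 (all numbers are invented for illustration):

```python
import numpy as np
from scipy.stats import ttest_ind

def simulated_power(n_per_group, d=0.3, alpha=0.05, reps=5000, seed=1):
    """Share of replications in which a true effect of size d is detected."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(reps):
        control = rng.normal(0.0, 1.0, n_per_group)
        treated = rng.normal(d, 1.0, n_per_group)   # a true effect exists
        if ttest_ind(treated, control).pvalue < alpha:
            rejections += 1                         # null correctly rejected
    return rejections / reps

print(simulated_power(n_per_group=50))   # ~0.32: badly underpowered
print(simulated_power(n_per_group=300))  # ~0.96: adequately powered
```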
Internal Validity
Internal validity = refers to the correctness of the inference that the
intervention really did cause a change in the outcome, and it
has generally been regarded as the most important type of validity
- In investigating this question, some kind of control condition is essential
to estimate what would have happened to the experimental units (e.g.,
people or areas) if the intervention had not been applied to them—termed
the “counterfactual inference.”
- Experimental control is usually better than statistical control.
The main threats to internal validity have been identified often but do not
seem to be uniformly well known (Shadish, Cook, and Campbell 2002, 55):
1. Selection: the effect reflects preexisting differences between
experimental and control conditions.
2. History: the effect is caused by some event occurring at the same
time as the intervention.
3. Maturation: the effect reflects a continuation of preexisting trends,
for example, in normal human development.
4. Instrumentation: the effect is caused by a change in the method
of measuring the outcome.
5. Testing: the pretest measurement causes a change in the posttest
measure.
6. Regression to the mean: where an intervention is implemented
on units with unusually high scores (e.g., areas with high crime
rates), natural fluctuation will cause a decrease in these scores on
the posttest, which may be mistakenly interpreted as an effect of
the intervention. The opposite (an increase) happens when
interventions are applied to low-crime areas or low-scoring people
(see the simulation sketch after this list).
7. Differential attrition: the effect is caused by differential loss of
units (e.g., people) from experimental compared to control
conditions.
8. Causal order: it is unclear whether the intervention preceded the
outcome.
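The sketch below simulates threat 6 with made-up area-level crime figures:
selecting the highest-scoring areas guarantees that their average falls at the
posttest even when nothing is done to them.

```python
import numpy as np

rng = np.random.default_rng(42)
stable_level = rng.normal(100, 10, 1000)          # each area's true crime level
pretest = stable_level + rng.normal(0, 15, 1000)  # observed = level + noise
posttest = stable_level + rng.normal(0, 15, 1000) # fresh noise, no treatment

hot_spots = pretest > np.percentile(pretest, 90)  # 'intervene' in the worst 10%
print(pretest[hot_spots].mean())   # ~132: selected for extreme scores
print(posttest[hot_spots].mean())  # ~110: falls back toward the mean untreated
```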
In principle, a randomized experiment has the highest possible internal
validity because it can rule out all these threats, although in practice,
differential attrition may still be problematic.
- Randomization is the only method of assignment that controls for
unknown and unmeasured confounders as well as those that are known
and measured.
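A minimal sketch of what random assignment looks like in practice (the
participant ids and seed are hypothetical); because every unit has the same
chance of ending up in each condition, both measured and unmeasured
confounders are balanced in expectation:

```python
import random

participants = list(range(200))   # hypothetical participant ids
rng = random.Random(2024)         # fixed seed: reproducible and auditable
rng.shuffle(participants)

treatment = participants[:100]    # first half -> intervention
control = participants[100:]      # second half -> control condition
```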
Construct Validity
Construct validity = refers to the adequacy of the operational definition
and measurement of the theoretical constructs that underlie the
intervention and the outcome
Threats:
- The main threats to construct validity center on the extent to which the
intervention succeeded in changing what it was intended to change (e.g.,
how far there was treatment fidelity or implementation failure) and on the
validity and reliability of outcome measures (e.g., how adequately police-
recorded crime rates reflect true crime rates)
- Other threats to construct validity include those arising from a
participant’s knowledge of the intervention and problems of contamination
of treatment (e.g., where the control group receives elements of the
intervention). To counter the Hawthorne effect, it is acknowledged in
medicine that double-blind trials are needed, in which neither doctors nor
patients know who is receiving the active treatment.
External Validity
External validity = refers to the generalizability of causal relationships
across different persons, places, times, and operational definitions of
interventions and outcomes (e.g., from a demonstration project to the
routine large-scale application of an intervention).
- External validity can be established more convincingly in systematic
reviews and meta-analyses of numerous evaluation studies.
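As an illustration of the pooling arithmetic behind such meta-analyses, the
sketch below combines invented per-study effect sizes using inverse-variance
(fixed-effect) weights:

```python
import math

effects = [0.40, 0.15, 0.30, 0.22]    # hypothetical per-study effect sizes (d)
variances = [0.04, 0.01, 0.02, 0.03]  # hypothetical sampling variances

weights = [1.0 / v for v in variances]            # precise studies count more
pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
se = math.sqrt(1.0 / sum(weights))                # standard error of pooled d

print(f"pooled d = {pooled:.2f}, "
      f"95% CI = ({pooled - 1.96 * se:.2f}, {pooled + 1.96 * se:.2f})")
# pooled d = 0.23, 95% CI = (0.09, 0.36)
```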
Threats:
- The main threats to external validity listed by Shadish, Cook, and
Campbell (2002, 87) consist of interactions of causal relationships (effect
sizes) with types of persons, settings, interventions, and outcomes. A key
issue is whether the effect size varies according to whether those who
carried out the research had some kind of stake in the results (e.g., if a
project is funded by a government agency, the agency may be
embarrassed if the evaluation shows no effect of its highly trumpeted
intervention).
Descriptive Validity
Descriptive validity = refers to the adequacy of the presentation of key
features of an evaluation in a research report.
- As mentioned, systematic reviews can be carried out satisfactorily only if
the original evaluation reports document key data on issues such as the
number of participants and the effect size:
A list of minimum elements to be included in an evaluation report
would include at least the following:
1. Design of the study: how were experimental units allocated to
experimental or control conditions?
2. Characteristics of experimental units and settings (e.g., age and
gender of individuals, sociodemographic features of areas).
3. Sample sizes and attrition rates.
4. Causal hypotheses to be tested and theories from which they are
derived.
5. The operational definition and detailed description of the
intervention (including its intensity and duration).
6. Implementation details and program delivery personnel.
7. Description of what treatment the control condition received.
8. The operational definition and measurement of the outcome
before and after the intervention.
9. The reliability and validity of outcome measures.
10. The follow-up period after the intervention.
11. Effect size, confidence intervals, statistical significance, and
statistical power.