Problem 4 - Good research practices
Spellman, Gilbert & Corker (2017) - Open science: What, why and how
The “Open Science” movement
Science is about evidence: observing, measuring, collecting, and analyzing evidence. The evidence,
methods of collecting and analyzing that evidence, and the conclusions reached should be open to
the scrutiny and evaluation of other scientists. In this way, scientific knowledge can be self-correcting.
Beginning in about 2010, questions arose about the integrity of the experimental practices of
psychological science: concerns about non-replicability, post hoc theorizing, inappropriate use of
statistics, lack of access to materials and data, file drawers, and even fraud.
→ These things had bothered psychologists in the past, but the size, range and timing of the recent
events were particularly unsettling.
The current open science movement grew out of the concerns about the integrity of psychological
science.
Open science is a term for some of the proposed reforms to make scientific practices more
transparent and to increase the availability of information that allows others to evaluate and use the
research.
Many of the proposed reforms of the open science movement are not new, but for various reasons
they seem to be catching on now, whereas they have not done so in the past.
Major motivations for concern
Several different types of events provoked the field to action.
- Of major importance was the publication of two articles:
- Bem (2011): purported to show evidence of precognition (accurately anticipating
future chance events). This was a serious research project.
- Simmons, Nelson, and Simonsohn (2011): showed evidence of something even
more “magical” (see Questionable research practices below: the study on Beatles
music and age).
- The growing list of high-visibility studies that could not be replicated. This problem provided a
name for the widespread concern about the robustness of scientific research → the
replication crisis.
- The revelation of fraud committed by several well-known psychologists.
Questionable research practices
- “False-Positive Psychology: Undisclosed flexibility in data collection and analysis allows
presenting anything as significant” (Simmons et al., 2011).
- This paper illustrates that because of the leeway built into design, analysis and
reporting of studies, researchers can ‘discover’ just about anything.
- Two experiments:
- Undergraduate participants were randomly assigned to listen to either a
children’s song or a control song. Afterwards they reported how old they felt
and their father’s age. Analysis (ANCOVA) revealed that participants felt
significantly older after listening to the children’s song.
- In the second experiment, different participants listened to the Beatles’ song
“When I’m Sixty-Four” or the same control song from experiment 1.
Afterwards they reported their birthdate and their father’s age. An ANCOVA
revealed that participants who had listened to the Beatles’ song were
significantly younger than the other participants. Note: The result is not that
they felt younger, but that they were younger. Given the randomized
controlled design of the study, the conclusion has to be: Listening to the
Beatles’ song caused people to become younger.
- The moral of this paper was that this result could only be reached through
undisclosed flexibility (collecting unreported measures, choosing a sample size,
choosing which covariates to use in analysis and which test statistics to report,
changing hypotheses post hoc to fit the data, and other questionable techniques);
a small simulation of how such flexibility inflates false positives is sketched after
this list.
- Psychology’s approach for decades has been to advise HARKing (hypothesizing after the
results are known), and many of the practices described in Simmons et al. (2011) were not
only accepted but also encouraged by the field of psychology.
- Authors are also implored to explore their data, and at the same time, they are advised to
rewrite the history of that exploration for the sake of narrative.
- Many reviewers and editors have asked authors to drop nonsignificant manipulations and
measures from their manuscripts.
- O’Boyle et al. (2014): compared management dissertations to their corresponding
published journal articles and found that the ratio of supported to unsupported
hypotheses more than doubled, owing to dropping nonsignificant findings, adding or
changing hypotheses, and altering data.
- In a survey, many researchers revealed that they knew of people who had engaged in some
of these practices. Though falsifying data was rated as neither prevalent nor defensible, a few
practices were prevalent (failing to report all dependent measures, reporting only studies that
confirmed the hypothesis, data peeking), and many were viewed as defensible.
- John et al. reinstantiated a label for such behaviors: Questionable research practices (QRPs).
- The use of QRPs could explain large-scale failures to replicate.
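A minimal simulation sketch of the flexibility argument (not from Simmons et al.; the sample sizes, number of outcome measures, and peeking schedule below are arbitrary assumptions chosen for illustration): when a researcher tests several unreported outcome measures and keeps adding participants until something reaches significance, the false-positive rate climbs far above the nominal 5% even though no true effect exists.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def flexible_study(n_outcomes=3, start_n=20, peek_every=10, max_n=60, alpha=0.05):
    """One simulated two-group study with a true effect of exactly zero,
    analyzed with two QRPs: multiple unreported outcomes and data peeking."""
    a = rng.normal(size=(max_n, n_outcomes))  # group A scores on several outcome measures
    b = rng.normal(size=(max_n, n_outcomes))  # group B scores; all true effects are zero
    n = start_n
    while n <= max_n:
        # Test every outcome and keep whichever p-value is smallest
        p_min = min(stats.ttest_ind(a[:n, k], b[:n, k]).pvalue for k in range(n_outcomes))
        if p_min < alpha:
            return True       # report a "significant" finding
        n += peek_every       # otherwise collect more participants and test again
    return False              # honest null result

rate = sum(flexible_study() for _ in range(2000)) / 2000
print(f"False-positive rate with these QRPs: {rate:.2f} (nominal alpha = 0.05)")

Dropping either source of flexibility (fixing the sample size in advance, or committing to a single outcome measure) pulls the rate back toward 5%, which parallels the disclosure-based remedies proposed by Simmons et al.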
Failures to replicate
Although there have always been failures to replicate, now many of them were occurring in multiple
labs, and there was increasing recognition that the failures were not isolated to single labs or lines of
research.
Many of these failures to replicate were experiments regarding social priming effects.
Publishing venues that made it possible to publish replication studies became available in the mid-2000s.
- But, when researchers couldn’t replicate the results of an earlier study, the blame often fell on
the researchers trying to do the replication. At the other extreme, the blame sometimes fell on
the original researchers with suspicions that they had failed to communicate all that was
necessary to run the study correctly, or that they had used QRPs or perpetrated some type of
fraud. And there was the third non-judgemental possibility that the subject populations or the
world had changed so much that the study would not have the same meaning to the present
participants.
To systematically examine the extent and causes of failures to replicate, the early 2010s ushered in
multiple coordinated, open replication efforts.
- One critique of these replication attempts was that most were done by single labs and thus
failures to replicate could be attributable to lab-specific idiosyncrasies or simply statistical
flukes.
As a result, psychology began to combine resources to conduct highly powered replications across
multiple lab settings and populations.
- The Many Labs project brought together dozens of diverse lab sites to replicate a small set of
short, simple studies administered by computer or paper survey. To ensure study quality and
mitigate any researcher bias, study methods and materials were peer-reviewed prior to data
collection and, when possible, vetted by the original authors.
- In Many Labs 1, 10 of 13 effects were replicated as measured by 99% confidence
intervals. These results suggested that many classic findings were reliable and that
some effects were indeed reproducible. These studies were not randomly selected,
so this high rate of replicability might not generalize to the field of psychology more
broadly.
- Replication results from a follow-up Many Labs project assessing more contemporary effects
were more discouraging. Several other pre-approved, large-scale replications, called
Registered Replication Reports (RRRs), produced similarly disappointing results.
Perhaps the most well-publicized large-scale replication project was the Reproducibility Project:
Psychology.
The Reproducibility Project aimed to estimate reproducibility more generally by coordinating single
labs that each replicated one of 100 semi-randomly selected findings. To decrease selection bias, the
replicated studies were chosen from articles published in 2008 in three top journals. To make the
process transparent, methods and analysis plans were pre-registered prior to data collection. To
increase fidelity, the replication teams worked with original authors whenever possible to recreate the
original studies as closely as they were able.
- Whereas 97 of the original findings were statistically significant, only 35 of the replicated
findings were significant.
Regardless of the outcomes, the process of trying to do these replications was quite informative.
First, researchers had to acknowledge the difficulty of doing direct replications. Much is left out of the
method descriptions in a journal article, and much is left to guesswork.
Second, reproducing psychologically equivalent procedures and measures across populations and
time sometimes proved challenging.
Third, the field had to acknowledge that there was no obvious way to interpret what it meant to have a
successful replication.
Although widely noted failures to replicate prompted the name “replication crisis”, the subsequent
rise in large-scale replications also garnered substantial attention and gave rise to a more optimistic
name for the focus on improving science → the replication revolution.
Fraud
The most shocking event, at least for social psychologists, was the revelation of fraud by Diederik
Stapel.
At about the same time, investigations were continuing into misconduct by the Harvard
cognitive-evolutionary psychologist Marc Hauser. And in 2011 and 2012, close analysis by Uri
Simonsohn led to university findings of misconduct and then retractions by the social psychologists
Lawrence Sanna and Dirk Smeesters.
Other motivations for concern
Lack of access to full methods
Researchers wanting to replicate or better understand others’ studies were frustrated by their inability
to obtain the exact materials and detailed methods used by the original researchers.
All empirical articles contain method sections, but these were often short and incomplete, particularly
with the strict adherence to word limits in the short-form articles that had grown in prevalence.
Some scientific fields make a distinction between replication (rerunning a study with the same protocol
to gather new data to see whether the same result is obtained) and reproduction (reanalyzing the
original data to see whether the same result is obtained).
Psychologists have been mainly concerned with replication, but access to original data can be useful.
Prior to the advent of modern computing technologies, psychological researchers did not routinely
archive their data. Records were maintained for some minimal amount of time in personal file cabinets
and boxes, after which they were discarded.
With computing power and online cloud storage increasing, and with more attempts at working
cumulatively, researchers were becoming more frustrated by their inability to obtain the data from
published studies for reanalysis or inclusion in meta-analysis, despite publication guidelines stating
that authors should be willing to share data for such purposes.
Lack of access to analytic procedures and code
The rise of point-and-click statistical software meant that more researchers than ever had access to
advanced analytic techniques. Such software does not force researchers to save the commands they
used to complete their data analysis, and many psychologists failed to preserve the analysis code,
rendering exact reproduction of analyses and their results challenging even when using the same
data as input.
The file drawer problem
Rosenthal (1979): the file drawer problem refers to the fact that some research never ultimately
makes it into the published literature and instead languishes in researchers’ file drawers.
In addition to replications, null results and failures to support a hypothesis rarely appear in print.
Evidence suggests that researchers might be changing their hypotheses after the results are known to
convert null results into hypothesis-supporting results.
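A minimal sketch of the file drawer's consequences (assumed values for illustration: a small true standardized effect of d = 0.2, 20 participants per group, and the assumption that only p < .05 results ever leave the file drawer), showing how selective publication inflates the effect sizes that appear in print:

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_d, n = 0.2, 20                     # assumed true standardized effect and group size
all_effects, published = [], []

for _ in range(5000):                   # 5000 simulated studies of the same effect
    a = rng.normal(true_d, 1, n)        # treatment group
    b = rng.normal(0.0, 1, n)           # control group
    d = (a.mean() - b.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    all_effects.append(d)
    if stats.ttest_ind(a, b).pvalue < 0.05:
        published.append(d)             # only "significant" studies make it into print

print(f"True effect:                    {true_d:.2f}")
print(f"Mean effect, all studies:       {np.mean(all_effects):.2f}")
print(f"Mean effect, published studies: {np.mean(published):.2f}")

In this sketch, the unpublished null results sitting in the file drawer are exactly what a later meta-analysis would need in order to recover the true effect.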
Lack of access to publications
Researchers were also becoming frustrated by the fact that in a free-access cyberworld most
scientific publications are only available for a fee.
Discontent with reporting and use of standard statistics
There has long been discontent with the use of null hypothesis significance testing and with the
reporting of statistics in psychology journals.
This dissatisfaction intensified after the publication of Bem’s (2011) precognition paper (which also
relied on null hypothesis significance testing).
Summary and sequel
The concerns mentioned earlier were not independent of each other and many result from the
problem of misaligned incentives: what is good for being a successful scientist is not necessarily what
is good for science itself.
Scientific practices might therefore be difficult to change if institutions don’t change reward structures.
Psychologists thus realized that making science open would take efforts on the part of various types
of stakeholders, some of whom might be resistant to change.
Why the time is ripe for change
Most of the problems just described are not new to the field of psychology, but previous attempts to fix
them have failed.
What is different now that is allowing current reforms to take off and take hold?