False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows
Presenting Anything as Significant – Joseph P. Simmons, Leif D. Nelson, Uri Simonsohn
Abstract
In this article, we accomplish two things. First, we show that despite empirical psychologists’ nominal
endorsement of a low rate of false-positive findings (≤ .05), flexibility in data collection, analysis, and
reporting dramatically increases actual false-positive rates. In many cases, a researcher is more likely
to falsely find evidence that an effect exists than to correctly find evidence that it does not. We
present computer simulations and a pair of actual experiments that demonstrate how unacceptably
easy it is to accumulate (and report) statistically significant evidence for a false hypothesis. Second,
we suggest a simple, low-cost, and straightforwardly effective disclosure-based solution to this
problem. The solution involves six concrete requirements for authors and four guidelines for
reviewers, all of which impose a minimal burden on the publication process.
Perhaps the most costly error is a false positive, the incorrect rejection of a null hypothesis. First,
once they appear in the literature, false positives are particularly persistent. Because null results
have many possible causes, failures to replicate previous findings are never conclusive. Furthermore,
because it is uncommon for prestigious journals to publish null findings or exact replications,
researchers have little incentive to even attempt them. Second, false positives waste resources: They
inspire investment in fruitless research programs and can lead to ineffective policy changes. Finally, a
field known for publishing false positives risks losing its credibility.
It is unacceptably easy to publish “statistically significant” evidence consistent with any hypothesis.
The culprit is a construct we refer to as researcher degrees of freedom. In the course of collecting and
analyzing data, researchers have many decisions to make.
It is rare, and sometimes impractical, for researchers to make all these decisions beforehand. Rather,
it is common (and accepted practice) for researchers to explore various analytic alternatives, to
search for a combination that yields “statistical significance,” and to then report only what “worked.”
The problem, of course, is that the likelihood of at least one (of many) analyses producing a falsely
positive finding at the 5% level is necessarily greater than 5%.
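To make the arithmetic concrete, consider the simplified case in which the candidate analyses are independent tests, each run at the 5% level under a true null hypothesis. The short sketch below is only an illustration of that idealized case (real analytic alternatives are correlated, so the exact inflation differs), but it shows how quickly the chance of at least one false positive grows.

```python
# Hypothetical back-of-the-envelope calculation: if k analyses were independent tests,
# each run at alpha = .05 under a true null, the chance that at least one comes out
# "significant" is 1 - (1 - alpha)**k. Real analytic alternatives are correlated, so
# this is only an idealized illustration of the inflation, not a figure from the article.
alpha = 0.05
for k in (1, 2, 3, 5, 10):
    p_any_false_positive = 1 - (1 - alpha) ** k
    print(f"k = {k:2d} analyses -> P(at least one false positive) = {p_any_false_positive:.3f}")
```

Even with only a handful of alternative analyses, the chance of at least one false positive rises well above the nominal 5% rate.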
This exploratory behavior is not the by-product of malicious intent, but rather the result of two
factors: (a) ambiguity in how best to make these decisions and (b) the researcher’s desire to find a
statistically significant result.
How Bad Can It Be? A Demonstration of Chronological Rejuvenation
To help illustrate the problem, we conducted two experiments designed to demonstrate something
false: that certain songs can change listeners’ age.
In Study 2, we sought to conceptually replicate and extend Study 1. Having demonstrated that
listening to a children’s song makes people feel older, Study 2 investigated whether listening to a
song about older age makes people actually younger.
Discussion
These two studies were conducted with real participants, employed legitimate statistical analyses,
and are reported truthfully. Nevertheless, they seem to support hypotheses that are unlikely (Study
1) or necessarily false (Study 2).
“How Bad Can It Be?” Simulations
Simulation of common researcher degrees of freedom
We used computer simulations of experimental data to estimate how researcher degrees of freedom
influence the probability of a false-positive result. These simulations assessed the impact of four
common degrees of freedom: flexibility in (a) choosing among dependent variables, (b) choosing
sample size, (c) using covariates, and (d) reporting subsets of experimental conditions. We also
investigated various combinations of these degrees of freedom. Using just these four common researcher degrees of freedom in combination, a researcher is more likely than not to falsely detect a significant effect.
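As a rough illustration of how such simulations work, the sketch below is not the authors' original code but a minimal stand-in: it combines just two of the degrees of freedom, choosing among two correlated dependent variables (or their average) and adding observations after an initial non-significant test, under a true null hypothesis. The correlation, cell sizes, and number of simulations are illustrative assumptions.

```python
# Minimal sketch (not the authors' original simulation code) of two researcher degrees
# of freedom under a true null: flexibility in choosing the dependent variable, and
# flexibility in sample size. All parameter values below are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N_SIMS, N_PER_CELL, N_ADDED, ALPHA = 10_000, 20, 10, 0.05
COV = [[1.0, 0.5], [0.5, 1.0]]          # two DVs correlated at r = .50, both with mean 0


def any_test_significant(treat, control):
    """Declare 'significance' if DV1, DV2, or their average yields p < ALPHA."""
    candidates = [(treat[:, 0], control[:, 0]),
                  (treat[:, 1], control[:, 1]),
                  (treat.mean(axis=1), control.mean(axis=1))]
    return any(stats.ttest_ind(a, b).pvalue < ALPHA for a, b in candidates)


false_positives = 0
for _ in range(N_SIMS):
    treat = rng.multivariate_normal([0, 0], COV, size=N_PER_CELL)
    control = rng.multivariate_normal([0, 0], COV, size=N_PER_CELL)
    significant = any_test_significant(treat, control)
    if not significant:
        # Flexibility in sample size: collect more observations and test again.
        treat = np.vstack([treat, rng.multivariate_normal([0, 0], COV, size=N_ADDED)])
        control = np.vstack([control, rng.multivariate_normal([0, 0], COV, size=N_ADDED)])
        significant = any_test_significant(treat, control)
    false_positives += significant

print(f"False-positive rate: {false_positives / N_SIMS:.3f} (nominal alpha = {ALPHA})")
```

Even this reduced combination lifts the false-positive rate well above the nominal 5%; adding the remaining degrees of freedom (covariates and selective reporting of conditions), as the full simulations do, raises it further still.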
A closer look at flexibility in sample size
Researchers often decide when to stop data collection on the basis of interim data analysis. Notably,
a recent survey of behavioral scientists found that approximately 70% admitted to having done so.
These simulations contradict the often-espoused yet erroneous intuition that if an effect is significant with a small sample size, then it would necessarily be significant with a larger one.
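The sketch below illustrates this kind of interim analysis under a true null, using one concrete stopping rule of the kind the article describes; the specific thresholds (start at 10 per condition, give up at 50) are illustrative assumptions rather than the article's exact values.

```python
# Minimal sketch of data "peeking" under a true null: start with 10 observations per
# condition, run a t test after every added observation, and stop as soon as p < .05
# or once 50 observations per condition are reached. Thresholds are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
N_SIMS, ALPHA, N_START, N_MAX = 10_000, 0.05, 10, 50

hits = 0
for _ in range(N_SIMS):
    a = list(rng.standard_normal(N_START))   # both conditions drawn from the same
    b = list(rng.standard_normal(N_START))   # distribution: there is no real effect
    while True:
        if stats.ttest_ind(a, b).pvalue < ALPHA:
            hits += 1                        # a false positive is recorded and reported
            break
        if len(a) >= N_MAX:
            break                            # researcher gives up at 50 per condition
        a.append(rng.standard_normal())      # otherwise add one observation per
        b.append(rng.standard_normal())      # condition and test again

print(f"False-positive rate with interim testing: {hits / N_SIMS:.3f} (nominal {ALPHA})")
```

Because the researcher gets many chances to cross the significance threshold, the realized false-positive rate under this rule substantially exceeds the nominal 5%.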
Solution
As a solution to the flexibility-ambiguity problem, we offer six requirements for authors and four guidelines for reviewers (Table 2). Our solution leaves the right and responsibility of identifying the most appropriate way to conduct research in the hands of researchers, requiring only that authors provide appropriately transparent descriptions of their methods so that reviewers and readers can make informed decisions regarding the credibility of their findings. We assume that the vast majority of researchers strive for honesty; this solution will not help in the unusual case of willful deception.
We also propose four guidelines for reviewers (see Table 2).
General Discussion
Criticisms
Criticism of our solution comes in two varieties: that it does not go far enough and that it goes too far.
Not far enough