Vox Future Perfect based its last newsletter on the blog post. (Copied in full below, since I can’t find a standalone page for it.)
Much ink has been spilled over the replication crisis in the last decade and a half, including here at Vox. Researchers have discovered, over and over, that lots of findings in fields like psychology, sociology, medicine, and economics don’t hold up when other researchers try to replicate them.
This conversation was most famously kicked off by John Ioannidis’s 2005 article “Why Most Published Research Findings Are False,” but since then many researchers have explored it from different angles. Why are research findings so often unreliable? Is the problem just that we test for “statistical significance” in a nuance-free way? Is it that null results (that is, when a study finds no detectable effects) are ignored while positive ones make it into journals?
Is it that social science researchers don’t have the statistics training to adjust for multiple comparisons and other common statistical mistakes? And now that we’ve realized the extent of errors in published papers, will it be easy to improve?
Some recent evidence has led me to a depressing conclusion: The causes of unreliable research findings are routine, well understood, predictable, and in principle pretty easy to avoid. And yet we’re not actually improving at all.
That’s the takeaway from a recent study that found that lay people do quite well at predicting which published research will or won’t replicate, and a fascinating blog post by Alvaro de Menard, a participant in DARPA’s replication markets project, a US military effort to design better tools for predicting which research will hold up.
The conclusion both of them reach about the replication crisis: It’s not very mysterious.
Just carefully reading a paper — even as a layperson without deep knowledge of the field — is sufficient to form a pretty accurate guess about whether the study will replicate.
Meanwhile, DARPA’s replication markets found that guessing which papers will hold up and which won’t is often just a matter of looking at whether the study makes any sense. Some red flags to look for: Did the researchers squeeze out a result barely below the significance threshold of p = 0.05? (A paper can often claim a “significant” result if this threshold is met, and many authors use statistical tricks to push their paper across that line.) Did they find no effects in most groups but significant effects for a tiny, hyper-specific subgroup?
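The subgroup problem is easy to demonstrate with a quick simulation (a hypothetical sketch of my own, not from the article or the replication markets project): test enough subgroups on pure noise, and something will dip below p = 0.05 by chance alone.

```python
# Sketch: how often does a study of pure noise "find" a significant
# subgroup? With 20 subgroup tests at alpha = 0.05, chance alone gives
# roughly 1 - 0.95**20, or about 64%, of studies at least one hit.
import random

random.seed(0)

def fake_study(n_subgroups=20, n_per_group=50):
    """Count subgroups showing a 'significant' difference between two
    groups drawn from the SAME distribution (i.e., no real effect)."""
    significant = 0
    for _ in range(n_subgroups):
        a = [random.gauss(0, 1) for _ in range(n_per_group)]
        b = [random.gauss(0, 1) for _ in range(n_per_group)]
        # Two-sample z-test approximation (known unit variance).
        mean_diff = sum(a) / n_per_group - sum(b) / n_per_group
        se = (2 / n_per_group) ** 0.5
        if abs(mean_diff / se) > 1.96:  # roughly p < 0.05, two-tailed
            significant += 1
    return significant

# Fraction of 1,000 noise-only "studies" with at least one significant subgroup.
hits = sum(fake_study() > 0 for _ in range(1000))
print(f"{hits / 10:.0f}% of noise-only studies found a 'significant' subgroup")
```

The point isn’t the exact percentage, but that a headline finding confined to one small subgroup is exactly what noise plus many comparisons would produce.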
“Predicting replication is easy,” Menard writes. “There’s no need for a deep dive into the statistical methodology or a rigorous examination of the data, no need to scrutinize esoteric theories for subtle errors—these papers have obvious, surface-level problems.”
And the bad news is that this hasn’t changed anything! We’ve known for at least a decade about the replication crisis. Many researchers have worked full time on explaining the statistical mistakes that make papers unreliable, proposing and pushing for new methods that should work better, and tracking the reasons papers are retracted or refuted.
Here at Vox, we’ve written about how the replication crisis can guide us to do better science. And yet blatantly shoddy work is still being published in peer-reviewed journals despite errors that a layperson can see. Journals effectively aren’t held accountable for bad papers — many, like the Lancet, have retained their prestige even after a long string of embarrassing public incidents where they published research that turned out fraudulent or nonsensical.
Furthermore, once published, the obviously bad papers are cited by others. Menard finds almost no correlation between a paper’s probability of replicating and its number of citations. It’s hard to guess why, but he theorizes that many scientists don’t thoroughly check — or even read — papers once published, expecting that if they’re peer-reviewed, they’re fine. So bad papers are published — and once they’re published, no one is looking closely enough to care that they’re bad papers.
That’s discouraging and infuriating. It suggests that the replication crisis isn’t a one-time methodological reckoning, but a symptom of a scientific system that needs rethinking on many levels. We can’t just teach scientists how to write better papers; we also need to change the fact that those better papers aren’t cited more often than bad papers, that bad papers are almost never retracted even when their errors are visible to any lay reader, and that there are no consequences for bad research.
We need a more sophisticated understanding of the replication crisis, not as a moment of realization after which we were able to move forward with higher standards, but as an ongoing rot in the scientific process which a decade of work hasn’t fixed.
Some caveats are in order. Science has done, and is still doing, great things. Many scientists do important and valuable work — and in a way, the fact that bad papers are identifiable by lay people is good news, because it means we have the tools to identify and appreciate the good stuff, too.
Our scientific institutions are valuable and we need, more than ever, the tools they’ve built to help us understand the world. There’s no cause for hopelessness here, even if some frustration is very thoroughly justified. Science needs saving, sure — but science is very much worth saving.