Replicability Crisis


I remember standing in a bathroom with my hands on my hips in a Superman pose, staring at the mirror. “This feels weird,” said one side of my brain. “But…science!” went the other. And so there I stood, posing in the bathroom, minutes before my interview was supposed to start.

40 million views and an outrageously successful TED talk later, power posing had gone mainstream in 2014. This was the idea that posing could boost feelings of confidence, hence my ritual before interviewing. [1] And then it all went to crap after multiple critiques, including one from a co-author of the paper, said the effect didn’t exist. And now apparently power posing is back again. [2]

Similar stories have played out across multiple research findings. [3] How should this affect what we believe? I’ve previously written on how our beliefs could be problematic, even for those adopting an evidence-based approach. The ongoing replication crisis in science is one reason why I feel that way. This video goes into the details of how a lot of scientific research papers tend to be wrong. There’s even a website that tracks papers that have been retracted. Ed Yong of The Atlantic also has a well-sourced article.

Part of this should be expected in science. After all, the scientific method is a continuous process, and we should be updating outdated information with newer results. The replication crisis seems to go beyond what should be expected, though. I find it troubling to read that “It can be proven that most claimed research findings are false”. As described in the video or papers linked above, there seem to be a few causes for this crisis, including prevalent p-hacking, incentive misalignment, and the replication studies themselves being inaccurate.

By p-hacking, I mean the flexibility in data collection, analysis, and reporting that can increase false positive rates in experiments. A simple example: you have 98 students in your experiment, and your p-value is sitting just above the threshold for statistical significance. Why not add 2 more to ‘round things off’? After all, you reason, 2 students originally dropped out, so this shouldn’t be an issue. Small decisions that seem innocuous end up skewing results, because the choice to collect more data was made only after peeking at the outcome.
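To make the ‘just add 2 more students’ example concrete, here is a rough simulation sketch. Everything in it is my own assumption rather than anything from the papers above: there is no real effect at all (the data is pure noise), the test is a simple two-sided z-test with known variance, and ‘just above the threshold’ is taken to mean a p-value between 0.05 and 0.10. It compares how often a fixed-sample experiment finds a ‘significant’ result with how often the round-things-off version does.

```python
import random
from statistics import NormalDist

ALPHA = 0.05      # conventional significance threshold
NEAR_MISS = 0.10  # assumed cutoff for "just above the threshold"
STANDARD_NORMAL = NormalDist()


def p_value(sample):
    """Two-sided z-test p-value for 'mean == 0', assuming unit variance."""
    z = sum(sample) / len(sample) ** 0.5  # sample mean times sqrt(n)
    return 2 * (1 - STANDARD_NORMAL.cdf(abs(z)))


def simulate(trials=10_000, n_initial=98, n_extra=2, seed=0):
    """Return (honest_rate, hacked_rate) of false positives under the null."""
    rng = random.Random(seed)
    honest = hacked = 0
    for _ in range(trials):
        # Pure noise: any 'significant' result here is a false positive.
        sample = [rng.gauss(0, 1) for _ in range(n_initial)]
        p = p_value(sample)
        if p < ALPHA:
            honest += 1
            hacked += 1
        elif p < NEAR_MISS:
            # Near miss: add a couple more participants and test again.
            sample += [rng.gauss(0, 1) for _ in range(n_extra)]
            if p_value(sample) < ALPHA:
                hacked += 1
    return honest / trials, hacked / trials


if __name__ == "__main__":
    honest_rate, hacked_rate = simulate()
    print(f"fixed sample size:       {honest_rate:.3f}")
    print(f"add 2 after a near miss: {hacked_rate:.3f}")
```

The fixed-sample rate should land near the nominal 5 percent. The second rate can never be lower, since it counts every fixed-sample false positive plus any near miss that gets nudged over the line. A single extra peek is the mildest version of this; checking repeatedly as data comes in would give chance even more tries.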

A related issue is how most studies are done on a narrow cultural sample. If you’ve taken a psych class in college, you probably remember having to take part in experiments in order to fulfill quotas for lab rats, er, test subjects. Given the number of studies that use university volunteers as the sample, it’s understandable why findings might not replicate more generally. [4] More than 90 percent of research comes from studies on countries representing less than 15 percent of the world’s population, which is obviously an issue if you want to say your result is a fundamental human finding rather than one limited to a specific background. [5]

What worsens this is that incentives are not aligned for paper publication. I strongly believe that incentive misalignment causes bad outcomes. Scientists are incentivised to publish papers, and publishing new observations is more interesting than repeating older work. If people feel they should aim for quantity over quality, and if journals prefer the new over the old, there’s a structural incentive to push papers with interesting but less rigorous results.

The replication crisis is not limited to science either, with financial anomalies usually failing to replicate too. Here though, part of the reason could be the anomalies being arbitraged away once they are known, so perhaps this is less problematic. [6] We’ll leave the efficient market hypothesis for another day.

What can be done about this? I’m not sure. [7] Better statistics education would help, although given how people still struggle with p-values and I have difficulty with a goat, [8] I’m not sure how effective this may be. Some of the papers linked above have suggested solutions too, including better mentoring, better experiment design, and better teaching.

I am all for science and the scientific method, but the replicability crisis is problematic. I don’t know what solutions will be more effective, though I’d hope for more incentive alignment between science (learning something) and scientists (publishing something). I think that solving that first probably goes a long way towards reducing false positives.

Footnotes:

  1. Incidentally, it did not work at all. My interviews that year all went horribly. I put the blame squarely on power posing and not myself.
  2. With caveats around the feeling of power vs hormonal changes after posing.
  3. Another famous example is the hot hand fallacy, which might not be a fallacy anymore.
  4. I don’t actually know the percentage, but would assume the majority of papers involve conscripted university undergrads.
  5. Perhaps social science should move towards more segmented, cultural conclusions instead of theories that are applicable across all humanity, as some have argued. I don’t know enough to form a point of view here. Science should be universal, but perhaps social science is not?
  6. Separately, Cliff Asness argues about the robustness of certain anomalies in investing here.
  7. If it wasn’t abundantly clear by now, I’m not a scientist. Unless my primary school science center badge counts. I’d like to think it does but have unfortunately been informed otherwise.
  8. Before anyone emails, yes I do understand the Monty Hall problem. Just trying to make a point here that it’s not intuitive.