For many scientists, there is a God, and its name is “p.” The p-value, an indicator of statistical significance, has become something of a benchmark for scientific publication — and it’s destroying the credibility of science.
Say I’m a scientist studying whether soccer referees are biased against players with darker skin tones. I’m hypothesizing that such players are disproportionately red-carded, so I’ll analyze my data and run a statistical regression. At the end, I’ll get a p-value: the probability that a result at least as extreme as mine would occur by chance if there were no bias at all. A p-value less than 0.05 is the Holy Grail: a less-than-5-percent chance of the result occurring randomly is considered statistically significant.
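To make that concrete, here is a minimal simulation sketch in Python (with invented red-card counts, not the study’s actual data) of what a p-value measures: how often pure chance would produce a gap at least as large as the one observed, assuming referees are not biased at all.

```python
# A sketch of what a p-value measures, using made-up numbers (not real data).
# Suppose darker-skinned players drew red cards in 32 of 1,000 appearances
# and lighter-skinned players in 25 of 1,000. If there were truly no bias,
# how often would chance alone produce a gap at least that large?
import numpy as np

rng = np.random.default_rng(0)
n = 1_000                             # appearances per group (hypothetical)
observed_gap = 32 / n - 25 / n        # observed difference in red-card rates
pooled_rate = (32 + 25) / (2 * n)     # the single shared rate under "no bias"

sims = 100_000
a = rng.binomial(n, pooled_rate, sims) / n   # simulated darker-skin group
b = rng.binomial(n, pooled_rate, sims) / n   # simulated lighter-skin group
p_value = np.mean(a - b >= observed_gap)     # share of chance gaps this big
print(f"p = {p_value:.3f}")  # below 0.05 would be declared "significant"
```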
This study actually happened. A University of Virginia researcher compiled a data set covering more than 140,000 player-referee interactions and sent it out to 29 different data-analysis teams. The results were all over the map: 20 teams found a positive association between darker skin tone and more red cards, ranging from weak to very strong, while nine found no relationship at all. From the same data, they all derived different p-values (and conclusions). And that was the point of the study: It was never really about soccer at all.
The lesson here is that data do not speak for themselves. In interpreting research outcomes, scientists must make decisions that are inherently subjective, and those choices change the p-value that comes out at the end. Most published data have been analyzed just once, which gives researchers the freedom to choose an analytical method that produces the results they want.
“P-hacking” is the practice of selecting variables and defining terms until a significant p-value appears, and scientists have every incentive to produce statistically significant results. Psychologist Uri Simonsohn analyzed the reported p-values in a large set of psychology papers and found that they cluster suspiciously just below 0.05, exactly what we would expect if scientists aim for that target when choosing how to run their analyses.
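A toy simulation shows how easy that is. The sketch below uses purely random numbers with no real effects in them, yet testing 20 unrelated variables against one outcome gives roughly a 64 percent chance that at least one clears the 0.05 bar by luck alone.

```python
# A toy illustration of p-hacking: all of this data is random noise,
# but screening 20 unrelated variables and reporting the "best" one
# will often turn up something "significant."
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_players, n_variables = 200, 20

outcome = rng.normal(size=n_players)                    # e.g., red cards (noise)
candidates = rng.normal(size=(n_variables, n_players))  # 20 unrelated predictors

p_values = [stats.pearsonr(x, outcome)[1] for x in candidates]
print(f"smallest of 20 p-values: {min(p_values):.3f}")
# With 20 independent tests at the 0.05 threshold, the chance that at least
# one looks "significant" by luck alone is about 1 - 0.95**20, or roughly 64%.
```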
The problem is that p-values don’t tell you whether your hypothesis is correct or important. They’re just the odds of seeing your result under certain assumptions, chief among them that there is no real effect. But when p-values are used as benchmarks for publication, researchers have a strong incentive to manufacture significance by manipulating variables and sample sizes.
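Sample sizes offer the same leverage. In the sketch below (again purely random data with no true effect, Python with SciPy assumed), an analyst peeks at the test after every batch of added subjects and stops the moment p dips below 0.05, which pushes the false-positive rate several times past the nominal 5 percent.

```python
# A toy illustration of "peeking": there is no real effect, but the analyst
# keeps adding ten subjects per group and stops as soon as p < 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
runs, false_positives = 1_000, 0

for _ in range(runs):
    group_a = list(rng.normal(size=10))
    group_b = list(rng.normal(size=10))
    for _ in range(20):                        # up to 20 rounds of "a few more"
        if stats.ttest_ind(group_a, group_b).pvalue < 0.05:
            false_positives += 1               # declared "significant" and stopped
            break
        group_a.extend(rng.normal(size=10))
        group_b.extend(rng.normal(size=10))

print(f"false-positive rate with peeking: {false_positives / runs:.0%}")
# A single fixed-size test would be wrong about 5% of the time; stopping
# whenever the number looks good multiplies that several times over.
```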
Contrived, p-hacked data are a huge problem. A widely cited 2005 analysis argued that a published scientific claim is more likely to be false than true; in other words, more than half of published results may not hold up when retested.
That’s staggering. It’s also the result of a system that lavishes rewards on novel results and cares far less about replication studies. It’s hard to convince researchers to spend their hard-earned grant money retesting other people’s studies when they could be p-hacking their way to publication glory instead.
That’s not to say there isn’t hope. The journal Basic and Applied Social Psychology has decided to do away with p-values entirely in favor of “strong descriptive statistics.” Blogs like Retraction Watch monitor papers retracted for scientific impropriety and shame disingenuous researchers. Open science is becoming a real movement.
Statistics are only part of the picture. Every data set tells a story, and scientists have an ethical obligation to interpret it within the boundaries of common sense and integrity. That might mean ditching p-values for good.
Jack Siglin is a senior physiology and neurobiology major. He can be reached at jsiglindbk@gmail.com.