Today more than ever, decisions are made using statistical tests. Researchers use them to decide whether a new drug is more effective than a placebo; companies use them to interpret those increasingly ubiquitous A/B tests. After all, statistical tests offer a clearheaded, rational, and unbiased way to make sense of data.
Or do they?
According to new research by Blakeley McShane, an associate professor of marketing at the Kellogg School, one of the most popular statistical tests, the null hypothesis significance test, is often interpreted incorrectly. Researchers are biased to pay attention only to whether a test is “statistically significant,” which by convention occurs when a measure known as the test’s p-value is below 0.05. Should the p-value fall below the 0.05 threshold, a new drug or corporate strategy is deemed a success; should it fall just above, it is treated as worthless.
But this all-or-nothing view of hypothesis testing is wrong, says McShane. “The p-value is actually a continuous measure of evidence.” That means a p-value just above the 0.05 threshold is essentially identical to one just below it, and thus offers similar support for a conclusion. “Nonetheless, in practice people treat p-values dichotomously.”
McShane and his coauthor, David Gal of the University of Illinois at Chicago, find that, ironically, experts are more likely than novices to be blinded by the glare of statistical significance, dismissing a study whose p-value is just a wee bit above the magical threshold.
“These aren’t freshmen enrolled in Stat 101,” says McShane. “These are the editorial board of Psychological Science and authors from the New England Journal of Medicine and the American Economic Review—top journals in their respective fields.” And yet they may be ignoring promising ideas and treatments.
The Gold Standard
For obvious reasons, researchers cannot test whether a drug works on all potential patients. It is similarly impractical for a fast-food chain to test an ad’s potency on all conceivable customers. So instead, researchers try out new drugs and ads on representative samples of patients and customers. Then, they use hypothesis tests to assess whether any effects observed in that sample are likely to be “real” or whether they are consistent with being a chance occurrence.
When a hypothesis test yields a p-value below 0.05, researchers declare their result statistically significant and believe they have found a real effect. This 0.05 value is taken very seriously. “Why? Well, when you get a p-value below 0.05, you’ve struck gold and can publish your results!” says McShane. “But if it’s above 0.05, you’re hosed and it’s back to the drawing board.”
However, there is nothing magical about the number 0.05—except perhaps how dominant it has become in practice. “Everybody knows the 0.05 standard is a fairly arbitrary convention,” says McShane. “Nonetheless, you get to anoint your results statistically significant and publish away.”
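To make the arbitrariness concrete, here is a minimal sketch of how little separates a "significant" result from a "non-significant" one. It assumes a pooled two-proportion z-test on cure rates; the test choice, the function name `two_proportion_p_value`, and the patient counts are all hypothetical illustrations, not the materials McShane and Gal used.

```python
import math

def two_proportion_p_value(cured_a, n_a, cured_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test (illustrative only)."""
    rate_a = cured_a / n_a
    rate_b = cured_b / n_b
    pooled = (cured_a + cured_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_a - rate_b) / std_err
    # For a standard normal variable, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical data: the two studies differ by one patient in the Drug B arm.
p_study_1 = two_proportion_p_value(52, 100, 38, 100)
p_study_2 = two_proportion_p_value(52, 100, 39, 100)

for label, p in [("Study 1", p_study_1), ("Study 2", p_study_2)]:
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"{label}: p = {p:.3f} ({verdict} at 0.05)")
```

Under these assumed numbers, one additional cured patient in the comparison group moves the p-value from just under 0.05 to just over it, even though the two studies offer nearly the same evidence.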
McShane and Gal wondered whether the 0.05 threshold has become so ingrained in the minds of researchers that they would simply discount any results that were over that threshold.
So in their first set of experiments, they provided a summary of a hypothetical study to hundreds of academic and industry researchers, including the editorial boards of prestigious journals as well as authors who publish in them. The study compared two different treatments for terminal patients and tracked how long they lived after diagnosis. Critically, some researchers saw a summary that included a p-value below the 0.05 threshold, while others saw an identical summary, except the p-value was above the threshold.
Next, participants were asked a simple descriptive question about the study—one that did not require any statistical training to answer and for which the p-value was entirely irrelevant. “We just asked which group of subjects ended up living longer on average, those who were assigned to take treatment A or those who were assigned to take treatment B,” says McShane.
McShane and Gal found that the p-value had a huge impact on the researchers’ responses—even though it was irrelevant to the question. “About 90% correctly described the data when we gave them a p-value that was below 0.05,” McShane says, “while only about 15% did when we gave them one that was above 0.05.”
Put another way, when the p-value indicated statistical significance, participants correctly answered that those who took treatment A lived longer on average than those who took treatment B; when the p-value missed statistical significance, they failed to identify the difference in average lifetimes.
It was their expertise that doomed them. Almost 75% of naïve participants—undergraduates who had never taken a statistics class—answered the same question correctly regardless of whether the p-value was statistically significant or not.
“They didn’t know what the p-value meant, so they ignored it,” says McShane. “Since it didn’t matter in this case, that was the right thing to do, so it really behooved them.”
A Stubborn Tendency
In another set of experiments, McShane and Gal found that the notion of statistical significance clouded researchers’ judgments when they were asked which of two drugs was more likely to be more effective for a hypothetical new patient. When the p-value was statistically significant, about 80% of participants judged that a hypothetical new patient would be better off with the drug that performed better in the study (Drug A); when the p-value was not statistically significant, that percentage dropped to about 20%.
Perhaps even more alarmingly, efforts to make the scenario hit closer to home—by asking researchers to pretend that they themselves were the patient and to choose Drug A, Drug B, or express indifference—still revealed a steep drop based on the p-value: about 90% chose Drug A when the p-value was statistically significant versus about 50% when it was not. With their own lives at stake, participants were somewhat more willing to go with the more effective treatment when the study was not statistically significant. But “we still get a big drop, reflecting a focus on statistical significance—even when they are making a personally consequential choice,” says McShane.
Moreover, participants responded to the questions the same way when presented with a p-value that just barely missed the 0.05 threshold as they did when presented with one that missed it by leaps and bounds. “As we further and further increased the p-value, our participants’ responses to either question did not change,” says McShane, “even though, as the p-value goes up and up, the evidence in favor of one drug over the other is weaker and weaker.”
Even manipulating the difference in effectiveness between the two drugs in the study—Drug A curing 20 percentage points more patients than Drug B versus just 8 percentage points more, for instance—did not sway researchers from focusing solely on statistical significance.
“When we varied the magnitude of the treatment difference—something you should really care about if you are a patient—there was essentially no impact on our results,” says McShane. “Our participants seemed to focus exclusively on the p-value—and not just on the p-value itself, but on whether or not it was below 0.05.”
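The size of a treatment difference and its p-value answer different questions, which a short sketch can illustrate. The 20-point and 8-point differences echo the figures above; the sample sizes, cure counts, the function name, and the choice of a pooled two-proportion z-test are hypothetical assumptions made for illustration.

```python
import math

def two_proportion_p_value(cured_a, n_a, cured_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test (illustrative only)."""
    pooled = (cured_a + cured_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (cured_a / n_a - cured_b / n_b) / std_err
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical small study: Drug A cures 60%, Drug B cures 40% (20-point gap).
diff_large_effect = 24 / 40 - 16 / 40
p_large_effect = two_proportion_p_value(24, 40, 16, 40)

# Hypothetical large study: Drug A cures 48%, Drug B cures 40% (8-point gap).
diff_small_effect = 240 / 500 - 200 / 500
p_small_effect = two_proportion_p_value(240, 500, 200, 500)

print(f"20-point gap, 40 patients per arm:  p = {p_large_effect:.3f}")
print(f"8-point gap, 500 patients per arm:  p = {p_small_effect:.3f}")
```

With these assumed counts, the larger difference misses the 0.05 threshold while the smaller one clears it, simply because of sample size. A reader who looks only at whether p is below 0.05 would lose sight of which gap matters more to a patient.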
Beyond Magical Thinking
This finding has obvious relevance to researchers, who may be overlooking powerful treatments, solutions, and ideas because their p-value fails to pass an arbitrary threshold. But steering clear of magical thinking about 0.05 may be easier said than done: misunderstandings of the p-value have long plagued the scientific community. In the article, McShane says, “we cite things from the 50s, 60s, 70s, 80s, 90s, and 2000s that all make more or less the same point” about the arbitrariness of the 0.05 threshold.
McShane and Gal’s work also has increasing relevance for business practitioners who conduct tests to better understand customers.
Of course, A/B testing is nothing new, McShane says. “Retail-catalog companies like Eddie Bauer and Lands’ End have been running tests for decades, for instance, sending out different versions of the catalog (perhaps different prices, perhaps different photographs, etc.) to see which worked best. And Crayola was testing email campaigns to drive traffic to their website almost twenty years ago.”
As operations have increasingly shifted online, corporate experimentation has correspondingly exploded. Today, almost everything that can be quantified is also being tested.
So what can businesses do to ensure that they are not blinded by statistical significance? “Take a more holistic view,” says McShane. Rather than focusing single-mindedly on p-values or any other statistical results, managers should also take into account the context in which the results were derived. How well was the test designed? How representative was the sample that was tested? Is there related evidence from similar studies or historical data? And what are the real-world costs and benefits of implementing a new strategy?
“You might have a statistically significant p-value, but how much do you think revenue will increase if you implement the new strategy, how much will it cost, and will switching gears frustrate your customers?” says McShane. “Although they can be difficult to quantify, the real-world costs and benefits really matter—as much as and probably more than the result of the hypothesis test.” And these costs and benefits can vary wildly by industry.
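One rough way to put the cost-and-benefit questions side by side with a test result is a back-of-the-envelope calculation. The sketch below is only an illustration: the function `expected_net_benefit`, the revenue figure, the switching cost, and the lift scenarios are all hypothetical assumptions, not numbers from the research.

```python
def expected_net_benefit(baseline_revenue, expected_lift, switching_cost):
    """Projected revenue gain minus the cost of switching (illustrative only)."""
    return baseline_revenue * expected_lift - switching_cost

# Hypothetical inputs a manager would have to estimate for their own business.
baseline_revenue = 1_000_000   # annual revenue affected by the change
switching_cost = 15_000        # one-time cost of implementing the new strategy
lift_scenarios = {"pessimistic": 0.00, "central": 0.02, "optimistic": 0.04}

net_by_scenario = {
    name: expected_net_benefit(baseline_revenue, lift, switching_cost)
    for name, lift in lift_scenarios.items()
}

for name, net in net_by_scenario.items():
    print(f"{name:>11}: net benefit = {net:,.0f}")
```

Under these assumed inputs, whether the change pays off depends on the plausible range of revenue lift and on the cost of switching, neither of which is captured by whether a p-value falls below 0.05. Harder-to-quantify factors such as customer frustration would have to be weighed separately.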
Finally, while it may be tempting to throw a naïve undergraduate onto your data-science team to thwart p-value fever, McShane does not recommend this. “Evaluating uncertain evidence is really hard. This is the rare case where the naïve person performs better,” he says.