Today more than ever, decisions are made using statistical tests. Researchers use them to decide whether a new drug is more effective than a placebo; companies use them to interpret those increasingly ubiquitous A/B tests. After all, statistical tests offer a clearheaded, rational, and unbiased way to make sense of data.
Or do they?
According to new research by Blakeley McShane, an associate professor of marketing at the Kellogg School, one of the most popular statistical tests, the null hypothesis significance test, is often interpreted incorrectly. Researchers tend to focus only on whether a test is “statistically significant,” which by convention occurs when a measure known as the test’s p-value falls below 0.05. Should the p-value fall below that threshold, a new drug or corporate strategy is deemed a success; should it fall just above, the result is treated as worthless.
But this all-or-nothing view of hypothesis testing is wrong, says McShane. “The p-value is actually a continuous measure of evidence”— meaning that a p-value just above the 0.05 threshold is essentially identical to one just below it, and thus offers similar support for a conclusion. “Nonetheless, in practice people treat p-values dichotomously.”
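McShane’s point about continuity is easy to see numerically. The short Python sketch below is illustrative and not from the article: it converts two nearly identical test statistics into two-sided p-values that land on opposite sides of 0.05, even though the underlying evidence is essentially the same.

```python
import math

def two_sided_p(z: float) -> float:
    """Two-sided p-value for a z statistic under the standard normal.

    Uses the identity Phi(z) = 0.5 * (1 + erf(z / sqrt(2))).
    """
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

# Two hypothetical studies with almost identical test statistics:
p_below = two_sided_p(1.97)  # just under 0.05: "statistically significant"
p_above = two_sided_p(1.95)  # just over 0.05: "not significant"
```

The two p-values differ by only a fraction of a percentage point, yet the dichotomous convention McShane criticizes would declare one study a success and the other a failure.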
McShane and his coauthor, David Gal of the University of Illinois at Chicago, find that, ironically, experts are more likely than novices to be blinded by the glare of statistical significance, dismissing a study whose p-value is just a wee bit above the magical threshold.
“These aren’t freshmen enrolled in Stat 101,” says McShane. “These are the editorial board of Psychological Science and authors from the New England Journal of Medicine and the American Economic Review—top journals in their respective fields.” And yet they may be ignoring promising ideas and treatments.
The Gold Standard
For obvious reasons, researchers cannot test whether a drug works on all potential patients. It is similarly impractical for a fast-food chain to test an ad’s potency on all conceivable customers. So instead, researchers try out new drugs and ads on representative samples of patients and customers. Then, they use hypothesis tests to assess whether any effects observed in that sample are likely to be “real” or whether they are consistent with being a chance occurrence.
When a hypothesis test yields a p-value below 0.05, researchers declare their result statistically significant and believe they have found a real effect. This 0.05 value is taken very seriously. “Why? Well, when you get a p-value below 0.05, you’ve struck gold and can publish your results!” says McShane. “But if it’s above 0.05, you’re hosed and it’s back to the drawing board.”
However, there is nothing magical about the number 0.05—except perhaps how dominant it has become in practice. “Everybody knows the 0.05 standard is a fairly arbitrary convention,” says McShane. “Nonetheless, you get to anoint your results statistically significant and publish away.”
McShane and Gal wondered whether the 0.05 threshold has become so ingrained in the minds of researchers that they would simply discount any results that were over that threshold.
So in their first set of experiments, they provided a summary of a hypothetical study to hundreds of academic and industry researchers, including the editorial boards of prestigious journals as well as authors who publish in them. The study compared two different treatments for terminal patients and tracked how long they lived after diagnosis. Critically, some researchers saw a summary that included a p-value below the 0.05 threshold, while others saw an identical summary, except the p-value was above the threshold.
Next, participants were asked a simple descriptive question about the study—one that did not require any statistical training to answer and for which the p-value was entirely irrelevant. “We just asked which group of subjects ended up living longer on average, those who were assigned to take treatment A or those who were assigned to take treatment B,” says McShane.
McShane and Gal found that the p-value had a huge impact on the researchers’ responses—even though it was irrelevant to the question. “About 90% correctly described the data when we gave them a p-value that was below 0.05,” McShane says, “while only about 15% did when we gave them one that was above 0.05.”
Put another way, the participants correctly answered that those who took treatment A lived longer on average than those who took treatment B when the p-value indicated statistical significance; but, they failed to identify the difference in average lifetimes when the p-value missed statistical significance.
It was their expertise that doomed them. Almost 75% of naïve participants—undergraduates who had never taken a statistics class—answered the same question correctly regardless of whether the p-value was statistically significant or not.
“They didn’t know what the p-value meant, so they ignored it,” says McShane. “Since it didn’t matter in this case, that was the right thing to do, so it really behooved them.”
A Stubborn Tendency
In another set of experiments, McShane and Gal found that the notion of statistical significance clouded researchers’ judgments when they were asked which of two drugs was more likely to be more effective for a hypothetical new patient. When the p-value was statistically significant, about 80% of participants judged that a hypothetical new patient would be better off with the drug that performed better in the study (Drug A); when the p-value was not statistically significant, that percentage dropped to about 20%.
Perhaps even more alarmingly, efforts to make the scenario hit closer to home—by asking researchers to pretend that they themselves were the patient and to choose Drug A, Drug B, or express indifference—still revealed a steep drop based on the p-value: about 90% chose Drug A when the p-value was statistically significant versus about 50% when it was not. With their own lives at stake, participants were somewhat more willing to go with the more effective treatment when the study was not statistically significant. But “we still get a big drop, reflecting a focus on statistical significance—even when they are making a personally consequential choice,” says McShane.
Moreover, participants responded to the questions the same way when presented with a p-value that just barely missed the 0.05 threshold as they did when presented with one that missed it by leaps and bounds. “As we further and further increased the p-value, our participants’ responses to either question did not change,” says McShane, “even though, as the p-value goes up and up, the evidence in favor of one drug over the other is weaker and weaker.”
Even manipulating the difference in effectiveness between the two drugs in the study—Drug A curing 20 percentage points more patients than Drug B versus just 8 percentage points more, for instance—did not sway researchers from focusing solely on statistical significance.
“When we varied the magnitude of the treatment difference—something you should really care about if you are a patient—there was essentially no impact on our results,” says McShane. “Our participants seemed to focus exclusively on the p-value—and not just on the p-value itself, but on whether or not it was below 0.05.”
Beyond Magical Thinking
This finding has obvious relevance to researchers, who may be overlooking powerful treatments, solutions, and ideas because their p-value fails to pass an arbitrary threshold. But steering clear of magical thinking about 0.05 may be easier said than done: misunderstandings have long plagued the scientific community. In the article, McShane says, “we cite things from the 50s, 60s, 70s, 80s, 90s, and 2000s that all make more or less the same point” about the arbitrariness of the 0.05 threshold.
McShane and Gal’s work also has increasing relevance for business practitioners who conduct tests to better understand customers.
Of course, A/B testing is nothing new, McShane says. “Retail-catalog companies like Eddie Bauer and Lands’ End have been running tests for decades, for instance, sending out different versions of the catalog—perhaps different prices, perhaps different photographs, etc.—to see which worked best. And Crayola was testing email campaigns to drive traffic to their website almost twenty years ago.”
As operations have increasingly shifted online, corporate experimentation has correspondingly exploded. Today, almost everything that can be quantified is also being tested.
So what can businesses do to ensure that they are not blinded by statistical significance? “Take a more holistic view,” says McShane. Rather than focusing single-mindedly on p-values or any other statistical results, managers should also take into account the context in which the results were derived. How well was the test designed? How representative was the sample that was tested? Is there related evidence from similar studies or historical data? And what are the real-world costs and benefits of implementing a new strategy?
“You might have a statistically significant p-value, but how much do you think revenue will increase if you implement the new strategy, how much will it cost, and will switching gears frustrate your customers?” says McShane. “Although they can be difficult to quantify, the real-world costs and benefits really matter—as much as and probably more than the result of the hypothesis test.” And these costs and benefits can vary wildly by industry.
Finally, while it may be tempting to throw a naïve undergraduate onto your data-science team to thwart p-value fever, McShane does not recommend this. “Evaluating uncertain evidence is really hard. This is the rare case where the naïve person performs better,” he says.
McShane, Blakeley B., and David Gal. 2016. Blinding Us to the Obvious? The Effect of Statistical Training on the Evaluation of Evidence. Management Science, 62 (6), 1707–1718.