Data Analytics · Strategy · Marketing · Dec 7, 2015

Blinded by Statistical Significance

Putting too much stock in an arbitrary threshold may lead to bad decisions.

Illustration by Yevgenia Nayberg

Based on the research of Blakeley B. McShane and David Gal

Today more than ever, decisions are made using statistical tests. Researchers use them to decide whether a new drug is more effective than a placebo; companies use them to interpret those increasingly ubiquitous A/B tests. After all, statistical tests offer a clearheaded, rational, and unbiased way to make sense of data.

Or do they?

According to new research by Blakeley McShane, an associate professor of marketing at the Kellogg School, one of the most popular statistical tests, the null hypothesis significance test, is often interpreted incorrectly. Researchers are biased to pay attention only to whether a test is “statistically significant,” which by convention occurs when a measure known as the test’s p-value is below 0.05. Should the p-value fall below the 0.05 threshold, a new drug or corporate strategy is deemed a success; should it fall just above, it is treated as worthless.

But this all-or-nothing view of hypothesis testing is wrong, says McShane. “The p-value is actually a continuous measure of evidence”— meaning that a p-value just above the 0.05 threshold is essentially identical to one just below it, and thus offers similar support for a conclusion. “Nonetheless, in practice people treat p-values dichotomously.”
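
To see concretely what "continuous" means here, consider a minimal sketch in Python (using scipy, with made-up summary numbers that have nothing to do with the study): two nearly identical results land on opposite sides of the 0.05 line.

```python
# A minimal sketch with hypothetical numbers: two trials with the same group
# means and standard deviations, differing only slightly in sample size.
from scipy import stats

for n in (66, 62):
    result = stats.ttest_ind_from_stats(
        mean1=5.2, std1=2.0, nobs1=n,   # treatment group summary
        mean2=4.5, std2=2.0, nobs2=n,   # control group summary
    )
    verdict = "significant" if result.pvalue < 0.05 else "not significant"
    print(f"n = {n} per group: p = {result.pvalue:.3f} -> {verdict}")

# The two p-values fall just below and just above 0.05, yet the evidence for a
# treatment effect is essentially the same in both cases.
```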

McShane and his coauthor, David Gal of the University of Illinois at Chicago, find that, ironically, experts are more likely than novices to be blinded by the glare of statistical significance, dismissing a study whose p-value is just a wee bit above the magical threshold.

“These aren’t freshmen enrolled in Stat 101,” says McShane. “These are the editorial board of Psychological Science and authors from the New England Journal of Medicine and the American Economic Review—top journals in their respective fields.” And yet they may be ignoring promising ideas and treatments.

The Gold Standard

For obvious reasons, researchers cannot test whether a drug works on all potential patients. It is similarly impractical for a fast-food chain to test an ad’s potency on all conceivable customers. So instead, researchers try out new drugs and ads on representative samples of patients and customers. Then, they use hypothesis tests to assess whether any effects observed in that sample are likely to be “real” or whether they are consistent with being a chance occurrence.
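
Here is a minimal sketch of what such a test looks like in practice, written in Python with scipy; the A/B-test numbers are hypothetical and do not come from the research.

```python
# Did version B of an ad really convert better than version A in this sample,
# or is the observed gap consistent with chance? (Hypothetical numbers.)
from scipy import stats

shown_a, bought_a = 5000, 260   # customers shown ad A, and how many bought
shown_b, bought_b = 5000, 315   # customers shown ad B, and how many bought

# 2x2 table: rows are ad versions, columns are bought vs. did not buy
table = [
    [bought_a, shown_a - bought_a],
    [bought_b, shown_b - bought_b],
]
chi2, p_value, dof, expected = stats.chi2_contingency(table)

print(f"conversion A: {bought_a / shown_a:.1%}, "
      f"conversion B: {bought_b / shown_b:.1%}, p = {p_value:.3f}")
# A small p-value says a gap this large would be unlikely if the two ads truly
# performed the same; it does not, by itself, say the gap is big or profitable.
```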

When a hypothesis test yields a p-value below 0.05, researchers declare their result statistically significant and believe they have found a real effect. This 0.05 value is taken very seriously. “Why? Well, when you get a p-value below 0.05, you’ve struck gold and can publish your results!” says McShane. “But if it’s above 0.05, you’re hosed and it’s back to the drawing board.”

However, there is nothing magical about the number 0.05—except perhaps how dominant it has become in practice. “Everybody knows the 0.05 standard is a fairly arbitrary convention,” says McShane. “Nonetheless, you get to anoint your results statistically significant and publish away.”

Expertly Confused

McShane and Gal wondered whether the 0.05 threshold had become so ingrained in the minds of researchers that they would simply discount any result whose p-value fell above it.

So in their first set of experiments, they provided a summary of a hypothetical study to hundreds of academic and industry researchers, including the editorial boards of prestigious journals as well as authors who publish in them. The study compared two different treatments for terminal patients and tracked how long they lived after diagnosis. Critically, some researchers saw a summary that included a p-value below the 0.05 threshold, while others saw an identical summary, except the p-value was above the threshold.

Next, participants were asked a simple descriptive question about the study—one that did not require any statistical training to answer and for which the p-value was entirely irrelevant. “We just asked which group of subjects ended up living longer on average, those who were assigned to take treatment A or those who were assigned to take treatment B,” says McShane.

McShane and Gal found that the p-value had a huge impact on the researchers’ responses—even though it was irrelevant to the question. “About 90% correctly described the data when we gave them a p-value that was below 0.05,” McShane says, “while only about 15% did when we gave them one that was above 0.05.”

Put another way, the participants correctly answered that those who took treatment A lived longer on average than those who took treatment B when the p-value indicated statistical significance; but, they failed to identify the difference in average lifetimes when the p-value missed statistical significance.

It was their expertise that doomed them. Almost 75% of naïve participants—undergraduates who had never taken a statistics class—answered the same question correctly regardless of whether the p-value was statistically significant or not.

“They didn’t know what the p-value meant, so they ignored it,” says McShane. “Since it didn’t matter in this case, that was the right thing to do, so it really behooved them.”

A Stubborn Tendency

In another set of experiments, McShane and Gal found that the notion of statistical significance clouded researchers’ judgments when they were asked which of two drugs was more likely to be more effective for a hypothetical new patient. When the p-value was statistically significant, about 80% of participants judged that a hypothetical new patient would be better off with the drug that performed better in the study (Drug A); when the p-value was not statistically significant, that percentage dropped to about 20%.

Perhaps even more alarmingly, efforts to make the scenario hit closer to home—by asking researchers to pretend that they themselves were the patient and to choose Drug A, Drug B, or express indifference—still revealed a steep drop based on the p-value: about 90% chose Drug A when the p-value was statistically significant versus about 50% when it was not. With their own lives at stake, participants were somewhat more willing to go with the more effective treatment when the study was not statistically significant. But “we still get a big drop, reflecting a focus on statistical significance—even when they are making a personally consequential choice,” says McShane.

Moreover, participants responded to the questions the same way when presented with a p-value that just barely missed the 0.05 threshold as they did when presented with one that missed it by leaps and bounds. “As we further and further increased the p-value, our participants’ responses to either question did not change,” says McShane, “even though, as the p-value goes up and up, the evidence in favor of one drug over the other is weaker and weaker.”

Even manipulating the difference in effectiveness between the two drugs in the study—Drug A curing 20 percentage points more patients than Drug B versus just 8 percentage points more, for instance—did not sway researchers from focusing solely on statistical significance.

“When we varied the magnitude of the treatment difference—something you should really care about if you are a patient—there was essentially no impact on our results,” says McShane. “Our participants seemed to focus exclusively on the p-value—and not just on the p-value itself, but on whether or not it was below 0.05.”
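
A small sketch (again in Python with scipy, using hypothetical cure counts rather than the study's materials) shows why both numbers deserve attention: the size of the gap and the p-value answer different questions.

```python
# Hypothetical trials of Drug A vs. Drug B with 100 patients per drug:
# one scenario with a 20-point gap in cure rates, one with an 8-point gap.
from scipy import stats

n = 100
scenarios = {
    "20-point gap": (70, 50),   # patients cured by Drug A, Drug B
    "8-point gap": (58, 50),
}

for label, (cured_a, cured_b) in scenarios.items():
    table = [[cured_a, n - cured_a], [cured_b, n - cured_b]]
    chi2, p_value, dof, expected = stats.chi2_contingency(table)
    gap = (cured_a - cured_b) / n
    print(f"{label}: cure-rate difference = {gap:.0%}, p = {p_value:.3f}")

# A patient should care about both outputs: how much better Drug A looks (the
# gap) and how easily a gap that size could arise by chance (the p-value).
```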

Beyond Magical Thinking

This finding has obvious relevance to researchers, who may be overlooking powerful treatments, solutions, and ideas because their p-value fails to pass an arbitrary threshold. But steering clear of magical thinking about 0.05 may be easier said than done: misunderstandings have long plagued the scientific community. In their paper, McShane says, “we cite things from the 50s, 60s, 70s, 80s, 90s, and 2000s that all make more or less the same point” about the arbitrariness of the 0.05 threshold.

McShane and Gal’s work also has increasing relevance for business practitioners who conduct tests to better understand customers.

Of course, A/B testing is nothing new, McShane says. “Retail-catalog companies like Eddie Bauer and Lands’ End have been running tests for decades, for instance, sending out different versions of the catalog—perhaps different prices, perhaps different photographs, etc.—to see which worked best. And Crayola was testing email campaigns to drive traffic to their website almost twenty years ago.”

As operations have increasingly shifted online, corporate experimentation has correspondingly exploded. Today, almost everything that can be quantified is also being tested.

So what can businesses do to ensure that they are not blinded by statistical significance? “Take a more holistic view,” says McShane. Rather than focusing single-mindedly on p-values or any other statistical results, managers should also take into account the context in which the results were derived. How well was the test designed? How representative was the sample that was tested? Is there related evidence from similar studies or historical data? And what are the real-world costs and benefits of implementing a new strategy?

“You might have a statistically significant p-value, but how much do you think revenue will increase if you implement the new strategy, how much will it cost, and will switching gears frustrate your customers?” says McShane. “Although they can be difficult to quantify, the real-world costs and benefits really matter—as much as and probably more than the result of the hypothesis test.” And these costs and benefits can vary wildly by industry.
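
A back-of-the-envelope sketch, with entirely hypothetical figures, illustrates the kind of weighing McShane describes: the expected gain has to clear the cost of switching, whatever the p-value says.

```python
# All figures are hypothetical placeholders for a manager's own estimates.
baseline_revenue = 10_000_000   # annual revenue under the current strategy ($)
estimated_lift = 0.012          # best-guess relative revenue increase from the test
switching_cost = 150_000        # one-time cost of rolling out the new strategy ($)
frustration_cost = 40_000       # rough allowance for customers annoyed by the change ($)

expected_gain = baseline_revenue * estimated_lift
net_benefit = expected_gain - switching_cost - frustration_cost
print(f"expected gain: ${expected_gain:,.0f}, first-year net benefit: ${net_benefit:,.0f}")

# Here the change loses money in its first year even if the underlying test was
# statistically significant, which is context a bare p-value cannot supply.
```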

Finally, while it may be tempting to throw a naïve undergraduate onto your data-science team to thwart p-value fever, McShane does not recommend this. “Evaluating uncertain evidence is really hard. This is the rare case where the naïve person performs better,” he says.

About the Writer
Jessica Love is editor-in-chief of Kellogg Insight.
About the Research

McShane, Blakeley B., and David Gal. 2016. Blinding Us to the Obvious? The Effect of Statistical Training on the Evaluation of Evidence. Management Science, 62(6), 1707–1718.
