Blinded by Statistical Significance
Data Analytics | Strategy | Marketing | Dec 7, 2015

Putting too much stock in an arbitrary threshold may lead to bad decisions.

Based on the research of Blakeley B. McShane and David Gal

Today more than ever, decisions are made using statistical tests. Researchers use them to decide whether a new drug is more effective than a placebo; companies use them to interpret those increasingly ubiquitous A/B tests. After all, statistical tests offer a clearheaded, rational, and unbiased way to make sense of data.

Or do they?

According to new research by Blakeley McShane, an associate professor of marketing at the Kellogg School, one of the most popular statistical tests, the null hypothesis significance test, is often interpreted incorrectly. Researchers are biased to pay attention only to whether a test is “statistically significant,” which by convention occurs when a measure known as the test’s p-value is below 0.05. Should the p-value fall below the 0.05 threshold, a new drug or corporate strategy is deemed a success; should it fall just above, it is treated as worthless.

But this all-or-nothing view of hypothesis testing is wrong, says McShane. The p-value is actually a “continuous measure of evidence” — meaning that a p-value just above the 0.05 threshold is essentially identical to one just below it, and thus offers similar support for a conclusion. “Nonetheless, in practice people treat p-values dichotomously.”

McShane and his coauthor, David Gal of the University of Illinois at Chicago, find that, ironically, experts are more likely than novices to be blinded by the glare of statistical significance, dismissing a study whose p-value is just a wee bit above the magical threshold.

“These aren’t freshmen enrolled in Stat 101,” says McShane. “These are the editorial board of Psychological Science and authors from the New England Journal of Medicine and the American Economic Review — top journals in their respective fields.” And yet they may be ignoring promising ideas and treatments.

The Gold Standard

For obvious reasons, researchers cannot test whether a drug works on all potential patients. It is similarly impractical for a fast-food chain to test an ad’s potency on all conceivable customers. So instead, researchers try out new drugs and ads on representative samples of patients and customers. Then, they use hypothesis tests to assess whether any effects observed in that sample are likely to be “real” or whether they are consistent with being a chance occurrence.

When a hypothesis test yields a p-value below 0.05, researchers declare their result statistically significant and believe they have found a real effect. This 0.05 value is taken very seriously. Why? “Well, when you get a p-value below 0.05, you’ve struck gold and can publish your results!” says McShane. “But if it’s above 0.05, you’re hosed and it’s back to the drawing board.”
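
To make the decision rule concrete, here is a minimal sketch in Python (an illustration, not anything from McShane and Gal’s study). It computes two-sided p-values for a few nearly identical test statistics and shows why a p-value just above 0.05 carries essentially the same evidence as one just below it: the p-value changes smoothly, while the significance verdict flips abruptly at the threshold.

```python
# A minimal sketch of the decision rule McShane critiques: a test yields a
# p-value, and the arbitrary 0.05 cutoff decides "success" vs. "failure."
from scipy import stats

# Two-sided p-values for nearly identical z-statistics.
for z in (1.90, 1.95, 1.96, 1.97, 2.00):
    p = 2 * stats.norm.sf(z)  # two-sided tail probability
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"z = {z:.2f}  ->  p = {p:.4f}  ({verdict})")

# The p-values change smoothly (roughly 0.057, 0.051, 0.050, 0.049, 0.046),
# yet the dichotomous verdict flips at the 0.05 line.
```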

However, there is nothing magical about the number 0.05 — except perhaps how dominant it has become in practice. “Everybody knows the 0.05 standard is a fairly arbitrary convention,” says McShane. “Nonetheless, you get to anoint your results statistically significant and publish away.”

Expertly Confused

McShane and Gal wondered whether the 0.05 threshold has become so ingrained in the minds of researchers that they would simply discount any result whose p-value exceeded it.

So in their first set of experiments, they provided a summary of a hypothetical study to hundreds of academic and industry researchers, including the editorial boards of prestigious journals as well as authors who publish in them. The study compared two different treatments for terminal patients and tracked how long they lived after diagnosis. Critically, some researchers saw a summary that included a p-value below the 0.05 threshold, while others saw an identical summary, except the p-value was above the threshold.

Next, participants were asked a simple descriptive question about the study — one that did not require any statistical training to answer and for which the p-value was entirely irrelevant. “We just asked which group of subjects ended up living longer on average, those who were assigned to take treatment A or those who were assigned to take treatment B,” says McShane.

McShane and Gal found that the p-value had a huge impact on the researchers’ responses — even though it was irrelevant to the question. “About 90% correctly described the data when we gave them a p-value that was below 0.05,” McShane says, “while only about 15% did when we gave them one that was above 0.05.”

Put another way, the participants correctly answered that those who took treatment A lived longer on average than those who took treatment B when the p-value indicated statistical significance; but they failed to identify the difference in average lifetimes when the p-value missed statistical significance.

It was their expertise that doomed them. Almost 75% of naïve participants — undergraduates who had never taken a statistics class — answered the same question correctly regardless of whether the p-value was statistically significant or not.

“They didn’t know what the p-value meant, so they ignored it,” says McShane. “Since it didn’t matter in this case, that was the right thing to do, so it really behooved them.”

A Stubborn Tendency

In another set of experiments, McShane and Gal found that the notion of statistical significance clouded researchers’ judgments when they were asked which of two drugs was more likely to be more effective for a hypothetical new patient. When the p-value was statistically significant, about 80% of participants judged that a hypothetical new patient would be better off with the drug that performed better in the study (Drug A); when the p-value was not statistically significant, that percentage dropped to about 20%.

Perhaps even more alarmingly, efforts to make the scenario hit closer to home — by asking researchers to pretend that they themselves were the patient and to choose Drug A, Drug B, or express indifference — still revealed a steep drop based on the p-value: about 90% chose Drug A when the p-value was statistically significant versus about 50% when it was not. With their own lives at stake, participants were somewhat more willing to go with the more effective treatment when the study was not statistically significant. “But we still get a big drop, reflecting a focus on statistical significance — even when they are making a personally consequential choice,” says McShane.

Moreover, participants responded to the questions the same way when presented with a p-value that just barely missed the 0.05 threshold as they did when presented with one that missed it by leaps and bounds. “As we further and further increased the p-value, our participants’ responses to either question did not change,” says McShane, “even though, as the p-value goes up and up, the evidence in favor of one drug over the other is weaker and weaker.”

Even manipulating the difference in effectiveness between the two drugs in the study — Drug A curing 20 percentage points more patients than Drug B versus just 8 percentage points more, for instance — did not sway researchers from focusing solely on statistical significance.

“When we varied the magnitude of the treatment difference — something you should really care about if you are a patient — there was essentially no impact on our results,” says McShane. “Our participants seemed to focus exclusively on the p-value — and not just on the p-value itself, but on whether or not it was below 0.05.”
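
As a back-of-the-envelope illustration of why that magnitude should matter (this arithmetic is illustrative, not from the study itself), each cure-rate gap can be converted into a rough “number needed to treat,” the number of patients who would have to receive Drug A instead of Drug B for one additional cure:

```python
# Hypothetical arithmetic: a larger cure-rate gap means far fewer patients
# must switch drugs to produce one additional cure.
for gap in (0.20, 0.08):   # Drug A cures 20 vs. 8 percentage points more patients
    nnt = 1 / gap          # rough "number needed to treat" for one extra cure
    print(f"cure-rate gap of {gap:.0%}: about {nnt:.1f} patients treated per extra cure")
```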

Beyond Magical Thinking

This finding has obvious relevance to researchers, who may be overlooking powerful treatments, solutions, and ideas because their p-value fails to pass an arbitrary threshold. But steering clear of magical thinking about 0.05 may be easier said than done: misunderstandings have long plagued the scientific community. In the article, McShane says, “we cite things from the 50s, 60s, 70s, 80s, 90s, and 2000s that all make more or less the same point” about the arbitrariness of the 0.05 threshold.

McShane and Gal’s work also has increasing relevance for business practitioners who conduct tests to better understand customers.

Of course, A/B testing is nothing new, McShane says. “Retail-catalogue companies like Eddie Bauer and Lands’ End have been running tests for decades, for instance, sending out different versions of the catalog — perhaps different prices, perhaps different photographs, etc. — to see which worked best. And Crayola was testing email campaigns to drive traffic to their website almost twenty years ago.”

As operations have increasingly shifted online, corporate experimentation has correspondingly exploded. Today, almost everything that can be quantified is also being tested.

So what can businesses do to ensure that they are not blinded by statistical significance? “Take a more holistic view,” says McShane. Rather than focusing single-mindedly on p-values or any other statistical results, managers should also take into account the context in which the results were derived. How well was the test designed? How representative was the sample that was tested? Is there related evidence from similar studies or historical data? And what are the real-world costs and benefits of implementing a new strategy?

“You might have a statistically significant p-value, but how much do you think revenue will increase if you implement the new strategy, how much will it cost, and will switching gears frustrate your customers?” says McShane. “Although they can be difficult to quantify, the real-world costs and benefits really matter — as much as and probably more than the result of the hypothesis test.” And these costs and benefits can vary wildly by industry.
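
To see what that more holistic read might look like for an A/B test, here is a short, hypothetical Python sketch (not drawn from the research). It reports the p-value, but alongside the estimated lift, a confidence interval, and a rough cost-benefit comparison; every input, from traffic and conversion counts to margin and rollout cost, is invented for illustration.

```python
# Hypothetical A/B test read "holistically": p-value, estimated lift and its
# uncertainty, and a crude check of whether the lift would pay for the rollout.
import math
from scipy import stats

# Invented test results.
n_a, conv_a = 20_000, 1_030   # control: visitors, conversions
n_b, conv_b = 20_000, 1_120   # variant: visitors, conversions

p_a, p_b = conv_a / n_a, conv_b / n_b
lift = p_b - p_a

# Two-proportion z-test (pooled standard error under the null of no difference).
p_pool = (conv_a + conv_b) / (n_a + n_b)
se_null = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = lift / se_null
p_value = 2 * stats.norm.sf(abs(z))

# 95% confidence interval for the lift (unpooled standard error).
se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
ci = (lift - 1.96 * se, lift + 1.96 * se)

# Crude cost-benefit check with invented business numbers.
value_per_conversion = 40.0    # margin per conversion (hypothetical)
monthly_visitors = 500_000     # traffic the change would apply to (hypothetical)
rollout_cost = 150_000.0       # one-time cost of switching (hypothetical)
monthly_gain = lift * monthly_visitors * value_per_conversion

print(f"conversion rates: A = {p_a:.2%}, B = {p_b:.2%}, lift = {lift:.2%}")
print(f"z = {z:.2f}, p-value = {p_value:.3f}, 95% CI for lift = ({ci[0]:.2%}, {ci[1]:.2%})")
print(f"estimated monthly gain ${monthly_gain:,.0f} vs. one-time rollout cost ${rollout_cost:,.0f}")
```

The point of the sketch is the framing rather than the particular numbers: a decision that rests only on whether the p-value crosses 0.05 ignores the size of the lift, its uncertainty, and whether the change is worth its cost.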

Finally, while it may be tempting to throw a naïve undergraduate onto your data-science team to thwart p-value fever, McShane does not recommend this. “Evaluating uncertain evidence is really hard. This is the rare case where the naïve person performs better,” he says.

About the Writer

Jessica Love is editor-in-chief of Kellogg Insight.

About the Research

McShane, Blakeley B., and David Gal. 2015. “Blinding Us to the Obvious? The Effect of Statistical Training on the Evaluation of Evidence.” Management Science 62(6): 1707–1718.
