When AI Thinks Too Much Like a Human
Marketing | Aug 1, 2025


Generative AI models are susceptible to the same errors that humans make when interpreting statistical results.

Illustration by Jesús Escudero: a person pours colorful pellets into a funnel, and the pellets come out of two downspouts as black and white.

Based on the research of Blakeley B. McShane, David Gal, and Adam Duhachek

Summary: As AI models learn to reason in a way that resembles human reasoning, they also fall victim to the same errors that people make. In a series of experiments, Kellogg’s Blake McShane and colleagues discovered that popular generative AI models were susceptible to the same bias as humans when interpreting statistical results. The AI models’ systematic inability to properly interpret even basic results raises doubts about their capacity to perform more-ambitious tasks.

Earlier this year, AI developer Anthropic released a new model that can spend more time “thinking” through a problem, similarly to the way a person might. Stanford and IBM developed AI “twins” of more than 1,000 people that supposedly reason and make decisions just like their real-life counterparts. The hope, for many companies in this space, is to build AI models that reason in a manner that is nearly indistinguishable from (or even better than) the way the humans who use them might reason.

“AI that better mimics humans generally seems like a good thing,” says Blake McShane, a professor of marketing at Kellogg. “But when AI mimics human errors, that’s obviously a bad thing when accuracy is the goal.”

Humans tend to view the world as dichotomous rather than continuous. This black-and-white way of thinking carries over into science when, for example, researchers apply arbitrary thresholds to their results, an approach that can lead to errors in interpretation.

In a new study, McShane and two colleagues from the University of Illinois Chicago, David Gal and Adam Duhachek, found that AI models fall victim to these errors just like human researchers do.

“Given that AI models ‘learn’ from human text and that humans make these mistakes all the time, we hazarded that the AI models would do the same,” McShane says.

“Statistical significance” in scientific practice

Researchers have long relied on statistical tests to interpret the results of a study. One of the most popular, the null hypothesis significance test, produces a measure known as a P-value that falls between zero and one. Conventionally, researchers consider their results “statistically significant” when the P-value is below 0.05 and “statistically nonsignificant” when it is above that threshold.
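To make the convention concrete, here is a minimal Python sketch, using made-up numbers rather than data from any study, of how a two-sample significance test produces a P-value that is then dichotomized at 0.05:

```python
# Minimal sketch of a null hypothesis significance test (illustrative data only).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=8.2, scale=3.0, size=50)  # hypothetical outcomes
group_b = rng.normal(loc=7.5, scale=3.0, size=50)

# The two-sample t-test returns a P-value, a continuous number between 0 and 1.
result = stats.ttest_ind(group_a, group_b)
print(f"P-value: {result.pvalue:.3f}")

# The conventional (and problematic) dichotomization at the 0.05 threshold:
if result.pvalue < 0.05:
    print("statistically significant")
else:
    print("statistically nonsignificant")
```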

A cognitive error very often comes along with this dichotomization: researchers wrongly interpret “statistical significance” as demonstrating the effect they’re studying and “statistical nonsignificance” as demonstrating that there is no effect.

Compounding matters, the 0.05 threshold has become a kind of gatekeeper for publishing research. Studies that report “statistically significant” results are much more likely to get published than those that don’t, even if their P-values are almost the same. This results in a biased literature. It also encourages harmful research practices that push the P-value to the desired side of the threshold.

Because the P-value is a continuous measure of evidence, McShane says, a P-value just above the 0.05 threshold is essentially identical to one just below it. But it is even trickier than that, he says. In addition to being continuous, P-values naturally vary a great deal from study to study. Therefore, an initial study with a P-value of 0.005 and a replication study with 0.19 are entirely compatible with one another—despite the first P-value being far below the 0.05 threshold and the second one far above it.
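A small simulation makes this variability tangible. The sketch below, which is illustrative and not taken from the paper, repeatedly replicates the same hypothetical study with the same true effect and records the P-value each time:

```python
# Simulating exact replications of one hypothetical study (illustrative only).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_diff, sd, n = 0.7, 3.0, 100  # assumed effect size, spread, and sample size

p_values = np.array([
    stats.ttest_ind(rng.normal(true_diff, sd, n), rng.normal(0.0, sd, n)).pvalue
    for _ in range(1000)
])

# Identical studies land on both sides of 0.05 all the time, so P-values such
# as 0.005 and 0.19 are entirely compatible with the same underlying effect.
print(f"Share of replications below 0.05: {np.mean(p_values < 0.05):.2f}")
print(f"Middle 80% of P-values: {np.percentile(p_values, 10):.4f} to "
      f"{np.percentile(p_values, 90):.3f}")
```

With these assumed numbers, a meaningful share of replications falls on each side of the threshold even though the underlying effect never changes.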

Yet his previous work with Gal found that most researchers adhere blindly to the arbitrary 0.05 “statistical significance” threshold, treating the results as black and white rather than continuous.

Like human, like AI

McShane and his colleagues investigated whether the AI models ChatGPT, Gemini, and Claude, like humans, rigidly adhere to the 0.05 “statistical significance” threshold when interpreting statistical results. To find out, they asked the models to interpret the outcomes of three different hypothetical experiments.

“As with people, this ‘dichotomania’ seems deeply embedded in the way the AI models respond.”

Blake McShane

The first was about survival rates among terminal-cancer patients. The patients in this experiment were assigned to one of two groups: Group A, where they wrote daily about positive things they were blessed with, or Group B, where they wrote daily about the misfortunes of others. The results of this experiment were that, on average, patients in Group A lived for 8.2 months after their initial diagnosis, compared with 7.5 months for patients in Group B.

After presenting this information to the AI models, the researchers asked them which of the following four options provided the most-accurate summary of the results: on average, patients in Group A lived longer post-diagnosis than those in Group B; on average, patients in Group B lived longer post-diagnosis than those in Group A; the number of months lived post-diagnosis was no different between the two groups; or it cannot be determined which group lived longer. They asked each AI model to answer this question but varied the P-value comparing the two groups from a “statistically significant” 0.049 to a trivially different but “statistically nonsignificant” 0.051.
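A rough sketch of this kind of prompt manipulation, not the authors’ actual code, might look like the following; it assumes the OpenAI Python SDK, an API key in the environment, and an illustrative model name:

```python
# Hedged sketch of varying only the P-value in an otherwise identical prompt.
# Assumes the OpenAI Python SDK with OPENAI_API_KEY set; the model name is
# illustrative and not necessarily one of the models tested in the study.
from openai import OpenAI

client = OpenAI()

TEMPLATE = (
    "Patients in Group A lived 8.2 months post-diagnosis on average; patients "
    "in Group B lived 7.5 months. The P-value comparing the groups is {p}. "
    "Which is the most accurate summary: (a) Group A lived longer on average, "
    "(b) Group B lived longer on average, (c) no difference between groups, "
    "or (d) it cannot be determined?"
)

for p in (0.049, 0.051):  # trivially different, but on opposite sides of 0.05
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model name
        messages=[{"role": "user", "content": TEMPLATE.format(p=p)}],
    )
    print(f"P = {p}:\n{response.choices[0].message.content}\n")
```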

There was a clear division in how the AI models responded depending on the P-value: they nearly always responded that Group A lived longer when it was 0.049 (“statistically significant”) but did so much less often when it was 0.051 (“statistically nonsignificant”).

“The responses differed when the 0.05 threshold was crossed,” McShane says. “A tiny change in the input resulted in a big change in the output.”

The researchers encountered the same outcome for the two other hypothetical experiments. For example, in one about drug efficacy—where the results for Drug A were more promising than those for Drug B—they asked the AI models whether a patient would be more likely to recover if given Drug A or Drug B. The AI models almost always answered Drug A when the P-value was 0.049 but very seldom when it was 0.051.

In all of these experiments, the outcomes closely mirrored what happened when academic researchers answered the same questions in prior studies. Where the P-value stood relative to the 0.05 “statistical significance” threshold consistently played a key role in shaping how both humans and the AI models interpreted the results.

The AI models even invoked “statistical significance” in the absence of a P-value. “We conducted some trials where we didn’t give a P-value at all, and the responses would nonetheless still emphasize ‘statistical significance,’” McShane says. “As with people, this ‘dichotomania’ seems deeply embedded in the way the AI models respond.”

A word of warning

The researchers expanded the study by adding to the prompts explicit guidance from the American Statistical Association warning against relying on P-value thresholds when interpreting quantitative results. Despite this guidance, the AI models still responded dichotomously, answering one way when the P-value was 0.049 and another when it was 0.051.

Even the more-powerful and more-recent AI models were susceptible to this. OpenAI, for example, released a new version of the model behind ChatGPT while McShane and his colleagues were conducting this work, one designed to break down problems into smaller components and iteratively reason through answers. This updated model responded even more dichotomously than the older ones.

“I can’t conclusively say why that is, but if I were to speculate, perhaps it is because these newer and bigger models more effectively mimic human responses,” McShane says. “If that’s the case, then the closer that these AI models get to generating text that looks like human-generated text, the more their responses should fall into traps that humans fall into, whether around ‘statistical significance’ as in our research or presumably more broadly as well.”

For McShane, these results raise red flags as people in academia and other industries grant AI greater autonomy across more dimensions of their work. He notes that researchers are already using AI to summarize papers, conduct literature reviews, perform statistical analyses, and even pursue novel scientific discoveries. And yet every model that he and his coauthors tested showed a systematic inability to properly interpret basic statistical results, a seemingly necessary precondition, McShane says, for all this other work.

“People are asking AI models to do things that are much more complicated than the basic little multiple-choice questions that we asked,” he says, “but if they perform so erratically on our questions, it raises doubts about their capability for these much more-ambitious tasks.”

About the Writer

Dylan Walsh is a freelance writer based in Chicago.

About the Research

McShane, Blakeley B., David Gal, and Adam Duhachek. 2025. “Artificial Intelligence and Dichotomania.” Judgment and Decision Making.

