On a scale of one to ten, how surprised would you be to learn that a common professional evaluation tool is biased against women? Given the many inequalities in the modern workplace—from pay disparity to freezing-cold conference rooms—some people might rate their disbelief at yet another inequity at a zero.
The culprit this time? The ten-point rating scale itself.
Numeric performance ratings of employees are “staples of the modern workplace,” says Kellogg’s Lauren Rivera. In some industries, such as consulting, it’s common for employees to be rated after every project they complete. And the stakes are high. How a worker measures up can influence short-term compensation decisions and long-term career trajectories.
Previous studies show that, in general, men have a leg up in performance evaluations and are consistently deemed more able, likeable, and worthy than women, even when their work is identical. But no one had examined whether numeric rating systems themselves—what Rivera calls “the architecture of evaluation”—might be contributing to the problem. So Rivera and coauthor András Tilcsik of the University of Toronto decided to explore the question.
They found that the rating system can be biased against women. The ten-point scale places women, especially those in male-dominated fields, at a significant disadvantage. But, crucially, that disadvantage vanishes when men and women are evaluated on a six-point scale.
Although we think such systems are objective, Rivera says, “they’re not neutral instruments at all.” For any employer that uses numeric evaluations, “the rating system you choose matters, so choose it wisely.”
Understanding When Gender Bias Creeps into Performance Evaluations
Rivera and Tilcsik had an unexpected bit of luck when they began their research: they learned that a professional school at a North American university was planning to switch its instructor evaluation system from a ten-point scale to a six-point scale.
The school’s decision had nothing to do with gender. Administrators suspected students were mentally converting the ten-point scale into percentages and letter-grade scores, making them hesitant to give their instructors ratings below a seven. The school’s leaders theorized that a different scale might yield more varied and accurate results.
The switch created an ideal natural experiment. Rivera and Tilcsik could compare how the same instructors teaching the same courses fared under different numerical rating systems. They weren’t sure what they would find.
Because of stereotypes that associate men, but not women, with brilliance and excellence, it’s more difficult for women to get the top rating on any evaluation. For that reason, it seemed possible that scales with fewer points might ultimately disadvantage women, since a five on a six-point scale is a lower assessment than a nine on a ten-point scale. Yet the researchers also wondered if the ubiquity of the ten-point scale, and the strong cultural association of the number ten with excellence, might make the ten-point scale an especially unbalanced instrument.
The researchers collected course evaluations for 29 academic terms: 20 before the switch from a ten- to a six-point scale, and nine after that change. The sample included 105,304 ratings of 369 instructors. The researchers also identified four areas of study that were particularly male dominated, in which women made up less than 15 percent of instructors.
They were heartened to discover that women teaching in non-male-dominated fields were evaluated about the same as men under both rating systems. The average rating and distribution of ratings for men and women were nearly identical under the ten-point system, they found. And switching to a six-point scale did not affect the frequency or distribution of ratings in these fields, nor did it affect the likelihood of women receiving a perfect rating.
But it was an entirely different story for women in male-dominated fields.
In these areas, 31.4 percent of male instructors’ ratings were perfect tens, but only 19.5 percent of female instructors’ ratings were tens. In fact, for men in male-dominated fields, a ten was the most common rating. Women’s most common rating was an eight. Male instructors had an average rating of 8.2; for women, the average rating was half a point lower, at 7.7.
Yet these differences vanished with the six-point rating scale. Male and female instructors received a perfect score of six at almost the same frequency: 41.2 percent for men and 41.7 percent for women. The gap in men’s and women’s average ratings narrowed too: 4.91 for men and 5.01 for women.
This came as a surprise to Rivera and Tilcsik. “I was expecting the gap between men and women to narrow, but it was pretty striking that it eliminated the gender gap in ratings,” Rivera says.
They were especially struck by how much of a difference it made for women being evaluated in male-dominated fields. “What that suggested to us is that potentially the real opportunity for intervention is in those stereotypically male-dominated arenas, as opposed to ones that might be more gender mixed,” Rivera says.
Why “John” Is Brilliant, but “Julie” Is Just Smart
Although Rivera and Tilcsik were intrigued by the result of the natural experiment, they wanted to make sure it wasn’t limited to the one school they studied. So they conducted a complementary online survey.
They recruited 400 professional-school students from across the United States, who all read the same lecture about the social and economic implications of technological change. Some participants were told that the lecture was delivered by Professor John Anderson, while the rest were told it was from Professor Julie Anderson. Then, participants were asked to rate the instructor on either a ten- or six-point scale, and to list the words that first came to mind when they thought of the instructor’s performance.
The results echoed what the researchers saw in the natural experiment. Under the ten-point scale, Professor John received an average rating of 7.8, while Professor Julie’s average was 7.1. Once again, the gap in average ratings shrank when the instructors were rated on a six-point scale: 4.9 for John and 4.8 for Julie.
And, just as the researchers saw in the natural experiment, participants were more willing to give the female instructor the top rating on the six-point scale than on the ten-point scale. Julie received ten out of ten in only 13 percent of cases, while John got a perfect ten 22 percent of the time. But they got perfect sixes at nearly equal rates: 25 percent for John and 24 percent for Julie.
Still, there were noticeable differences in how participants described the professors. Superlatives like “brilliant,” “genius,” and “amazing” were applied much more frequently to John than to Julie, whose teaching was more often characterized by participants as simply good.
Rivera and Tilcsik also saw that participants held different expectations for a “perfect ten” and a “perfect six”: among participants who gave a ten-out-of-ten rating, 54.2 percent used superlative language to describe the professor. Among participants who gave a perfect six, only 28.6 percent used such language.
The Not-So-Perfect Ten
So, why does the ten-point scale disadvantage women? Rivera thinks the loaded cultural language of the “perfect ten” may be partly to blame.
“The number ten carries this cultural connotation of perfection,” she says. “Research shows that, due to gender stereotypes of competence, we just don’t think women are perfect. We are more likely to scrutinize women and their performance.”
This difference helps explain why the effect of the ten-point scale was so pronounced in male-dominated fields, where stereotypes of male brilliance are especially strong. When people imagine the standouts in a stereotypically male field—the perfect tens—the figures that come to mind most readily are men. It’s much easier, then, for raters to associate a man than a woman with this preconceived idea of excellence.
Of course, there’s nothing magical about six-point scales in and of themselves. But Rivera believes the change from ten to six could be useful in any field that uses numeric evaluations. It can act as a “bias interrupter”: by removing the familiar and culturally fraught concept of the “perfect ten,” and creating a more neutral mindset for raters, “we’re finding a way to turn down the volume on gender stereotypes,” she says.
Yet moving away from ten-point scales in performance evaluations isn’t a panacea, Rivera points out.
“We should work on changing the images that people get from a very young age, how we structure work, how we recognize others, the messages we see in the media,” she says. But while we undertake the long, slow work of cultural change, it’s important to put small, meaningful changes into place. “I think interventions such as this have a lot of power to reduce the effect of biased evaluations on people’s career opportunities.”
About the Writer
Susie Allen is a freelance writer in Chicago.
About the Research
Rivera, Lauren A., and András Tilcsik. 2019. “Scaling Down Inequality: Rating Scales, Gender Bias, and the Architecture of Evaluation.” American Sociological Review. 84(2): 248–274. https://journals.sagepub.com/doi/10.1177/0003122419833601