Say you want to run an A/B test to compare the effectiveness of two drugs or two advertising campaigns. One of the first things you will have to determine is how big your study needs to be. What is the right sample size?
“The basic idea is if you have too few subjects—if your sample size is too small—statistically you won’t be able to detect anything. You’ve wasted resources by running the study in the first place because it never had any chance to find anything,” says Blakeley McShane, an associate professor of marketing at the Kellogg School. “On the other hand, let’s say you’ve got millions of subjects. Well, then you’ve probably wasted resources because you could have learned the answer to your question with many fewer subjects.”
So how can you determine an appropriate sample size? It depends on how big an effect you expect to see. If one drug or campaign is vastly more effective than another, you may need only about twenty people to determine that. But if the differences you are testing are small (like two ads with only minor differences in font size or spacing), then it may take hundreds of participants for a convincing effect to emerge.
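To make the link between effect size and sample size concrete, here is a minimal sketch of the textbook normal-approximation formula for comparing two group means. It is an illustration only, not something taken from the research described here. The function name `n_per_group`, the standardized effect size (difference in means divided by the standard deviation), and the defaults of a 5% two-sided significance level and 80% power are all assumptions made for the example.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate subjects needed in EACH group of a two-group comparison.

    effect_size is an assumed standardized difference in means
    (difference divided by the common standard deviation).
    Uses the normal approximation n = 2 * ((z_alpha + z_power) / d) ** 2.
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value of the two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the target power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


if __name__ == "__main__":
    # Hypothetical effect sizes, from large to small
    for d in (0.8, 0.5, 0.2):
        print(f"effect size {d}: about {n_per_group(d)} per group")
```

Under these assumptions, a large effect of 0.8 calls for about 25 subjects per group, while a small effect of 0.2 calls for nearly 400 per group. Because the required sample grows with the inverse square of the effect size, halving the assumed effect roughly quadruples the study.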
Standard statistical techniques for determining sample size require researchers to know or assume a value for the effect size before running the study. But this, says McShane, can be problematic. “If I knew that this ad were more effective than that ad—let alone exactly how much more effective—then I wouldn’t need to do the study in the first place,” he says.
In practice, people estimate the effect size by looking at the results from any previous, related studies that happen to be available. However, these studies can yield only an approximation of the effect size (because they too had finite sample sizes).
Thus, in recent research with Ulf Böckenholt, a Kellogg School marketing professor, McShane developed a way to calculate a sample size that also takes into account the uncertainty involved in this approximation. “The kind of information we use to quantify that uncertainty is the same idea as a margin of error in a poll,” he explains. The new technique is a particular improvement when effect sizes are small, as is often the case in online A/B tests, such as tweaks to a search algorithm.
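The article does not spell out the researchers' formula, so the following is not their method. It is a generic sketch of one common way to fold that kind of uncertainty into a sample size: instead of treating the estimated effect as exact, average the study's power over a normal distribution centered on the estimate, with a spread equal to its standard error (the "margin of error" idea). The names `expected_power`, `n_with_uncertainty`, `d_hat`, and `se` are hypothetical, and the sketch assumes a positive estimated effect and a two-group comparison of means.

```python
import math
from statistics import NormalDist

Z = NormalDist()


def expected_power(n, d_hat, se, alpha=0.05):
    """Power averaged over uncertainty in the effect size.

    Assumes the true standardized effect is normally distributed around
    the estimate d_hat with standard deviation se, and that n is the
    number of subjects per group. Only rejections in the expected
    (positive) direction are counted.
    """
    z_alpha = Z.inv_cdf(1 - alpha / 2)
    scale = math.sqrt(n / 2)
    return Z.cdf((d_hat * scale - z_alpha) / math.sqrt(1 + (se * scale) ** 2))


def n_with_uncertainty(d_hat, se, alpha=0.05, power=0.80, n_max=10_000_000):
    """Smallest per-group n whose expected power reaches the target.

    Returns None when the estimate is so uncertain that no sample size
    up to n_max reaches the target. Requires d_hat > 0, which makes
    expected power increase with n so a binary search is valid.
    """
    if d_hat <= 0:
        raise ValueError("d_hat must be positive")
    if expected_power(n_max, d_hat, se, alpha) < power:
        return None
    lo, hi = 2, n_max
    while lo < hi:
        mid = (lo + hi) // 2
        if expected_power(mid, d_hat, se, alpha) >= power:
            hi = mid
        else:
            lo = mid + 1
    return lo


if __name__ == "__main__":
    # Hypothetical prior estimate of 0.5 with increasing uncertainty
    for se in (0.0, 0.1, 0.2):
        print(f"se = {se}: n per group = {n_with_uncertainty(0.5, se)}")
```

With `se` set to zero the sketch reduces to the standard calculation, and the required sample grows as the prior estimate becomes less certain. If the estimate is very noisy, no sample size may reach the target, which the function signals by returning `None`.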
Researchers are welcome to determine their own sample size using this calculator.
McShane and Böckenholt have also built a second calculator. This one helps researchers deal with another troublesome assumption—that if you run an experiment multiple times, every one of those experiments studies the same underlying effect size.
“The problem with this is that no two studies in behavioral research are ever exactly the same,” says McShane. “Maybe you designed the study somewhat differently; maybe you’re testing it on a different population of subjects. These can all lead to differences in the effect size under study and thus the sample size required.” Interested parties can find this other calculator here.
The researchers created these tools with their fellow academics in mind. But McShane says that anyone who runs studies that use human participants might find them useful.
“There’s variability out there in the world,” he says. “To get reliable results, we want to account for as many of those sources of variability as we possibly can. This includes ones we understand and can measure and control for. However, it also includes those we don’t understand but can still quantify.”