Today a digital advertising campaign can reach a potential customer as she skims the news on her morning train ride, takes a break from emailing at her desk, scours restaurant reviews over cocktails, and admires a friend’s vacation photos after dessert.

A proliferation of digital platforms and devices makes this kind of campaign possible. But it also makes measuring the campaign’s success challenging.

The gold standard for measuring an advertisement’s “lift” (that is, its direct effect on a user’s probability of converting) is a true experiment, or “randomized controlled trial” (RCT), as data scientists call it. But Florian Zettelmeyer, a professor of marketing at the Kellogg School, explains, “RCTs are hard to run. They can be expensive to run. They can take a lot of coordination to run.” So many marketers rely on a litany of alternative methods that are easier to implement and can often draw conclusions from data that have already been collected.

Just how well do these alternative methods work?

Zettelmeyer and Brett Gordon, an associate professor of marketing at Kellogg, recently coauthored a whitepaper with Facebook researchers Neha Bhargava and Dan Chapsky in an effort to find out. The upshot: even across a single platform, and using the exact same advertising studies, these alternative methods tend to be inaccurate—sometimes wildly so.

Benefits of a True Experiment

To accurately measure the lift of an ad placed on Facebook, Google, or another digital platform, it is not enough for marketers to calculate how likely it is that someone who sees the ad will “convert,” the industry’s lingo for an event the advertiser cares about, such as a purchase, a registration, or a page visit. They must also determine whether the conversion happened because of the ad, that is, whether the ad caused the conversion. But causality is surprisingly difficult to pin down. A perfect test would require the impossible: two parallel worlds, identical except that in the first world someone sees an ad and in the second that same person sees no ad.

Because parallel worlds remain stubbornly unavailable to researchers, the next best thing is an RCT, where individuals are randomly divided into a treatment group, which sees the ads, and a control group, which does not. Randomization ensures that the two groups do not differ in important ways, like demographics, lifestyle, or personality.
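
To make the arithmetic concrete, here is a minimal sketch of how lift might be computed from the two groups of an RCT. It assumes lift is expressed as the relative increase of the treatment group's conversion rate over the control group's, which is one common convention and not necessarily the exact definition used in the whitepaper. The function names and the user counts are hypothetical, chosen only for illustration.

```python
def conversion_rate(conversions: int, users: int) -> float:
    """Share of users in a group who converted."""
    if users <= 0:
        raise ValueError("users must be positive")
    return conversions / users


def relative_lift(treated_rate: float, control_rate: float) -> float:
    """Relative increase in conversion rate attributable to the ad.

    Assumes randomization made the two groups comparable, so the control
    group's rate stands in for what the treated group would have done
    without the ad.
    """
    if control_rate <= 0:
        raise ValueError("control_rate must be positive")
    return (treated_rate - control_rate) / control_rate


# Hypothetical counts, not taken from the study.
treated_rate = conversion_rate(conversions=1_200, users=100_000)
control_rate = conversion_rate(conversions=1_000, users=100_000)
lift = relative_lift(treated_rate, control_rate)

print(f"treated: {treated_rate:.2%}, control: {control_rate:.2%}, lift: {lift:.1%}")
```

Because assignment to the two groups is random, the difference between the rates can be read as the effect of the ad itself instead of a difference between the kinds of people in each group.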

And yet, for a variety of reasons, many marketers do not use RCTs.

For one, until recently many platforms did not offer the capability (and some still do not). RCTs can be time-consuming to implement correctly, requiring hours of additional work from engineers and data scientists without necessarily generating any additional income. Nor is it obvious that improved accuracy would work in a given platform’s favor, as advertising is not always particularly effective. “A lot of people in the industry simply aren’t incentivized to make sure you get the right estimate,” says Gordon.

In addition, many businesses are already convinced of the effectiveness of their campaigns. “If you believe your ads work, then running an experiment looks like you are wasting money,” says Zettelmeyer.

Finally, there is a broad consensus among businesses that less costly observational methods work—if not perfectly, at least well enough. These methods offer workarounds for not having a properly randomized control group, like matching two groups of users across a variety of demographic characteristics, or comparing the same group before and after a campaign.

Do they work well enough? It is this assumption that the authors put to the test.

On behalf of clients, Facebook conducted RCTs to measure the effectiveness of twelve different advertising campaigns, each of which ran in the United States beginning in January 2015. The campaigns were large, involving over a million users each (for a total of 1.4 billion impressions), and spanned a variety of companies and industries. The Kellogg and Facebook researchers analyzed these campaigns.

Because Facebook requires users to log in across browsers and devices, the authors were able to reliably follow a person’s journey from ad to purchase, even if they were moving between phone and computer in the process. Additionally, the authors were able to use anonymized demographic information in their estimations.

“They can track the two key things we care about,” explains Gordon, “which is when you get exposed to an ad, and when you convert on any of the devices.”

Powerful Forces Working Against Observational Methods

With accurate measurements from the RCTs in hand, the authors then tested a variety of observational methods to see how they stacked up.

The most straightforward observational method consists of simply comparing the conversion rates of those who see an ad and those who do not. But unfortunately, these two groups tend to differ in ways that go beyond advertising. For instance, users who rarely log into Facebook are less likely to be shown an ad—and they are probably also less likely to make online purchases, perhaps because they are less likely to be online in the first place.

“Even though the ad did nothing, the person who saw the ad is going to look like they purchased more than the person who didn’t see the ad,” says Zettelmeyer.
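
The effect Zettelmeyer describes can be reproduced with a small calculation. The sketch below assumes a population made up of two hypothetical segments, frequent and infrequent users, where frequent users are both more likely to be shown the ad and more likely to buy regardless of the ad. The ad is given zero true effect. All shares, probabilities, and rates are invented for illustration and are not figures from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    share: float            # fraction of the whole population
    exposure_prob: float    # chance a user in this segment is shown the ad
    conversion_rate: float  # same with or without the ad: zero true effect


def naive_rates(segments: list[Segment]) -> tuple[float, float]:
    """Conversion rates of exposed and unexposed users, pooled across segments."""
    exposed_mass = sum(s.share * s.exposure_prob for s in segments)
    unexposed_mass = sum(s.share * (1 - s.exposure_prob) for s in segments)
    exposed_conv = sum(
        s.share * s.exposure_prob * s.conversion_rate for s in segments
    )
    unexposed_conv = sum(
        s.share * (1 - s.exposure_prob) * s.conversion_rate for s in segments
    )
    return exposed_conv / exposed_mass, unexposed_conv / unexposed_mass


# Hypothetical population: frequent users see more ads AND buy more anyway.
segments = [
    Segment(share=0.5, exposure_prob=0.8, conversion_rate=0.04),  # frequent
    Segment(share=0.5, exposure_prob=0.2, conversion_rate=0.01),  # infrequent
]

exposed_rate, unexposed_rate = naive_rates(segments)
naive_lift = exposed_rate / unexposed_rate - 1

print(f"exposed: {exposed_rate:.2%}, unexposed: {unexposed_rate:.2%}")
print(f"naive lift: {naive_lift:.1%} (true lift is 0%)")
```

With these made-up numbers the exposed group converts at more than twice the rate of the unexposed group even though the ad changes nobody's behavior. The entire gap comes from who ends up in each group.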

Moreover, advertisers put an enormous amount of effort into ensuring that ads are targeted to the people most likely to respond to them. An advertiser might initially target women between the ages of 18 and 49, but if Facebook’s ad-targeting algorithm learns that conversion rates are higher for younger women, it will fine-tune the target audience to get the most bang for the client’s buck. This further muddles efforts to discern causality when not using an RCT: Did seeing the ad make people buy, or were the people who buy simply more likely to see the ad?

“There are really, really powerful forces that make these two [exposed and unexposed] groups not the same,” says Zettelmeyer.

Indeed, the authors found that this comparison tended to wildly inflate lift: measuring it at 416%, for instance, when an RCT suggested a lift closer to 77%.

Other observational methods attempt to counteract these powerful forces: comparing the conversion rates of exposed and unexposed users only after the groups have been matched for a variety of traits, using sophisticated “propensity scoring” to adjust for differences between the groups, conducting matched-market tests, or comparing conversion rates for the same group of users before and after a campaign.

But when put to the test, no clear winner emerged. Moreover, not a single observational method performed reliably well.

“Sometimes, in some studies, they do pretty well,” says Gordon. “In other studies they don’t just perform a little bit badly, they do horribly.”

In fact, just how poorly the observational methods fared was a surprise even to the authors. This was particularly true for the promising matched-market test, where large, demographically similar geographic markets are paired up, and one is randomly assigned to be targeted in a campaign, while the other is held as a control. In a sense, the matched-market test is a true experiment, but at the level of the market, instead of the individual.

And yet it produced results that were overly dependent on which matched market ended up in which condition.
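
One plausible way such dependence can arise, offered here as an assumption and not as the authors' explanation, is that even well-matched markets differ slightly in their baseline conversion rates. With only one market per condition, that difference is folded directly into the estimate. The sketch below uses two hypothetical markets and an invented true lift of 10% to show how the estimate changes with the assignment.

```python
def estimated_lift(
    treated_baseline: float, control_baseline: float, true_lift: float
) -> float:
    """Lift a matched-market comparison would report for one assignment.

    The treated market's observed rate is its baseline scaled by the true
    lift; the control market's observed rate is just its baseline.
    """
    treated_rate = treated_baseline * (1 + true_lift)
    return treated_rate / control_baseline - 1


# Hypothetical paired markets with slightly different baseline rates.
baseline_a = 0.020
baseline_b = 0.022
true_lift = 0.10  # invented for illustration

lift_if_a_treated = estimated_lift(baseline_a, baseline_b, true_lift)
lift_if_b_treated = estimated_lift(baseline_b, baseline_a, true_lift)

print(f"A treated, B control: {lift_if_a_treated:.1%}")
print(f"B treated, A control: {lift_if_b_treated:.1%}")
```

In this toy case the same campaign appears to have either no effect or roughly double its true effect, depending only on which market was assigned to treatment. Randomizing across many individuals averages such differences out; randomizing across a single pair of markets does not.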

“The degree of variation was stunning,” says Zettelmeyer.

No Good Substitutes

The results suggest that, in the absence of an RCT, it is difficult to determine an ad’s lift with any degree of accuracy. It is also nearly impossible to predict in advance just how inaccurate a particular observational technique will be. This means that companies cannot get away with running a single RCT, determining how much their favorite way of measuring is “off”—by a factor of two, for instance—and then adjusting their future measurements by that amount.

“Dividing by two is almost certainly wrong over time and across studies,” says Zettelmeyer. “It’s not constant. It’s not a good rule-of-thumb.”

The results also highlight the challenge of finding an appropriate control group outside of the context of a true experiment—a takeaway that applies beyond marketing.

As academic researchers, “we’re used to making lots of assumptions when we create models,” says Gordon. “What we’re not always forced to do is to really think long and hard about whether the assumptions are actually correct or not.”

The authors acknowledge that observational methods are not going anywhere. Nor should they, as they work well in many settings outside of advertising. Their sheer convenience allows them to provide data scientists with larger and richer datasets than RCTs are likely to provide anytime soon.

“It just happens to be that, despite our best efforts at this point, we can’t say that for advertising measurement these methods actually work well as a substitute for running RCTs,” says Zettelmeyer.

His takeaway for marketers? “Look, we understand that in many cases you can’t run an RCT. But please, if you can run it, for heaven’s sake, do.”