Is Your Digital-Advertising Campaign Working?
Data Analytics Marketing Mar 11, 2016

If you are not running a randomized controlled experiment, you probably don’t know.

Based on the research of

Brett Gordon

Florian Zettelmeyer

Neha Bhargava

Dan Chapsky

Today a digital advertising campaign can reach a potential customer as she skims the news on her morning train ride, takes a break from emailing at her desk, scours restaurant reviews over cocktails, and admires a friend’s vacation photos after dessert.

A proliferation of digital platforms and devices makes this kind of campaign possible. But it also makes measuring the campaign’s success challenging.

The gold standard for measuring an advertisement’s “lift” (that is, its direct effect on a user’s probability of converting) is a true experiment, or “randomized controlled trial” (RCT), as data scientists call it. But Florian Zettelmeyer, a professor of marketing at the Kellogg School, explains, “RCTs are hard to run. They can be expensive to run. They can take a lot of coordination to run.” As a result, many marketers rely on a range of alternative methods that are easier to implement and can often draw conclusions from data that have already been collected.

Just how well do these alternative methods work?

Zettelmeyer and Brett Gordon, an associate professor of marketing at Kellogg, recently coauthored a whitepaper with Facebook researchers Neha Bhargava and Dan Chapsky in an effort to find out. The upshot: even across a single platform, and using the exact same advertising studies, these alternative methods tend to be inaccurate—sometimes wildly so.

Benefits of a True Experiment

To accurately measure the lift of an ad placed on Facebook, Google, or another digital platform, it is not enough for marketers to calculate how likely it is that someone who sees the ad will “convert,” the industry’s lingo for an event the advertiser cares about, such as a purchase, a registration, or a page visit. They must also determine whether the conversion happened because of the ad, that is, whether the ad caused the conversion. But causality is surprisingly difficult to pin down. A perfect test would require the impossible: two parallel worlds, identical except that in the first world someone sees an ad and in the second that same person sees no ad.

Because parallel worlds remain stubbornly unavailable to researchers, the next best thing is an RCT, where individuals are randomly divided into a treatment group, which sees the ads, and a control group, which does not. Randomization ensures that the two groups do not differ systematically in important ways, like demographics, lifestyle, or personality.
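
The mechanics of such a trial can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' procedure: the function names, the 50/50 split, and the conversion counts are all assumptions made up for the example, and lift is taken here to mean the relative difference in conversion rates between the two groups.

```python
import random


def assign_groups(user_ids, treatment_share=0.5, seed=0):
    """Randomly split users into a treatment group and a control group."""
    rng = random.Random(seed)
    shuffled = list(user_ids)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * treatment_share)
    return shuffled[:cut], shuffled[cut:]


def relative_lift(treated_conversions, treated_users,
                  control_conversions, control_users):
    """Relative difference between treatment and control conversion rates."""
    treated_rate = treated_conversions / treated_users
    control_rate = control_conversions / control_users
    return (treated_rate - control_rate) / control_rate


# 1,000 hypothetical user IDs, split at random.
treatment, control = assign_groups(range(1000))

# Hypothetical outcomes: 30 of 500 treated users convert, 20 of 500 controls.
lift = relative_lift(30, 500, 20, 500)
print(f"Treatment: {len(treatment)} users, control: {len(control)} users")
print(f"Estimated lift: {lift:.0%}")
```

Because chance alone decides who lands in which group, any gap in conversion rates beyond sampling noise can be attributed to the ads. A real analysis would also report the uncertainty around the estimate, which this sketch leaves out.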

And yet, for a variety of reasons, many marketers do not use RCTs.

For one, until recently many platforms did not offer the capability (and some still do not). RCTs can be time-consuming to implement correctly, requiring hours of additional work from engineers and data scientists without necessarily generating any additional income. Nor is it obvious that improved accuracy would work in a given platform’s favor, as advertising is not always particularly effective. “A lot of people in the industry simply aren’t incentivized to make sure you get the right estimate,” says Gordon.

In addition, many businesses are already convinced of the effectiveness of their campaigns. “If you believe your ads work, then running an experiment looks like you are wasting money,” says Zettelmeyer.

Finally, there is a broad consensus among businesses that less costly observational methods work—if not perfectly, at least well enough. These methods offer workarounds for not having a properly randomized control group, like matching two groups of users across a variety of demographic characteristics, or comparing the same group before and after a campaign.

Do they work well enough? It is this assumption that the authors put to the test.

On behalf of clients, Facebook conducted RCTs to measure the effectiveness of twelve different advertising campaigns, each of which ran in the United States beginning in January 2015. The campaigns were large, involving over a million users each (for a total of 1.4 billion impressions), and spanned a variety of companies and industries. The Kellogg and Facebook researchers analyzed these campaigns.

Because Facebook requires users to log in across browsers and devices, the authors were able to reliably follow a person’s journey from ad to purchase, even if that person moved between phone and computer in the process. Additionally, the authors were able to use anonymized demographic information in their estimations.

“They can track the two key things we care about,” explains Gordon, “which is when you get exposed to an ad, and when you convert on any of the devices.”

Powerful Forces Working Against Observational Methods

With accurate measurements from the RCTs in hand, the authors then tested a variety of observational methods to see how they stacked up.

The most straightforward observational method consists of simply comparing the conversion rates of those who see an ad and those who do not. But unfortunately, these two groups tend to differ in ways that go beyond advertising. For instance, users who rarely log into Facebook are less likely to be shown an ad—and they are probably also less likely to make online purchases, perhaps because they are less likely to be online in the first place.

“Even though the ad did nothing, the person who saw the ad is going to look like they purchased more than the person who didn’t see the ad,” says Zettelmeyer.
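
A small calculation shows how this can happen. The sketch below assumes a made-up population with two equally sized segments that differ only in how much time they spend online. The ad is assumed to have no effect at all, and every number is hypothetical.

```python
# Hypothetical population. The ad is assumed to have NO effect on anyone:
# each segment converts at the same rate whether or not it sees the ad.
segments = {
    # name: (share of users, probability of seeing the ad, conversion rate)
    "frequent": (0.5, 0.8, 0.05),
    "infrequent": (0.5, 0.2, 0.01),
}


def conversion_rate(segments, exposed):
    """Expected conversion rate among exposed (or unexposed) users."""
    users = 0.0
    converters = 0.0
    for share, p_seen, rate in segments.values():
        weight = share * (p_seen if exposed else 1 - p_seen)
        users += weight
        converters += weight * rate
    return converters / users


exposed_rate = conversion_rate(segments, exposed=True)
unexposed_rate = conversion_rate(segments, exposed=False)
naive_lift = exposed_rate / unexposed_rate - 1

print(f"Exposed: {exposed_rate:.1%}, unexposed: {unexposed_rate:.1%}")
print(f"Naive lift: {naive_lift:.0%} (true lift: 0%)")
```

Under these assumptions the exposed group converts at 4.2 percent and the unexposed group at 1.8 percent, so the naive comparison reports a lift of about 133 percent for an ad that did nothing. The whole gap comes from who ends up seeing the ad.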

Moreover, advertisers put an enormous amount of effort into ensuring that ads are targeted to the people most likely to respond to them. An advertiser might initially target women between the ages of 18 and 49, but if Facebook’s ad-targeting algorithm learns that conversion rates are higher for younger women, it will fine-tune the target audience to get the most bang for the advertiser’s buck. This further muddles efforts to discern causality when not using an RCT: Did seeing the ad make people buy, or were the people who buy simply more likely to see the ad?

“There are really, really powerful forces that make these two [exposed and unexposed] groups not the same,” says Zettelmeyer.

Indeed, the authors found that this comparison tended to wildly inflate lift: measuring it at 416%, for instance, when an RCT suggested a lift closer to 77%.
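
To put the size of that gap in perspective, the arithmetic on the two figures quoted above can be written out as follows. Only the 416% and 77% values come from the study as reported here. The variable names are illustrative.

```python
observational_lift = 4.16  # 416%, the exposed-versus-unexposed estimate
rct_lift = 0.77            # 77%, the experimental benchmark

overstatement_ratio = observational_lift / rct_lift
gap_in_points = (observational_lift - rct_lift) * 100

print(f"Overstated by a factor of {overstatement_ratio:.1f}, "
      f"or {gap_in_points:.0f} percentage points")
```

In this one example the observational estimate is roughly 5.4 times the experimental one.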

Other observational methods attempt to counteract these powerful forces: comparing the conversion rates of exposed and unexposed users only after the groups have been matched for a variety of traits, using sophisticated “propensity scoring” to adjust for differences between the groups, conducting matched-market tests, or comparing conversion rates for the same group of users before and after a campaign.
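
The simplest version of the matching idea can be sketched as exact stratification on a single observed trait. This is a simplified stand-in for the methods listed above, not the propensity-score models the authors tested. The `Stratum` class, the function names, and all counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Stratum:
    """Counts for one group of users who share an observed trait."""
    exposed_users: int
    exposed_conversions: int
    unexposed_users: int
    unexposed_conversions: int


def naive_lift(strata):
    """Compare all exposed users with all unexposed users, ignoring traits."""
    exposed_rate = (sum(s.exposed_conversions for s in strata)
                    / sum(s.exposed_users for s in strata))
    unexposed_rate = (sum(s.unexposed_conversions for s in strata)
                      / sum(s.unexposed_users for s in strata))
    return exposed_rate / unexposed_rate - 1


def stratified_lift(strata):
    """Compare exposed users with unexposed users from the same stratum."""
    exposed_total = sum(s.exposed_users for s in strata)
    actual = sum(s.exposed_conversions for s in strata) / exposed_total
    # Counterfactual rate: each stratum's unexposed conversion rate,
    # weighted by how many exposed users fall in that stratum.
    counterfactual = sum(
        s.exposed_users * s.unexposed_conversions / s.unexposed_users
        for s in strata
    ) / exposed_total
    return (actual - counterfactual) / counterfactual


# Hypothetical data in which the ad has no effect within either stratum.
strata = [
    Stratum(4000, 200, 1000, 50),  # frequent users: 5% convert either way
    Stratum(1000, 10, 4000, 40),   # infrequent users: 1% convert either way
]

print(f"Naive lift: {naive_lift(strata):.0%}")
print(f"Stratified lift: {stratified_lift(strata):.0%}")
```

Here the adjustment recovers the true lift of zero, but only because the single trait used for matching captures every difference between the two groups. When exposure also depends on traits the analyst cannot observe, the adjusted estimate stays biased.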

But when put to the test, no clear winner emerged. Moreover, not a single observational method performed reliably well.

“Sometimes, in some studies, they do pretty well,” says Gordon. “In other studies they don’t just perform a little bit badly, they do horribly.”

In fact, just how poorly the observational methods fared was a surprise even to the authors. This was particularly true for the promising matched-market test, where large, demographically similar geographic markets are paired up, and one is randomly assigned to be targeted in a campaign, while the other is held as a control. In a sense, the matched-market test is a true experiment, but at the level of the market, instead of the individual.

And yet it produced results that were overly dependent on which matched market ended up in which condition.
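
One way to see why the assignment matters so much: with only a pair of markets, any small difference in their baseline conversion rates feeds straight into the estimate. The sketch below uses two made-up markets and an assumed true lift of 5 percent. None of these numbers come from the study.

```python
def estimated_lift(treated_baseline, control_baseline, true_lift):
    """Lift a matched-market test would report for one assignment."""
    treated_rate = treated_baseline * (1 + true_lift)
    return treated_rate / control_baseline - 1


# Hypothetical baseline conversion rates for two "similar" markets.
market_a = 0.020
market_b = 0.022
true_lift = 0.05  # assumed true effect of the campaign

if_a_treated = estimated_lift(market_a, market_b, true_lift)
if_b_treated = estimated_lift(market_b, market_a, true_lift)

print(f"Market A treated: {if_a_treated:+.1%}")
print(f"Market B treated: {if_b_treated:+.1%}")
```

With these numbers, the same campaign appears to reduce conversions by about 4.5 percent under one assignment and to raise them by about 15.5 percent under the other. Randomizing across millions of individuals averages such differences away. Randomizing across a couple of markets does not.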

“The degree of variation was stunning,” says Zettelmeyer.

No Good Substitutes

The results suggest that, in the absence of an RCT, it is difficult to determine an ad’s lift with any degree of accuracy. It is also nearly impossible to predict in advance just how inaccurate a particular observational technique will be. This means that companies cannot get away with running a single RCT, determining how much their favorite way of measuring is “off”—by a factor of two, for instance—and then adjusting their future measurements by that amount.

“Dividing by two is almost certainly wrong over time and across studies,” says Zettelmeyer. “It’s not constant. It’s not a good rule-of-thumb.”

The results also highlight the challenge of finding an appropriate control group outside of the context of a true experiment—a takeaway that applies beyond marketing.

As academic researchers, “we’re used to making lots of assumptions when we create models,” says Gordon. “What we’re not always forced to do is to really think long and hard about whether the assumptions are actually correct or not.”

The authors acknowledge that observational methods are not going anywhere. Nor should they, as they work well in many settings outside of advertising. Their sheer convenience allows them to provide data scientists with larger and richer datasets than RCTs are likely to provide anytime soon.

“It just happens to be that, despite our best efforts at this point, we can’t say that for advertising measurement these methods actually work well as a substitute for running RCTs,” says Zettelmeyer.

His takeaway for marketers? “Look, we understand that in many cases you can’t run an RCT. But please, if you can run it, for heaven’s sake, do.”

About the Writer
Jessica Love is editor in chief of Kellogg Insight.
About the Research

Gordon, Brett, Florian Zettelmeyer, Neha Bhargava, and Dan Chapsky. 2016. “A Comparison of Approaches to Advertising Measurement: Evidence from Big Field Experiments at Facebook.” White paper, Kellogg School of Management, Northwestern University.
