Advertisers pay good money to reach potential customers as they read the news on their phones, or shop on their tablets, or skim their Facebook feeds at work.
But determining whether those digital ads actually lead to new purchases? That can be surprisingly tricky.
In this Insight in Person podcast, we chat with Kellogg faculty members, as well as researchers from Facebook, to find out why many of the techniques that advertisers use to measure effectiveness are so inaccurate. We also learn what advertisers should be doing whenever possible.
Fred SCHMALZ: What ads do you see when you look at Facebook?
Emily STONE: I have an ad about keeping my pets safe with an ID tag … even though I actually currently have no pets. So that’s confusing.
Kendra BUSSE: Primary.com. Baby stuff! I don’t have babies, but these are really cute onesies.
Devin RAPSON: I’m just getting a series of whiskies and deodorants, which just says to me they think I’m a very typical young man. How wrong they are.
BUSSE: I’ve got an ad for a new movie with Michael Fassbender. Facebook scored on that one!
SCHMALZ: We know the ads that show up in our Instagram or Twitter or Facebook feeds didn’t get there by accident. Maybe we are the right age and gender; maybe something about our previous online behavior says: You might be interested in this product.
But have you ever wondered how effective these ads really are? How likely they are to cause us to buy a product or visit a site that we otherwise wouldn’t have?
It turns out that determining the effectiveness of online ads is really hard. Advertisers themselves don’t always know how to do it. And the stakes are high.
Brett GORDON: Advertising powers the Internet. The money of advertising powers the Internet.
SCHMALZ: Welcome to Insight In Person, the monthly podcast of the Kellogg School of Management. I’m your host, Fred Schmalz.
On this month’s episode, we explore the surprisingly complex and ever-evolving world of digital advertising.
It’s a topic that two Kellogg School professors recently tackled by teaming up with scientists from Facebook. Together, they explored a very interesting question: What do advertisers actually know about the effectiveness of their campaigns—and what do they just think they know?
So, stay with us….
GORDON: When Internet advertising came about first—and we think around 1994 was the first Internet display ad—initially people thought that this would herald a new age of advertising measurement.
SCHMALZ: That’s Brett Gordon, an associate professor of marketing at the Kellogg School.
Historically, measuring the value of advertising has been exceptionally difficult.
For TV ads, for instance, marketers can know roughly how many people in a given market were exposed. They can also determine how many people made a purchase. But there is no good way to tell whether these two groups of people are the same.
Marketers hoped online ads would be different.
GORDON: If I see you click, I can typically follow you through to some kind of outcome, like buying something, putting something in your shopping cart, and buying it.
SCHMALZ: That all sounds great. Digital advertising for the win! But, Gordon explains, the story is rapidly changing. It’s no longer a given that advertisers will be able to track a customer along their journey.
GORDON: You have your laptop, you have your desktop, you have your work desktop, you have your home desktop, you have your mobile phone for work, you have your mobile phone for home, you have your iPad, your Samsung device, et cetera, and the problem is that most marketers don’t know if it’s one person across all of these devices or if it’s ten people across all of these devices.
It’s true that we can see someone maybe convert on one device, but to link that to what they were exposed to on another device is very difficult.
SCHMALZ: In other words, online advertising has a lot of the same measurement problems as any other platform.
One of the thorniest of those problems is determining whether seeing an ad actually caused someone to purchase. Maybe your friend Bob was going to buy a new hot tub anyway, and seeing that ad had nothing to do with his decision.
Here’s Florian Zettelmeyer, a professor of marketing at Kellogg.
Florian ZETTELMEYER: In order to determine the effect of the ad, you need to know what would have happened if somebody had actually never seen the ad.
SCHMALZ: So what you’d really want to do, he says, is take a situation—a specific set of customers in a specific setting at a specific moment in time—and then clone it, creating two identical worlds, so to speak.
ZETTELMEYER: What we would then like to do is, if these two worlds were completely the same, we would like to show in one world an ad and in the other world we would like to not show an ad. If there is a difference in outcome, like a purchase or a web page view or something else that you care about as an advertiser, it has to be driven by the ad since there are no other differences between these two worlds.
SCHMALZ: But obviously cloning worlds just isn’t possible—even for marketers.
So researchers in a variety of scientific fields have come up with a next-best alternative: a randomized controlled trial.
The basic idea? Take a group of people and randomly assign them to either a test group—which in an advertising study would be people targeted for an ad—or a control group, which is not targeted.
ZETTELMEYER: You see, this doesn’t exactly replicate this idea of having two cloned worlds, because obviously the two groups consist of different individuals and so they can’t really be clones of each other. But by randomly making people show up in one group or the other group, we create what we refer to as probabilistically equivalent groups, and what that basically means in lay terms is you create as close as you can possibly get to a true apples-to-apples comparison.
SCHMALZ: If the groups are big enough, this randomization should make them comparable.
Randomized controlled trials, or RCTs for short, are considered the “gold standard” for determining causality across the sciences.
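The random assignment at the heart of an RCT is simple to sketch. The snippet below is a toy illustration in Python—the user IDs, group sizes, and 50/50 split are all hypothetical, not drawn from the study:

```python
import random

random.seed(42)  # fixed seed so the illustration is reproducible

# Hypothetical audience; in practice these would be ad-platform user IDs.
users = [f"user_{i}" for i in range(100_000)]

# Randomly assign each user to the test group (eligible to be shown
# the ad) or the control group (deliberately not shown the ad).
test_group, control_group = [], []
for user in users:
    if random.random() < 0.5:
        test_group.append(user)
    else:
        control_group.append(user)

# With groups this large, any trait -- age, income, intent to buy --
# is balanced in expectation, so a later difference in purchase rates
# between the two groups can be attributed to the ad itself.
print(len(test_group), len(control_group))
```

The point of the coin flip is that nothing about a user (observed or not) influences which group they land in—that is what makes the groups "probabilistically equivalent."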
Yet despite that, you don’t always see a lot of advertisers using them.
In fact, there isn’t really a “standard” way of measuring effectiveness at all.
Here’s Neha Bhargava, advertising research manager at Facebook, whose role gives her unique insight into what advertisers actually do.
Neha BHARGAVA: There’s basically a bajillion different types of ways that people try to measure ad effectiveness, so part of it depends on what data you actually have at your fingertips.
SCHMALZ: Building a truly randomized control group requires foresight, additional resources, and technical capabilities that not all advertising platforms have.
Remember that whole tracking problem?
Here’s Bhargava’s colleague, Facebook data scientist Daniel Chapsky, to explain.
Daniel CHAPSKY: If I am serving an ad on some display network and then I see the same ad both on my desktop and on my mobile device, there’s a good chance that whoever is serving that ad cannot link my mobile device to my desktop device, which immediately can ruin the result of your tests because the control group might get exposed and vice versa.
So you actually need more than just money. You need an infrastructural change in what you can do. Right now, it’s impossible to run a high-quality RCT on every channel that you market on.
SCHMALZ: So what do advertisers do instead? Put simply, they have developed techniques to get around having a proper control group.
One technique might be to compare the purchases of people who were exposed to the ads to the purchases of people who were not exposed. But for a variety of reasons, the people who actually see an ad can be quite different from those who don’t.
A more sophisticated technique might compare those who see an ad to those who don’t but are similar along a variety of dimensions, like demographics, or zip code, or previous online behavior.
Here’s Zettelmeyer again.
ZETTELMEYER: The hope is that you have enough information about them that you’ve moved from apples-to-oranges to apples-to-apples. The challenge in doing this, of course, is that you can only create an apples-to-apples comparison on the stuff you know about people. And so what these methods have trouble dealing with is if an important difference that determines how likely you are to buy a product turns out to be a difference that we don’t know about.
SCHMALZ: There are other methods, too.
Some rely on before–after comparisons: run an ad for two weeks, don’t for the next two. Or try a matched-market test, which is kind of like an RCT, except the randomization is done on markets instead of individuals: San Francisco and Boston randomly participate in a campaign; New York and Chicago don’t.
But just how good are all of these workarounds? Can any of them accurately stand in for a randomized controlled trial?
SCHMALZ: Enter Facebook. The company was the perfect research partner for Zettelmeyer and Gordon in tackling these questions.
Because Facebook requires its users to log in, it doesn’t have the same tracking problems as other platforms—meaning Facebook knows it’s you, whether you are on your phone on the train, your tablet in a coffee shop, or your desktop at work. This allows it to conduct true RCTs. And that means the researchers could compare how well ads performed in RCTs to how well advertisers thought they performed when they used different measurement methods.
What they found surprised even them. Here’s Zettelmeyer.
ZETTELMEYER: There was a big discrepancy between these commonly used approaches, these observational methods, and the randomized control trials in our studies.
SCHMALZ: In some cases, “big discrepancy” is quite the understatement. These methods might inflate estimates of an ad’s true effectiveness by 100, 300, even 500 percent. Meaning companies might think their ads are 500 percent more effective than they actually are.
Even more worryingly, the same method might be relatively accurate in one study, but wildly inaccurate in the next. Here’s Gordon.
GORDON: I think this makes it really challenging for an advertiser to then take these results and generate some kind of rule of thumb that says, well if my observational methods consistently seem to be twice that of what the experimental result is, then I don’t need to run any more experiments, I can always just adjust my results by a factor of two, or divide by two. But the problem is, sometimes it might be a factor of two, sometimes it might be a factor of ten, and if I was an advertiser, I would feel I think fairly uncomfortable having such a great uncertainty over my eventual ROI.
SCHMALZ: This, of course, presents a dilemma for advertisers. The methods they are using are inaccurate, but they may not always be able to use the more accurate RCTs instead.
But that doesn’t mean that all hope is lost. Bhargava points out that you might not need to run RCTs in every case. If you can run an RCT once, there may be results that you can extrapolate to inform other strategic marketing decisions.
Have a theory about the optimal frequency with which to regale people like Bob with hot-tub ads? Test it.
BHARGAVA: Should I show him an ad once a week, three times a week, five times a week? And so if you learn something like that on one platform, you can start trying to figure out how do I then use that same information across other platforms and in my marketing more generally.
SCHMALZ: But perhaps the biggest takeaway from this study is to reconsider whether you can afford not to do RCTs.
Here’s Zettelmeyer again.
ZETTELMEYER: So our recommendation would be that if you have the option of running a randomized control trial, then you should really try to do it. If you can’t do it, then you need to be as smart as you can about the methods and the data that you’re going to use to substitute for it, but you have to understand that you could potentially be pretty far off from the true advertising effectiveness.
SCHMALZ: And if an RCT is off the table, at least make sure you are working with the best data possible. Because while none of the alternative methods that the researchers tested truly stacked up, some did perform better than others. And the winners tended to take into account rich user data.
GORDON: If I was a firm and I was thinking about whether I put more resources into figuring out better methods or purchasing better methods rather than just having better data to bring to the problem, I would put my bet on data any day.
SCHMALZ: Next for these researchers is getting the word out about their findings. Because advertisers may not know what they don’t know.
BHARGAVA: One of the biggest problems in the industry is that, normally, regardless of how it’s being measured, results are being presented back as a causal result from an RCT. And so when an advertiser is shown data from one department where it was from RCT and a different department from the same advertiser where it wasn’t, they’re being compared in the exact same way and assumed that they’re both causal and that it’s a true incremental value.
And so in this complicated scenario, it’s really hard for them to understand how to look at those results differently.
SCHMALZ: But though advertisers would benefit from paying attention to how various measurements were derived, not everyone is excited to get the message.
GORDON: The reception has been really interesting. There’s been one group of people who, when we tell them what the results are, it seems to coincide exactly with what they were expecting.
Then there’s another camp, which I think was very much hoping that these methods could be reasonable substitutes. When we presented these results at a very large advertising conference, a person sitting next to me who I think worked at an ad agency just said, “Really great results, really great research, I hope you’re completely wrong.”
SCHMALZ: Because, of course, it takes real work to do RCTs correctly.
Plus, if the results are right, then some advertisers may not be getting the bang for the buck they have come to expect. And it may not just be advertisers who would prefer to keep their heads in the sand a little longer.
Improvements in the ability to measure ad effectiveness could have a pretty huge effect on the Internet itself. Gordon explains that total digital advertising spending for 2016 is estimated at $68 billion—and a lot of online services and websites depend on ad revenue to stay in business.
GORDON: A lot of news websites provide content for free; a lot of other sites present content for free, because they’re ad-supported. So advertising really powers a lot of the Internet, and unfortunately, it’s unclear how much of that money would persist if everyone could measure the effectiveness of those ads as accurately as possible.
SCHMALZ: This doesn’t necessarily mean there will be less money spent on digital advertising overall, Gordon says. That amount might even increase as marketers gain a better understanding of what is and is not effective. But a reallocation of spending will likely change what the web looks like for all of us.
GORDON: It’ll be interesting to see going forward how some of that maybe reallocates across different channels, maybe back to TV or more into display instead of search, and that I think will start to have an effect on the types of sites and types of content that’s available to just everyday users.
SCHMALZ: This program was produced by Jessica Love, Kate Proto, Fred Schmalz, Emily Stone, and Michael Spikes.
Special thanks to Kellogg School professors Florian Zettelmeyer and Brett Gordon, as well as Facebook researchers Neha Bhargava and Dan Chapsky. Thanks, also, to Kendra Busse and Devin Rapson who talked with us about the Facebook ads they see.
You can stream or download our monthly podcast from iTunes, Google Play, or from our website, where you can read more on digital marketing, advertising, and data analytics. Visit us at insight.kellogg.northwestern.edu. We’ll be back next month with another Insight In Person podcast.