Is Your Digital Advertising Campaign Working?
Data Analytics · Strategy · Marketing · Mar 11, 2016


If you are not running a randomized controlled experiment, you probably don’t know.

Measuring the success of a digital advertising campaign.


Based on the research of Brett Gordon, Florian Zettelmeyer, Neha Bhargava, and Dan Chapsky

Today a digital advertising campaign can reach a potential customer as she skims the news on her morning train ride, takes a break from emailing at her desk, scours restaurant reviews over cocktails, and admires a friend’s vacation photos after dessert.


A proliferation of digital platforms and devices makes this kind of campaign possible. But it also makes measuring the campaign’s success challenging.

The gold standard for measuring an advertisement’s “lift” — that is, its direct effect on a user’s probability of converting — is a true experiment, or “randomized controlled trial” (RCT), as data scientists call it. But Florian Zettelmeyer, a professor of marketing at the Kellogg School, explains, “RCTs are hard to run. They can be expensive to run. They can take a lot of coordination to run.” So many marketers rely on a litany of alternative methods, easier to implement and often capable of drawing conclusions from data that have already been collected.

Just how well do these alternative methods work?

Zettelmeyer and Brett Gordon, an associate professor of marketing at Kellogg, recently coauthored a whitepaper with Facebook researchers Neha Bhargava and Dan Chapsky in an effort to find out. The upshot: even across a single platform, and using the exact same advertising studies, these alternative methods tend to be inaccurate — sometimes wildly so.

Benefits of a True Experiment

To accurately measure the lift of an ad placed on Facebook, Google, or another digital platform, it is not enough for marketers to calculate how likely it is that someone who sees the ad will “convert,” the industry’s lingo for an event the advertiser cares about — for example, a purchase, a registration, or a page visit. They also must determine whether the conversion happened because of the ad — that is, whether the ad caused the conversion. But causality is surprisingly difficult to pin down. A perfect test would require the impossible: two parallel worlds, identical except that in the first world someone sees an ad and in the second that same person sees no ad.


Because parallel worlds remain stubbornly unavailable to researchers, the next best thing is an RCT, where individuals are randomly divided into a treatment group, which sees the ads, and a control group, which does not. Randomization ensures that the two groups do not differ systematically in important ways, like demographics, lifestyle, or personality.
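As a rough sketch of the arithmetic (with made-up counts, not figures from the study), the lift from an RCT is simply the relative difference between the two groups’ conversion rates:

# Minimal sketch of estimating lift from an RCT; all counts here are hypothetical.
treatment_users = 1_000_000      # randomly assigned to be eligible to see the ads
control_users = 1_000_000        # randomly assigned to see no ads

treatment_conversions = 4_500    # e.g., purchases among the treatment group
control_conversions = 3_000      # purchases among the control group

rate_treatment = treatment_conversions / treatment_users   # 0.45%
rate_control = control_conversions / control_users         # 0.30%

# Lift: the relative increase in conversion probability attributable to the ads.
lift = (rate_treatment - rate_control) / rate_control
print(f"Estimated lift: {lift:.0%}")                       # 50% with these made-up counts

Because chance alone decides who lands in which group, any gap between the two rates can be credited to the ads themselves.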

And yet, for a variety of reasons, many marketers do not use RCTs.

For one, until recently many platforms did not offer the capability (and some still do not). RCTs can be time-consuming to implement correctly, requiring hours of additional work from engineers and data scientists without necessarily generating any additional income. Nor is it obvious that improved accuracy would work in a given platform’s favor, as advertising is not always particularly effective. “A lot of people in the industry simply aren’t incentivized to make sure you get the right estimate,” says Gordon.

In addition, many businesses are already convinced of the effectiveness of their campaigns. “If you believe your ads work, then running an experiment looks like you are wasting money,” says Zettelmeyer.

Finally, there is a broad consensus among businesses that less costly observational methods work — if not perfectly, at least well enough. These methods offer workarounds for not having a properly randomized control group, like matching two groups of users across a variety of demographic characteristics, or comparing the same group before and after a campaign.

Do they work well enough? It is this assumption that the authors put to the test.

On behalf of clients, Facebook conducted RCTs to measure the effectiveness of twelve different advertising campaigns, each of which ran in the United States beginning in January 2015. The campaigns were large, involving over a million users each (for a total of 1.4 billion impressions), and spanned a variety of companies and industries. The Kellogg and Facebook researchers analyzed these campaigns.

Because Facebook requires users to log in across browsers and devices, the authors were able to reliably follow a person’s journey from ad to purchase, even when that person moved between phone and computer along the way. Additionally, the authors were able to use anonymized demographic information in their estimations.

“They can track the two key things we care about,” explains Gordon, “which is when you get exposed to an ad, and when you convert on any of the devices.”

Powerful Forces Working Against Observational Methods

With accurate measurements from the RCTs in hand, the authors then tested a variety of observational methods to see how they stacked up.

The most straightforward observational method consists of simply comparing the conversion rates of those who see an ad and those who do not. But unfortunately, these two groups tend to differ in ways that go beyond advertising. For instance, users who rarely log into Facebook are less likely to be shown an ad — and they are probably also less likely to make online purchases, perhaps because they are less likely to be online in the first place.

“Even though the ad did nothing, the person who saw the ad is going to look like they purchased more than the person who didn’t see the ad,” says Zettelmeyer.

Moreover, advertisers put an enormous amount of effort into ensuring that ads are targeted to the people most likely to respond to them. An advertiser might initially target women between the ages of 18 and 49 — but if Facebook’s ad-targeting algorithm learns that conversion rates are higher for younger women, it will fine-tune the target audience to get the most bang for the client’s buck. This further muddles efforts to discern causality when not using an RCT: Did seeing the ad make people buy, or were the people who buy simply more likely to see the ad?

“There are really, really powerful forces that make these two [exposed and unexposed] groups not the same,” says Zettelmeyer.

Indeed, the authors found that this comparison tended to wildly inflate lift: measuring it at 416%, for instance, when an RCT suggested a lift closer to 77%.
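A toy simulation (hypothetical parameters, not the study’s data) shows how this kind of selection can manufacture a large naive “lift” even when the ad has no effect at all:

# Toy simulation of why a naive exposed-vs-unexposed comparison overstates lift:
# heavy platform use drives both ad exposure and baseline purchasing, even though
# the ad in this simulation does nothing at all. All parameters are hypothetical.
import random

random.seed(0)
exposed_outcomes, unexposed_outcomes = [], []
for _ in range(1_000_000):
    heavy_user = random.random() < 0.3           # 30% of users are heavy users
    p_exposed = 0.8 if heavy_user else 0.1       # heavy users see far more ads...
    p_convert = 0.02 if heavy_user else 0.005    # ...and buy more online anyway
    saw_ad = random.random() < p_exposed
    converted = random.random() < p_convert      # seeing the ad changes nothing
    (exposed_outcomes if saw_ad else unexposed_outcomes).append(converted)

rate_exposed = sum(exposed_outcomes) / len(exposed_outcomes)
rate_unexposed = sum(unexposed_outcomes) / len(unexposed_outcomes)
naive_lift = (rate_exposed - rate_unexposed) / rate_unexposed
print(f"Naive lift estimate: {naive_lift:.0%}")  # well over 100%, despite a true lift of 0%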

Other observational methods attempt to counteract these powerful forces: comparing the conversion rates of exposed and unexposed users only after the groups have been matched for a variety of traits, using sophisticated “propensity scoring” to adjust for differences between the groups, conducting matched-market tests, or comparing conversion rates for the same group of users before and after a campaign.
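One common adjustment of this kind is inverse-propensity weighting: a model predicts each user’s probability of being exposed from observed traits, and conversion rates are reweighted so that the exposed and unexposed groups look alike on those traits. The sketch below runs it on synthetic data; it is a generic illustration of the idea, not the authors’ exact specification.

# Generic inverse-propensity-weighting sketch on synthetic data (an illustration,
# not the specification used in the whitepaper).
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200_000
activity = rng.uniform(size=n)                                            # hypothetical "platform activity" trait
age = rng.integers(18, 65, size=n)
exposed = (rng.uniform(size=n) < 0.1 + 0.6 * activity).astype(int)        # active users see more ads...
converted = (rng.uniform(size=n) < 0.002 + 0.01 * activity).astype(int)   # ...and convert more anyway
df = pd.DataFrame({"age": age, "activity": activity,
                   "exposed": exposed, "converted": converted})

# 1. Model the probability of exposure given the observed covariates.
covariates = ["age", "activity"]
propensity = LogisticRegression().fit(df[covariates], df["exposed"]).predict_proba(df[covariates])[:, 1]

# 2. Reweight users so the exposed and unexposed groups are balanced on those covariates.
weights = np.where(df["exposed"] == 1, 1 / propensity, 1 / (1 - propensity))
is_exposed = (df["exposed"] == 1).to_numpy()
rate_exposed = np.average(df["converted"].to_numpy()[is_exposed], weights=weights[is_exposed])
rate_unexposed = np.average(df["converted"].to_numpy()[~is_exposed], weights=weights[~is_exposed])
print(f"Adjusted lift estimate: {(rate_exposed - rate_unexposed) / rate_unexposed:.0%}")
# The catch: the adjustment can only remove differences the observed covariates capture.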

But when put to the test, no clear winner emerged. Moreover, not a single observational method performed reliably well.

“Sometimes, in some studies, they do pretty well,” says Gordon. “In other studies they don’t just perform a little bit badly, they do horribly.”

In fact, just how poorly the observational methods fared was a surprise even to the authors. This was particularly true for the promising matched-market test, where large, demographically similar geographic markets are paired up, and one is randomly assigned to be targeted in a campaign while the other is held as a control. In a sense, the matched-market test is a true experiment, but at the level of the market instead of the individual.

And yet it produced results that were overly dependent on which matched market ended up in which condition.

“The degree of variation was stunning,” says Zettelmeyer.
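A bare-bones version of that market-level design, with hypothetical markets and matching scores, pairs the most similar markets and then flips a coin within each pair:

# Bare-bones matched-market assignment; the markets and scores are made up.
import random

random.seed(1)

# Each market gets a single matching score summarizing size and demographics.
markets = {"Market A": 0.61, "Market B": 0.63, "Market C": 0.42,
           "Market D": 0.44, "Market E": 0.57, "Market F": 0.58}

# Pair the markets with the most similar scores, then randomize within each pair.
ordered = sorted(markets, key=markets.get)
pairs = [(ordered[i], ordered[i + 1]) for i in range(0, len(ordered), 2)]

assignment = {}
for first, second in pairs:
    treated = random.choice([first, second])          # one market gets the campaign
    control = second if treated == first else first   # its match is held back as control
    assignment[treated], assignment[control] = "campaign", "control"

print(assignment)

With only a handful of market pairs, though, the resulting lift estimate hinges on which market of each pair lands in which condition, which is exactly the instability the authors observed.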

No Good Substitutes

The results suggest that, in the absence of an RCT, it is difficult to determine an ad’s lift with any degree of accuracy. It is also nearly impossible to predict in advance just how inaccurate a particular observational technique will be. This means that companies cannot get away with running a single RCT, determining how much their favorite way of measuring is “off” — by a factor of two, for instance — and then adjusting their future measurements by that amount.

“Dividing by two is almost certainly wrong over time and across studies,” says Zettelmeyer. “It’s not constant. It’s not a good rule of thumb.”

The results also highlight the challenge of finding an appropriate control group outside of the context of a true experiment — a takeaway that applies beyond marketing.

“As academic researchers, we’re used to making lots of assumptions when we create models,” says Gordon. “What we’re not always forced to do is to really think long and hard about whether the assumptions are actually correct or not.”

The authors acknowledge that observational methods are not going anywhere. Nor should they, as they work well in many settings outside of advertising. Their sheer convenience allows them to provide data scientists with larger and richer datasets than RCTs are likely to provide anytime soon.

“It just happens to be that, despite our best efforts at this point, we can’t say that for advertising measurement these methods actually work well as a substitute for running RCTs,” says Zettelmeyer.

His takeaway for marketers? “Look, we understand that in many cases you can’t run an RCT. But please, if you can run it, for heaven’s sake, do.”

About the Writer

Jessica Love is editor in chief of Kellogg Insight.

About the Research

Gordon, Brett, Florian Zettelmeyer, Neha Bhargava, and Dan Chapsky. 2016. “A Comparison of Approaches to Advertising Measurement: Evidence from Big Field Experiments at Facebook.” White paper, Kellogg School of Management, Northwestern University.

