A new study put this exact proposition to the test. One hundred sixty-four teams of researchers analyzed the same financial-market dataset separately and wrote up their conclusions in 164 short papers. Teams were then given several rounds of feedback, mimicking the kind of informal peer-review process that economists engage in before they submit to an academic journal. All the researchers involved wanted to know how much variation would exist among their different papers.
It turns out, a lot.
Data can be messy, notoriously so. And so scientists and researchers have developed reams of strategies for cleaning and analyzing and ultimately harnessing data to draw conclusions. But this unusual study—an analysis of 164 separate analyses—suggests that the decisions that go into choosing how to clean the datasets, analyze them, and come to a conclusion can in fact add just as much noise as the data themselves.
In an increasingly data-driven world, this is important to keep in mind, according to Robert Korajczyk, a professor of finance at Kellogg. Korajczyk and a former Kellogg PhD student, Dermot Murphy, now a professor at the University of Illinois Chicago, served as one of the 164 research teams involved in the project.
Kellogg Insight recently spoke with Korajczyk about the experience, and what researchers and the general public can take away from the study’s surprising conclusion.
This conversation has been edited for length and clarity.
Kellogg Insight: Can you start by explaining the data that you and the other 163 research teams were asked to analyze?
Korajczyk: Yes. Each research team was given a dataset that covers 17 years of trading activity in the most liquid futures contract in Europe, the Euro Stoxx 50. That was essentially 720 million trades. And there were six research questions that teams were asked to look at. For example, did pricing get more or less efficient? Did the markets get more or less liquid? And did the fraction of agency trades change over time?
KI: These are pretty fundamental trends that you would want to understand if you were trying to gauge the health of this market.
Korajczyk: Yes, absolutely. But the broader goal of the research was what really interested me.
KI: Namely, how different research teams would approach the same set of questions?
Korajczyk: Yes. These types of “crowdsource” projects have happened in other fields, but this is the first that I’m aware of in finance. And few projects are at the scale of this particular project. It’s more typical to have 15 or 20 teams. A hundred and sixty-four is really large. So my coauthor Dermot Murphy and I decided to team up and get involved.
KI: Talk to me about the 164 different papers that were submitted. What should we understand?
Korajczyk: There’s a statistical concept called “standard error,” which tells you about the uncertainty in a parameter estimate such as a mean. The standard error of a mean is going to be larger when data are noisy and it’s going to be smaller when there are more observations.
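To make the two properties Korajczyk mentions concrete, here is a minimal Python sketch of the standard error of a mean, computed as the sample standard deviation divided by the square root of the number of observations. The function name and the toy numbers are made up for illustration and have nothing to do with the study's data.

```python
import math
import statistics


def standard_error_of_mean(values):
    """Sample standard deviation divided by sqrt(n)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    return statistics.stdev(values) / math.sqrt(n)


# Hypothetical toy samples (not from the study).
quiet = [1.0, 1.1, 0.9, 1.0, 1.0, 1.1, 0.9, 1.0]
noisy = [1.0, 3.0, -1.0, 2.5, -0.5, 1.0, 3.0, -1.0]

se_quiet = standard_error_of_mean(quiet)
se_noisy = standard_error_of_mean(noisy)

# Same values repeated four times: more observations, similar spread.
se_quiet_more = standard_error_of_mean(quiet * 4)

print(f"quiet sample:        {se_quiet:.4f}")
print(f"noisy sample:        {se_noisy:.4f}")
print(f"quiet, 4x as many:   {se_quiet_more:.4f}")
```

The noisier sample yields a larger standard error than the quiet one, and the quiet sample with four times as many observations yields a smaller one.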
But then there is another kind of “error” or noise to take into consideration. And that’s all the decisions that go into getting to that point. There are a lot of different ways to measure market efficiency, for instance, so that’s one of the decisions that a research team would have to make. When you clean the data, how do you handle those outliers? Do you throw them out or do you change them to another value that is large but not as large? What will be the form of your statistical model? What software are you using? Are you a good coder or a bad coder?
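The outlier question can be illustrated with a small Python sketch. It contrasts the two options described above: dropping extreme values, or replacing them with a value that is "large but not as large" (often called winsorizing). The cutoffs, function names, and numbers are assumptions chosen for illustration, not anything the research teams used.

```python
from statistics import mean


def trim(values, lower, upper):
    """Throw out observations outside [lower, upper]."""
    return [v for v in values if lower <= v <= upper]


def winsorize(values, lower, upper):
    """Pull observations outside [lower, upper] back to the nearest bound."""
    return [min(max(v, lower), upper) for v in values]


# Hypothetical daily returns with one extreme value (not from the study).
returns = [0.01, -0.02, 0.00, 0.03, -0.01, 0.02, 0.90]

# Hypothetical cutoffs; a different team might reasonably pick others.
LOWER, UPPER = -0.05, 0.05

trimmed = trim(returns, LOWER, UPPER)
winsorized = winsorize(returns, LOWER, UPPER)

print(f"raw mean:        {mean(returns):.4f}")
print(f"trimmed mean:    {mean(trimmed):.4f}")
print(f"winsorized mean: {mean(winsorized):.4f}")
```

Each of the three treatments gives a different mean from the same seven numbers, which is the kind of defensible but consequential choice being described.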
All those choices that are made by the research team, as well as their inherent ability, go into creating new variation in the output. We call this the “nonstandard error.”
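One way to picture a nonstandard error is as the spread of point estimates across teams that all started from the same data. The Python sketch below reports that spread two ways, as a standard deviation and as an interquartile range. The choice of spread measure, the function name, and the team estimates are all assumptions made for illustration; the numbers are invented and are not results from the study.

```python
import statistics


def cross_team_dispersion(estimates):
    """Return (standard deviation, interquartile range) of team estimates."""
    q1, _median, q3 = statistics.quantiles(estimates, n=4)
    return statistics.stdev(estimates), q3 - q1


# Hypothetical estimates of the same quantity from eight teams
# (for example, an annual trend in percent). Invented numbers.
team_estimates = [-1.2, -0.8, -1.0, -0.3, -1.5, 0.4, -0.9, -1.1]

sd_across_teams, iqr_across_teams = cross_team_dispersion(team_estimates)

print(f"std. dev. across teams: {sd_across_teams:.3f}")
print(f"IQR across teams:       {iqr_across_teams:.3f}")
```

If every team made identical choices and no coding mistakes, all the estimates would coincide and both measures would be zero. Any spread that remains reflects the teams' decisions rather than noise in the data.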
KI: And when the teams originally submitted their papers, these nonstandard errors were about as large as the standard errors.
Korajczyk: Right, so I guess one way to think about it is if you’re going to read a paper and say, “Okay, how much credence do I place on these results?” the standard errors tell you something about the noise in the data. But the researchers made a lot of choices that I may or may not have made. So maybe that noisiness in the results is actually double what it looks like from just looking at the standard errors.
KI: Did that surprise you?
Korajczyk: It doesn’t surprise me that there was variation. The size was larger than I thought it would be. There were also some clear outliers that seemed totally outlandish to me.
Another surprise was that some of these outlandish results were there in every round. At each stage you learn something about what reviewers think or what other teams have done, and you’re allowed to revise your paper with that knowledge. But even after peer review and the opportunity to see other teams’ papers, a lot of outlandish results stuck around.
In each stage, though, the dispersion across teams did go down somewhat.
KI: It seems there were some true philosophical differences in how the questions should be approached and how the analyses should be conducted.
Korajczyk: Absolutely. And in a sense this project actually constrained these differences. We were told, “Here are the data and you’re only allowed to use these data.” You weren’t allowed to grab other data that might be relevant for answering that question and add them to the database. That would have likely increased the dispersion across teams.
KI: There’s certainly a “researchers beware!” message to this work, as you determine just how much you can trust the conclusions in the literature. This only adds to growing concern among scientists about a “replication crisis.”
Are there certain changes that you think researchers should make to account for these ubiquitous nonstandard errors? For instance, should academic articles allot more space to methods sections so that researchers can communicate more transparently about their choices?
Korajczyk: The standard has always been that someone who’s read your paper and decides to replicate it should be able to do that from what you’ve written in the paper. If you have truncated some outliers, they should know exactly how you truncated them. Now I can’t guarantee you that every paper is written that way. But that’s the standard of good writing, and that standard has been there for a long time.
But what a paper doesn’t normally tell people is, “I tried this specification and decided not to use it, and I tried that specification and decided not to use it. And, oh yeah, I should have controlled for this other variable.”
But there are some changes for the better. These days it is much more common to have lengthy appendices available on the journal’s website. These can go into much more detail about the robustness of the results. That can give the reader some confidence that you can look at the data in a lot of different ways and get the same results. Does everyone go through and read the 120-page appendix? No, but people who are very interested in that topic might. Another thing that’s getting more common is requiring researchers to post our code. That makes it easier to replicate results and determine whether they are robust.
KI: What should the general public make of this research? If I’m reading an article in Bloomberg or The Wall Street Journal that cites a new finance study, how seriously should I take those conclusions?
Korajczyk: Well, whether it’s finance research or medical research or psychology or sociology, it’s always helpful to be skeptical. If I’m listening to the news, for instance, one thing that news reports rarely tell you is the sample size of the study. Now, with Covid-19, this is changing somewhat, but knowing the sample size tells me a lot about whether I want to take this result seriously.
I also think it’s helpful to ask, “What are the incentives?” If it is someone trying to get tenure, there is a bias toward finding statistically significant results. If it is someone who works for a money-management firm, their financial incentives could be aligned with economically significant results going in a particular direction.
Finally, be cognizant of the fact that there are many different choices that researchers have to make. If you read, “we did X” in one line in a paper or footnote, it may not be as innocuous as it seems.