The Problem with Statistics

I was thinking more about David Brooks’s anti-data column from earlier in the week, and I realized what is really bothering me.

Brooks expresses skepticism about numbers, about the limitations of raw data, about the importance of human thinking. Fine, I agree with all of this, to some extent.

But then Brooks turns around and uses numbers unquestioningly and uncritically (OK, not completely uncritically; see P.S. below). In a notorious recent case, Brooks wrote, in the context of college admissions:

You’re going to want to argue with Unz’s article all the way along, especially for its narrow, math-test-driven view of merit. But it’s potentially ground-shifting. Unz’s other big point is that Jews are vastly overrepresented at elite universities and that Jewish achievement has collapsed. In the 1970s, for example, 40 percent of top scorers in the Math Olympiad had Jewish names. Now 2.5 percent do.

But these numbers are incorrect, as I learned from Janet Mertz, a professor of oncology at the University of Wisconsin – Madison who has published a relevant article in the Notices of the American Mathematical Society on mathematics performance by gender and ethnicity in national and international mathematics competitions. Mertz found, based on her direct interviews with these students, that over 12% (her best guess is something like 16%, I think) of recent Math Olympiad participants were Jewish (and she believes the estimate of 40% for earlier years is too high). It turns out that the numbers Brooks reported had been constructed from some sloppy counting.

My beef here, though, is not with Ron Unz, who did the sloppy counting. Unz is a political activist and it is natural for him to interpret the data in ways that are favorable to his case. Data analysis can be tricky, and even when people are trying to do their best, it’s easy to make mistakes and to get trapped by one’s own analysis (see, for example, Daryl Bem). It’s hard to get too angry at a political activist for finding what he’s looking for.

And my beef is not with David Brooks for including some faulty numbers in his column. There’s no way he has time to check every claim in everything he reads. There’s no perfect quality control, and the New York Times does not have the resources to fact-check every one of its op-ed columns.

No, my beef is with David Brooks for not correcting his numbers. Janet Mertz contacted him and the Times to report that his published numbers were in error, and I also contacted Brooks (both directly and through an intermediary). But no correction has appeared.

The funny thing is, yesterday’s column would’ve been the perfect place for Brooks to make his correction. He could’ve just added a paragraph such as the following:

One trouble with numbers is they can be misleading. For example, I myself [Brooks] was misled just a couple months ago when reporting a claim by magazine publisher Ron Unz about a so-called “collapse of Jewish achievement.” In my column, I uncritically presented Unz’s claim that the percentage of top scorers on the American high school math Olympiad team who had Jewish names had declined to 2.5%. The actual percentage is over 12%, as I have learned from Prof. Janet Mertz of the University of Wisconsin, who has published peer-reviewed articles on the topic of high-end mathematics achievement. The actual data show evidence not of a dramatic “collapse” but rather a gradual decline, explainable by increased competition for a fixed number of slots on the Olympiad team, together with demographic changes.

OK, that’s not so pithy. I’m sure Brooks and his editor could do better. My point is that, if Brooks wants to talk about the limitations of data, he could start with himself.

The problem with Brooks, as with many “quals,” is not that he operates on a purely qualitative level but rather that he does use data without distinguishing between good and bad data. He doesn’t seem to care.

To put it another way, if Brooks wants to claim, of American Jews, that “the fanatical generations of immigrant strivers have been replaced by a more comfortable generation of preprofessionals,” then, hey, go for it. The problem comes in when he supports this claim with bad data.

Just to be clear, I’m not trying to slam Brooks here. I have a beef with Brooks because I think he can do better. I think he’s right that overreliance on statistics can mislead, and I think he could make this point even better by recognizing how this problem has affected his own work.

As the great Bill James once said, the alternative to “good statistics” is not “no statistics,” it’s “bad statistics.”

P.S. I added additional sentences to the inline Brooks quote above in order to provide more context, to clarify that Brooks was presenting the numbers as coming from a particular outside source. It was not right for me to say he was presenting these numbers “unquestioningly,” as he does express some concerns. Brooks expressed some potential criticism of Unz’s conclusions but not of Unz’s numbers. The reason I still think a New York Times correction is in order is that the numbers appear to me to be presented as facts rather than as Unz’s claims. In any case, now that this 2.5% has been refuted, I think it makes sense to correct it. And, as noted above, I think such a correction is in keeping with Brooks’s larger message, which I support, that numbers can be misleading when we don’t know where they are coming from.

[Cross-posted at The Monkey Cage]

Andrew Gelman

Andrew Gelman is a professor of statistics and political science and director of the Applied Statistics Center at Columbia University.