DIAPERS AND BEER….John Quiggin has a post today about a subject that relates to marketing, economics, and statistical analysis, all of which are favorite subjects of mine. What’s more, the context is one of my all-time favorite quandaries, so unless the intersection of these three topics strikes you as only slightly less tedious than filling out a 1040 form, read on. And yes, if you make it to the end I do have a point to make, one that perhaps John will respond to.

Here’s the background: one of the things that statisticians do is try to find correlations. A famous example from marketing is that people who buy diapers also tend to buy beer. One of the problems with correlation hunting, however, is that it is mostly based on surveying a small number of people and hoping that they represent the entire population. Unfortunately, every once in a while you’ll get a correlation by chance: your sample just happened to include a lot of alcoholics, for example.

Here’s where the fun stuff starts. We marketing folks just adore analyzing what people buy (loyalty programs at supermarkets combined with computerized bar code readers make this pretty easy), and we use this analysis to, um, better serve your needs. Basically, we take enormous masses of data and sift through it until we come up with some correlations. Bingo! People who buy diapers also buy beer! So let’s put a beer display on the diaper aisle.

As John points out, however, there’s a problem. If you take huge masses of data, you’re bound to find some correlations just by chance, so the whole enterprise seems like it’s built on straw. By the normal standards of statistical analysis, a test run at the usual 5% significance level will flag a correlation about 5% of the time even in random data, so if you look at an enormous data set with a thousand different pairs of variables you’ll find about 50 apparently significant correlations just by chance. So what’s the point?
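For anyone who wants to see the arithmetic in action, here is a rough sketch in Python. Everything in it is my own assumption for illustration: the function names are made up, the data is pure Gaussian noise, each pair gets 200 observations, and the significance test is the Fisher z approximation rather than whatever a real marketing shop would use.

```python
import math
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def is_significant(r, n, z_crit=1.96):
    """Approximate two-sided 5% test using the Fisher z transformation."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return abs(z) > z_crit


def count_chance_correlations(n_pairs=1000, n_obs=200, seed=0):
    """Count how many pairs of unrelated noise variables look correlated."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_pairs):
        # Both variables are independent noise, so any "hit" is pure chance.
        xs = [rng.gauss(0, 1) for _ in range(n_obs)]
        ys = [rng.gauss(0, 1) for _ in range(n_obs)]
        if is_significant(pearson_r(xs, ys), n_obs):
            hits += 1
    return hits


if __name__ == "__main__":
    print(count_chance_correlations())
```

With a thousand pairs the count should land somewhere in the neighborhood of 50, give or take, even though nothing in the data is related to anything else.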

Well, first off there are some pretty sophisticated statistical tricks you can do with the data to make it more reliable. And, as John will no doubt be jealous to hear, he’s right: we marketing folks have pretty sizable budgets and can afford to run multiple surveys (or buy new data sets) if we find something that looks interesting.

But even aside from this, there’s a more fundamental question at hand, and it’s the point of this whole essay: is a correlation deduced from a huge multivariate analysis really less reliable than one deduced from a focused study? The argument seems to me to be this: if you have a hypothesis and test it, and you find a correlation, that’s good. But if you don’t have a hypothesis, and you find a correlation, then it’s probably just by chance.

But it’s not. The numbers don’t care whether you have a hypothesis or not, and in both cases there’s a 5% chance that the correlation is due to chance. In both cases you will have to reproduce the results independently if you want to increase your certainty.
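Here is a rough sketch of what independent reproduction buys you, again in Python and again under my own assumptions: hypothetical function names, pure noise data, 200 observations per sample, and a Fisher z approximation for the 5% test. Each pair that passes the first test gets retested on a fresh sample. With a thousand pairs of noise you would expect around 1000 x 0.05 x 0.05 = 2.5 flukes to survive both rounds.

```python
import math
import random


def noise_pair_is_significant(rng, n_obs, z_crit=1.96):
    """Draw two unrelated noise variables and test them at roughly 5%."""
    xs = [rng.gauss(0, 1) for _ in range(n_obs)]
    ys = [rng.gauss(0, 1) for _ in range(n_obs)]
    mean_x = sum(xs) / n_obs
    mean_y = sum(ys) / n_obs
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    r = cov / math.sqrt(var_x * var_y)
    # Fisher z transformation: approximately standard normal under no correlation.
    z = math.atanh(r) * math.sqrt(n_obs - 3)
    return abs(z) > z_crit


def replicate(n_pairs=1000, n_obs=200, seed=1):
    """Return (first-round chance hits, hits that also pass a fresh retest)."""
    rng = random.Random(seed)
    first_hits = 0
    survivors = 0
    for _ in range(n_pairs):
        if noise_pair_is_significant(rng, n_obs):
            first_hits += 1
            # Independent second sample for the same (unrelated) pair.
            if noise_pair_is_significant(rng, n_obs):
                survivors += 1
    return first_hits, survivors


if __name__ == "__main__":
    print(replicate())
```

Note that the code has no idea whether a pair was tested because of a hypothesis or because it fell out of a fishing expedition. The retest treats both the same way.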

Is this a trivial point? I don’t think so, because I think it points to a serious flaw in a lot of statistical analyses: the feeling that if you test a specific hypothesis and find a strong correlation, it’s probably real. Oh sure, you will make the usual disclaimers about 95% confidence intervals, but the reality is that the results get treated seriously.

I’m not sure they should be. Or rather, I’m not sure they should be treated any differently than the data mining techniques that produce masses of correlations. I suspect that the disillusionment among economists (and others) with data mining is real, but mostly because it punches you in the nose with the fact that correlations are often just artifacts of chance. The same is true of focused studies, but because these correlations back up a claim we wish to make, we mentally discount the possibility of random error.

This is wrong. Numbers are numbers, and no matter where they come from they should be treated with the same respect, or lack thereof. To suggest otherwise, I think, is merely to admit that your conclusions are based not just on the numbers themselves, but also on some previous belief, which is a Bayesian argument that we will leave for another day.

POSTSCRIPT: In case you’ve ever wondered, data mining is the real reason behind supermarket loyalty programs. Oh, loyalty is part of the reason too, but the real payoff is that (a) it produces mountains of data that supermarkets can use to sell their products more efficiently, and (b) there are many eager buyers for the huge, real-time data sets that supermarket loyalty programs produce. But don’t think about this too much. It will just scare you.