In the WSJ, Bill Gates writes about the power of data collection and analysis. He writes: “In the past year, I have been struck by how important measurement is to improving the human condition. You can achieve incredible progress if you set a clear goal and find a measure that will drive progress toward that goal …”

I have been told that before Gates dropped out of Harvard, he audited first-year Ph.D. econ classes there. If he takes his goal seriously, he will need to grapple with some frontier research questions. All micro econ students are taught the production function. In the typical boring example, a firm hires labor and capital to produce pizza. The professor stands at the board, writes pizza = f(L,K), where f(L,K) is the production function, and talks about the demand for inputs. What makes this boring is that the firm knows f(L,K) and inputs are homogeneous: all workers (L) and all robots (K) are identical.

But diversity and uncertainty are the spice of life and of modern economics! For a taste of this, read Sherwin Rosen’s AEA Presidential Address. What Gates’ article is really about is this: from observing some noisy measures of output and inputs, how do we infer what the production function f() is? Gates doesn’t care about “pizza production”. He cares about healthy kid formation and human capital formation, but on some level it is the same question. Consider the following example:

Suppose that we want to rank doctors with respect to their “value added” in saving patients’ lives. The research nerd observes whether a given patient survives, and observes some coarse observables such as age, zip code, and ethnicity, but not how sick the patient was on arrival. The researcher also observes which doctor was assigned to the patient. Suppose the researcher assumes that doctors are randomly assigned to patients while the truth is that the best doctors are assigned to the sickest patients. Note the asymmetry of information here. The hospital recognizes the diversity of patient types and doctor types but the naive statistician does not. Once the data nerd crunches the data, he will falsely conclude that the best doctors are the worst doctors, because on the performance criterion (dead patients) they will have the highest share. Explicitly addressing this nasty challenge of self-selection on unobserved attributes requires an economic model of how doctors and patients differ and how doctors are assigned to patients. What is the data generating process? If this interests you, you should follow the work of Jim Heckman. You can see that he is well cited.
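To make the doctor example concrete, here is a minimal simulation sketch in Python. Everything in it is an assumption chosen for illustration and not an estimate from any real hospital: there are just two doctor types, patient severity is a coin flip, the death probabilities in DEATH_PROB are made up, and the hospital sends 90 percent of the sick patients to the good doctor. The naive statistician sees only doctor and outcome; the hospital also sees severity.

```python
import random

# Hypothetical death probabilities (illustrative numbers only),
# keyed by (patient_is_sick, treated_by_good_doctor).
# Within each severity level the good doctor has the lower death rate.
DEATH_PROB = {
    (False, False): 0.05,
    (False, True): 0.02,
    (True, False): 0.50,
    (True, True): 0.30,
}


def simulate(n_patients=100_000, seed=0):
    """Simulate non-random assignment: sick patients mostly get the good doctor."""
    rng = random.Random(seed)
    records = []
    for _ in range(n_patients):
        sick = rng.random() < 0.5
        # Assumed assignment rule: the hospital sees severity and acts on it.
        good_doctor = rng.random() < (0.9 if sick else 0.1)
        died = rng.random() < DEATH_PROB[(sick, good_doctor)]
        records.append((sick, good_doctor, died))
    return records


def death_rate(records, good_doctor, sick=None):
    """Share of a doctor's patients who died, optionally within one severity level."""
    outcomes = [
        died
        for s, g, died in records
        if g == good_doctor and (sick is None or s == sick)
    ]
    return sum(outcomes) / len(outcomes)


records = simulate()

# What the naive statistician computes: severity is unobserved, so pool everyone.
naive_good = death_rate(records, good_doctor=True)
naive_avg = death_rate(records, good_doctor=False)
print(f"naive death rate, good doctor:    {naive_good:.3f}")
print(f"naive death rate, average doctor: {naive_avg:.3f}")

# What the hospital could compute: compare doctors among similar patients.
for sick in (False, True):
    label = "sick" if sick else "healthy"
    good = death_rate(records, good_doctor=True, sick=sick)
    avg = death_rate(records, good_doctor=False, sick=sick)
    print(f"{label} patients: good doctor {good:.3f}, average doctor {avg:.3f}")
```

Under these assumed numbers, the pooled comparison makes the good doctor look far more dangerous, while the comparison within each severity level reverses the ranking. Conditioning on severity only works here because the simulation records it; when severity is unobserved, as in the example above, you need a model of the assignment process instead.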

[Cross-posted at The Reality-based Community]