Nicholas Kristof wrote yesterday about the “landmark” new teacher value-added study from Raj Chetty, John Friedman and Jonah Rockoff. It’s worth being clear about why the study has garnered so much attention. It’s not because it shows that teachers matter. Everyone knew or believed that already. It’s because it shows that teachers vary in how much they matter. And, for the first time, it takes the value-added debate out of the standardized testing box.

Value-added studies have existed since the 1990s, and every one of them reaches the same conclusion: teacher effectiveness varies a lot, and much of the variance is uncorrelated with things like experience, credentials, and other characteristics that we commonly use to manage the teacher workforce. Teachers with high value-added scores in one year are also more likely to have high value-added scores in future years.

But all of that is premised on the standardized tests used to calculate the value-added scores. If you believe standardized tests are worthless or highly flawed or deeply inadequate or even troublingly limited in accuracy and scope–and many reasonable people believe these things–then you could dismiss or downplay value-added measures of teacher effectiveness, by definition. Some well-known observers have gone so far as to say “value-added assessment should not be used at all. Never.”

But now the CFR study says that teachers who are unusually good at helping students score high on standardized tests today aren’t just unusually good at helping students score high on standardized tests tomorrow. They also have an unusual effect on the likelihood of students going to college, going to a good college, earning a good living, living in a nice place, and saving for retirement. In other words, whatever the limitations of standardized tests may be, test-based value-added scores do, in fact, provide valuable information about the things most people care most about.

The study is also a testament to the value of information systems. It’s based on a dataset containing millions of observations of students, teachers and scores over twenty years. This could not have happened without the advent of education data systems–often criticized as spending “not in the classroom”–and public policies mandating annual testing in reading and math. That’s why the study is full of findings at a p<.01 level of significance, or better, i.e. “beyond a statistical doubt.” The paper also niftily gets at some technical but important concerns about value-added measures being potentially biased by unobservable factors and non-random classroom assignment. The authors found that “the bias due to selection on unobservables turns out to be negligible in practice.”

It’s also striking that the effects on college, earnings, and employment for people in their early-to-mid 20s all sprang from teachers they had had in grades 3-8, many years earlier. Great teachers really do work wonders.

This leaves skeptics of value-added measures in a tricky spot. The “it’s all just standardized tests” position has been badly weakened and some of the most compelling methodological concerns have been addressed. Matthew DiCarlo at Shankerblog provides a representative response:

What this paper shows – using an extremely detailed dataset and sophisticated, thoroughly-documented methods – is that teachers matter, perhaps in ways that some didn’t realize. What it does not show is how to measure and improve teacher quality, which are still open questions.

In other words, yes, value-added measures may in fact tell us important things. We just shouldn’t act on that information in any way that matters.

This goes right to the heart of the most controversial part of the paper, where the authors use their findings to estimate the impact of firing bad teachers. Here’s what they found:

Replacing a teacher in the bottom 5% with an average teacher generates earnings gains of $9,422 per student, or $267,000 for a class of average size…the gains from deselecting low quality teachers on the basis of very few years of data are much smaller than the maximum attainable gain of $267,000 because of the noise of VA estimates. With one year of data, the gains are about half as large ($135,000)…with three years of data, one can achieve more than 70% of the maximum impact ($190,000). Waiting for three more years would increase the gain by $30,000 but has an expected cost of 3 × $190,000 = $570,000. The marginal gains from obtaining one more year of data are outweighed by the expected cost of having a low VA teacher on the staff even after the first year…Because VA estimates are noisy, there could be substantial gains from using other signals of quality to complement VA estimates, such as principal evaluations or other subjective measures based on classroom evaluations.

In their bloodless economic language, the authors are describing an extremely difficult set of operational and policy choices concerning new teachers, one characterized by competing interests and imperfect information. It’s not a problem that can be solved, only managed as well as possible.
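
The arithmetic in the quoted passage is easy to check. What follows is a minimal sketch in Python, not the authors' model: the dollar figures are copied from the excerpt, while the function names and the reading of "cost of waiting" as three more years of forgone three-year gains are my own assumptions about how the quoted numbers fit together.

```python
# Dollar figures copied from the quoted excerpt; everything else is illustrative.
GAIN_PER_STUDENT = 9_422          # earnings gain per student
MAX_CLASS_GAIN = 267_000          # gain per class with a perfectly measured teacher
GAIN_ONE_YEAR = 135_000           # gain when deselecting on one year of VA data
GAIN_THREE_YEARS = 190_000        # gain when deselecting on three years of VA data
EXTRA_GAIN_FROM_WAITING = 30_000  # added gain from three more years of data
YEARS_WAITED = 3


def implied_class_size():
    """Class size implied by the per-student and per-class figures."""
    return MAX_CLASS_GAIN / GAIN_PER_STUDENT


def share_of_max(gain):
    """Fraction of the maximum attainable gain that a given policy captures."""
    return gain / MAX_CLASS_GAIN


def cost_of_waiting(years=YEARS_WAITED, gain_per_year=GAIN_THREE_YEARS):
    """Gain forgone while a low-VA teacher stays in the classroom (assumed reading)."""
    return years * gain_per_year


def net_benefit_of_waiting():
    """Extra accuracy gained by waiting, minus the cost borne by students meanwhile."""
    return EXTRA_GAIN_FROM_WAITING - cost_of_waiting()


if __name__ == "__main__":
    print(f"implied class size: {implied_class_size():.1f}")
    print(f"one year of data:   {share_of_max(GAIN_ONE_YEAR):.0%} of maximum")
    print(f"three years:        {share_of_max(GAIN_THREE_YEARS):.0%} of maximum")
    print(f"cost of waiting:    ${cost_of_waiting():,}")
    print(f"net of waiting:     ${net_benefit_of_waiting():,}")
```

On these numbers, waiting three more years for better data comes out roughly half a million dollars per class in the red, which is why the authors describe the marginal gain from more data as outweighed by its cost.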

We have a relatively open labor market for K-12 teachers in this country. If you can get a bachelor’s degree, you can probably get a teaching job somewhere. Teaching is a very complicated thing to do well, and it seems pretty clear that, like playing quarterback in the NFL, the extent of someone’s talents and aptitude for teaching can’t be fully or even substantially known until they actually get into the classroom.

So when people enter the classroom and do poorly, we have a complicated set of interests to weigh and decisions to make. Teachers have an interest in being treated fairly. Students have an interest in being taught well. If we fire under-performing teachers after a year, we’re more likely to make a mistake. So we can wait for more information–but at a cost to students in the form of bad instruction. The authors seem to suggest that the cost/benefit tipping point is around three years–but maybe not. $135,000 is, as they say, not nothing. And if VA estimates are supplemented by “other signals of quality,” as they surely will be, the accuracy of decision-making would presumably rise.

The least defensible response to the study, advanced by Sherman Dorn and others, is that the entire discussion of firing bad teachers represents an inappropriate extrapolation of the findings. (In Dorn’s term, “(s)extrapolation,” or “the research equivalent of the Kardashians.”) Academics complain all the time that policy is insufficiently informed by evidence, and as a general proposition that’s true. But these complaints are themselves often informed by a vague or naive view of how standards of evidence properly translate to policy choices. Students will be in school tomorrow. They can be taught by the teacher they had today, or they can be taught by someone else. That choice is unavoidable. And going with “the teacher they had today” is no less a choice than going with someone else. For CFR to conclude from their research that present policies ought to be more strongly weighted toward the possibility of going with someone else isn’t the academic equivalent of staging a fake wedding for Entertainment Tonight and pocketing the profits; it’s a case of academic researchers fulfilling their responsibility to make their findings meaningful on behalf of society.

Addendum: Craig Jerald tweets agreement with the caveat that the deselection policy frame is too narrow. That’s fair. But whereas “deselection” sounds like a mealy-mouthed euphemism for “handing someone their walking papers,” it can just as easily be interpreted more broadly. Non-selection is a kind of deselection. That can happen a lot of different ways, through admission standards at teacher prep programs or licensure policy or H.R. systems. If those processes can reduce the number of ineffective teachers, so much the better–as long as, crucially, they don’t also freeze out potentially effective teachers. The authors are mostly just trying to illustrate the magnitude of the choices being made–or not made.

[Cross-posted at The Quick & the Ed]