As Barack Obama enters his second term, some are already wishing it over. Not just the House Republicans, either, but also those who want to locate Obama’s place in The Annals of Presidential Greatness, deciding whether he is the 7th, 17th, or 37th best president of all time.

In a new 538 post on this topic, Nate Silver spends a lot of energy proving the unsurprising: that presidents who serve longer, and win reelection by larger margins, are better regarded by history – or at least by historians. I will forgive his use of Franklin Pierce (Bowdoin ’24!) as the poster child of loser one-term presidents. My beef is really with the larger enterprise of the rankings themselves, which I would argue fall into the irresistible-but-mostly-useless category of political science data.

Silver’s raw material is, as it really has to be, the iterated surveys of presidential performance taken by Arthur Schlesinger Sr. and Jr. (in 1948, 1962, and 1996), supplemented by others who sought to redress what they saw as the Democratic bias of the Schlesingers’ respondent samples (the Wall Street Journal and Federalist Society surveys, as well as Alvin Felzenberg’s more idiosyncratic take), and by the wider recent efforts of C-SPAN. (The top ten from Schlesinger’s ’96 survey are in the montage above.)

But a number of issues arise. It’s worth noting that presidential incumbents are, almost to a person, outliers – located on the far positive reaches of any scale measuring American political aptitude and skill. This creates a problematic bell curve when they are isolated into a single population. We might see the top and bottom as clearly differentiable, but is there any meaningful gap in performance between someone ranked #15 and someone ranked #25 in such a small set of observations?

Further, the rankings themselves change over time. Harry Truman is the classic case: a president widely unloved by the electorate at the close of his term but one whose stock has risen steadily since. (It doesn’t hurt to have David McCullough write a book about you.) Dwight Eisenhower, too, has seen his rankings rise over time – in his case, to match his extant public popularity – as a fuller internal record of his presidency became available. Richard Neustadt, for one, downplayed Eisenhower’s executive skills, but early judgments can mislead, even within a single individual’s tenure in office. Indeed, the assessment of the George W. Bush legacy immediately after his reelection, when observers lauded the Republican majority realignment apparently achieved, contrasts rather sharply with the picture four years later.

The deeper problems with rankings, though, lie in the nature of the enterprise itself. They are inherent in the difficulty in choosing the right standards for measurement and in assigning credit or blame to isolated individuals in the separated system of American governance. As Donald Rumsfeld famously briefed, “Stuff happens” – but the fact of its happening during a president’s term doesn’t mean the president made it happen. There is a wide range of governmental outputs, and outcomes, not all of them attributable to presidential action – even if he did in fact prefer that outcome. What outcomes has Barack Obama personally effected? Which of those should receive more weight in our retrospective assessment? Can we give credit for a good decision, even if it had a bad outcome?

And: is a decision or outcome “bad” if it conflicts with a later code of morality or with a latter-day judge’s ideological preferences? The institution of slavery and the treatment of Native Americans, for two, loom rather nastily over the earlier presidents. The comparative salience of even less normative issues, as abetted by most raters’ imperfect knowledge of American history – a quick read of past State of the Union messages suggests an array of basically-forgotten crises – also makes a difference for our accuracy in judging the past. Along these lines (cf. Stephen Skowronek’s work on “political time”) we must also recall that different presidents enter office under distinct political circumstances that expand or constrict the options available for presidential achievement. Bill Clinton moaned to his advisers that no one could be a great president without a national emergency, a thought channeled by Obama chief of staff Rahm Emanuel a decade later when he urged his boss to “never waste a crisis.”

How does all this get played into the mechanics of the rankings? The early surveys ranked presidents as “great,” “near-great,” etc., on the grounds that – like pornography – you know greatness when you see it. C-SPAN, in its 2009 rankings, asked sixty-five scholars of the presidency to grade the forty-two individuals to have served as president at that point on no fewer than ten “attributes of leadership.” These cut across a wide range of areas encompassing not only skills and policy arenas but perceptions: public persuasion, crisis leadership, economic management, moral authority, international relations, administrative skills, relations with Congress, vision/agenda setting, “pursuing equal justice for all,” and broad “performance within context of the times.” This makes some sense in that it invites us to consider the multiple skills required for success in the job and the multiple dimensions to any presidency. (How do we feel about personal behavior, for instance, as opposed to the content of that person’s public policy?) Providing a series of categories allows presidential raters to make some distinctions along these lines, allowing us to praise Lyndon Johnson’s commitment to voting rights while decrying his decision to escalate in Vietnam.

Yet when translated into scores, those assessments across a long list of categories are compacted, equally weighted, into a single score, which assumes that each category is of equal importance, both within a presidency and across time. Does “performance within context of the times” have the same value as “administrative skills” or “relations with Congress”? (Indeed, does the former subsume the others in any case?) Do “economic management” and “moral authority” count simply as two equivalent questions on a multi-part exam? Should a bad television presence cancel out a good nuclear crisis? It’s a mess.
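To see how much rides on the equal-weighting assumption, here is a minimal sketch in Python. Everything in it is hypothetical: the three presidents, their scores, and the alternative weights are invented for illustration and are not drawn from C-SPAN or any other survey. It shows only that the same category grades can yield a different ordering once one category is allowed to count for more.

```python
# Hypothetical scores (0-100) on three of the ten categories.
# The names and numbers are invented for illustration; they are not survey data.
CATEGORIES = ["crisis leadership", "public persuasion", "relations with Congress"]

SCORES = {
    "President A": [90, 40, 60],
    "President B": [60, 75, 70],
    "President C": [70, 60, 50],
}


def composite(scores, weights):
    """Weighted average of category scores (weights need not sum to 1)."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


def ranking(weights):
    """Names ordered from highest to lowest composite score."""
    return sorted(
        SCORES,
        key=lambda name: composite(SCORES[name], weights),
        reverse=True,
    )


# Equal weighting, as in the compacted single score described above.
equal = ranking([1, 1, 1])

# An assumed alternative: crisis leadership counts three times as much.
crisis_heavy = ranking([3, 1, 1])

print("equal weights:", equal)
print("crisis-heavy: ", crisis_heavy)
```

With these made-up numbers, President B leads under equal weights, while President A leads once crisis leadership is weighted more heavily. Nothing about the underlying assessments has changed; only the rater’s unstated view of what matters has.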

In the end, then, Silver is right to wish there were a historical equivalent of the VORP (value over replacement player) statistic in baseball. What we want to know is the “value added” of having a specific person in the Oval Office, controlling for context. We can put the question in intuitive terms without casting too far back in history: comparing Gore v. Bush and the aftermath of 9/11, certain parts of history would have gone along similar lines, but others would likely have diverged significantly. The variance represents (some of) the Bush difference. Still, we’re stuck with counterfactuals when slugging percentage and fielding range would be much more satisfying things to know.

All this should encourage us to seek nuance, to be as cautious about our causal claims and evidence as we would be in a more obviously empirical exercise. I’m not hopeful that the second-term punditry will toe that line. But in the meantime, Marc Landy and Sidney Milkis have a short and sweet assessment of greatness: whether the president transforms how Americans view their government, winning a “struggle for its constitutional soul.” You know that, presumably, when you feel it.

[Cross-posted at The Monkey Cage]