One year in advance of the 2012 election, New York Times blogger Nate Silver published a presidential forecasting model. The model includes measures of presidential approval and economic performance — standard variables in election forecasting models — as well as a novel measure of challenger ideology that appears to have substantial effects. Based on this model, Silver estimates that “The difference between [Mitt] Romney and [Rick] Perry amounts to about 4 percentage points” — a huge predicted effect that could easily swing the outcome of the election. Consider, for instance, Seth Masket’s graphic illustrating how the predicted probability of a Republican win depends heavily on the estimated ideology of the GOP candidate:

Though candidate positioning is likely to influence presidential election outcomes, there are important reasons to question whether the challenger ideology effect Silver identifies is as powerful as his model suggests.

First, when the economy is growing and presidential approval is high, strong moderate candidates may be scared off from entering the race, leaving only ideologues. A similar effect has been shown when one party has held the presidency for a long period of time. When this happens, the opposition tends to perform better due to the perception that it is “time for a change,” and opposition parties are likely to nominate more moderate candidates in the hopes of regaining control of the White House at the expense of ideological purity. In both cases, challenger ideology is correlated with conditions that independently affect the outcome, so part of its apparent effect may simply reflect those conditions.

Second, the estimates of challenger ideology that Silver uses are primarily drawn from voter perceptions of the candidates. However, these perceptions are driven by the content of the campaign, which is itself shaped by the economic context. Candidates who appear extreme in one era may seem less so in the next (consider the changing perceptions of Ronald Reagan between 1976 and 1980, for instance). For all of these reasons, Silver’s estimates of the effects of challenger ideology on election outcomes are likely to be significantly exaggerated.

In addition, as we demonstrate below, Silver’s model does not substantially improve the accuracy of presidential election forecasts, which casts further doubt on the importance of candidate ideology (see also Alan Abramowitz).

Silver’s model includes three predictor variables: presidential approval one year in advance of the election, election-year GDP growth, and an estimate of challenger extremism (i.e., the extremism of the candidate of the party that doesn’t control the presidency at the time of the election). Using just three variables to predict the outcome of a presidential election may seem simplistic, but in forecasting, simplicity is a virtue. With only 17 elections since 1944 to work with, including many indicators in a statistical model is likely to identify factors that are highly correlated with the election results we have already observed but that do a poor job of predicting the future.

For related reasons, Silver criticizes other forecasting models that use relatively obscure economic variables such as growth in real per-capita disposable income:

The government tracks literally 39,000 economic indicators each year…. When you have this much data to sort through but only 17 elections since 1944 to test them upon, some indicators will perform superficially better based on chance alone…. The advantage of looking at G.D.P. is that it represents the broadest overall evaluation of economic activity in the United States.

As Silver notes, there are legitimate reasons to worry that the search for statistically significant predictors will result in identifying indicators that perform well by “chance alone” (an extreme example: Washington Redskins home wins). Using such indicators can cause us to be overconfident in our statistical models (what statisticians call overfitting the data) and tends to make accurately predicting future events — like next year’s election — very difficult or impossible.
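The “chance alone” problem can be illustrated with a small simulation. The sketch below is not drawn from Silver’s data or from any real indicator: it generates pure-noise “indicators” and a pure-noise “outcome” for 17 hypothetical elections, picks the indicator that correlates best with the past, and then checks how that same indicator fares against fresh noise. The function names, the default of 1,000 indicators (chosen only to keep the run short), and the seed are all illustrative assumptions.

```python
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def best_chance_indicator(n_elections=17, n_indicators=1000, seed=0):
    """Search pure-noise indicators for the one that best "explains" the past.

    Returns (in_sample_r, out_of_sample_r) for the winning indicator.
    Every series here is random noise, so any correlation is spurious.
    """
    rng = random.Random(seed)
    past = [rng.gauss(0, 1) for _ in range(n_elections)]
    future = [rng.gauss(0, 1) for _ in range(n_elections)]
    best_r, best_future_r = 0.0, 0.0
    for _ in range(n_indicators):
        ind_past = [rng.gauss(0, 1) for _ in range(n_elections)]
        ind_future = [rng.gauss(0, 1) for _ in range(n_elections)]
        r = pearson(ind_past, past)
        if abs(r) > abs(best_r):
            # Keep the indicator that looks best on the elections already seen.
            best_r = r
            best_future_r = pearson(ind_future, future)
    return best_r, best_future_r


if __name__ == "__main__":
    in_sample, out_of_sample = best_chance_indicator()
    print(f"best in-sample correlation:  {in_sample:+.2f}")
    print(f"same indicator, fresh data:  {out_of_sample:+.2f}")
```

With so few elections, searching through many candidates will almost always turn up an indicator that looks strongly related to past results even though, by construction, it has no real relationship to the outcome.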

As you might expect, scholars have spilled a lot of ink debating the best forecasting indicators for outcomes ranging from the paths of hurricanes to stock prices. But rather than have a philosophical debate, we can evaluate this concern empirically to determine the extent to which specific forecasting models can successfully predict election outcomes beyond the range of the data used to estimate them. In particular, if models are spuriously identifying chance relationships, then they should perform relatively poorly after the point at which they were first published.

To do so, we began with Silver’s source data, which was compiled from the New York Times website and generously shared with us by Harry Enten.* Using a standard linear regression model, we almost precisely replicated the coefficients in the JavaScript code for the interactive calculator on the Times website.
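For readers unfamiliar with this step, a three-predictor linear regression can be sketched in a few lines. This is a minimal illustration and not the Stata code used for the analysis: the function names are hypothetical, the solver assumes the predictors are not perfectly collinear, and the example rows are synthetic placeholders generated from a known linear rule (so the fit can be checked), not Silver’s data.

```python
def fit_ols(rows, outcomes):
    """Least-squares coefficients [intercept, b1, ..., bk] via the normal equations.

    rows: one tuple of predictor values per election.
    outcomes: incumbent-party vote share for each election.
    Assumes the predictors are not perfectly collinear.
    """
    X = [[1.0] + [float(v) for v in row] for row in rows]
    k = len(X[0])
    # Augmented system (X'X | X'y).
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(k)]
        + [sum(x[i] * y for x, y in zip(X, outcomes))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def predict(coefs, row):
    """Predicted vote share for one election's predictor values."""
    return coefs[0] + sum(b * x for b, x in zip(coefs[1:], row))


if __name__ == "__main__":
    # Synthetic placeholders: (approval, GDP growth, challenger extremism).
    rows = [(50, 2.0, 30), (40, -1.0, 60), (60, 4.0, 20),
            (45, 3.0, 50), (55, 0.5, 40), (35, 1.5, 70)]
    # Outcomes built from a made-up rule purely to exercise the code.
    outcomes = [40 + 0.2 * a + 1.0 * g - 0.1 * e for a, g, e in rows]
    coefs = fit_ols(rows, outcomes)
    print("coefficients:", [round(c, 3) for c in coefs])
    print("prediction for a held-out row:", round(predict(coefs, (48, 1.0, 45)), 2))
```

Fitting on an early block of elections and calling `predict` on later ones is the basic shape of the out-of-sample comparisons that follow.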

As a starting point for evaluating Silver’s model, we first compare it with Douglas Hibbs’s “Bread and Peace” model, which uses the real per-capita disposable income variable described above. Silver has previously criticized the Hibbs model as performing relatively poorly outside the range of the years in which the model was first estimated (1952-1988). Here is the key graphic in question:

However, when we estimate both models using the same data as Silver’s graph above (1952-1988) and predict the outcome for the 1992-2008 period in terms of the share of the two-party vote received by the party of the president (the standard outcome variable in the literature), we find that Hibbs’s model generally performs better than Silver’s (Stata data and do files available upon request):

Of course, these are not the only available models for comparison. Indeed, political scientists and economists have estimated dozens of other presidential forecasting models over the past twenty years. For example, PS: Political Science and Politics published a pre-election symposium in 2008 that included presidential election forecasts from numerous scholars (see also here, here, here, or here). Most such models make predictions based on economic conditions and/or public opinion, but they typically do not include a measure of candidate ideology.

While it is fun to compare the performance of these forecasts, we should be clear that there is no one “correct” model. Rather than relying on a single model, we can instead combine the forecasts of numerous models using a technique called ensemble Bayesian model averaging, which creates a composite forecast weighted by the predictive performance of the component models. This approach was developed for combining weather forecasting models (see here) and has been applied to political outcomes in a paper (PDF) by Montgomery, Ward, and Hollenbach.
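The core idea of performance-weighted averaging can be sketched as follows. This is a simplified illustration, not the implementation used by Montgomery, Ward, and Hollenbach: it assumes each model’s forecast is the center of a normal predictive density with one shared variance, estimates the weights with a basic EM loop, and omits the bias corrections, missing-forecast handling, and numerical safeguards a real implementation would need. The function names and the idea of passing forecasts as a dictionary are assumptions made for the example.

```python
import math


def normal_pdf(y, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((y - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def ebma_weights(forecasts, outcomes, n_iter=200):
    """Estimate ensemble weights from past forecasts with a simple EM loop.

    forecasts: dict mapping model name -> list of past forecasts.
    outcomes: list of realized values, aligned with the forecasts.
    Returns (weights, shared_variance); the weights sum to one.
    """
    names = list(forecasts)
    T = len(outcomes)
    weights = {m: 1.0 / len(names) for m in names}
    # Start from the pooled mean squared forecast error.
    var = sum(
        (forecasts[m][t] - outcomes[t]) ** 2 for m in names for t in range(T)
    ) / (len(names) * T)
    for _ in range(n_iter):
        # E-step: how plausibly did each model "generate" each outcome?
        resp = {m: [] for m in names}
        for t in range(T):
            dens = {
                m: weights[m] * normal_pdf(outcomes[t], forecasts[m][t], var)
                for m in names
            }
            total = sum(dens.values())
            for m in names:
                resp[m].append(dens[m] / total)
        # M-step: update the weights and the shared variance.
        weights = {m: sum(resp[m]) / T for m in names}
        var = sum(
            resp[m][t] * (forecasts[m][t] - outcomes[t]) ** 2
            for m in names
            for t in range(T)
        ) / T
    return weights, var


def ensemble_forecast(weights, new_forecasts):
    """Weighted average of the component models' forecasts for a new election."""
    return sum(weights[m] * new_forecasts[m] for m in weights)
```

Models whose past forecasts landed closer to the realized outcomes receive more weight, so the composite forecast leans on the component models with the better track records.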

The figure below, which uses the methodology described by MWH to create Figure 3 in their paper, compares one-step-ahead predictions from Silver’s model, the most recent versions of six prominent models in the literature (Campbell’s “Trial-Heat and Economy Model,” Abramowitz’s “Time-for-Change Model,” Hibbs’s “Bread and Peace Model,” Fair’s presidential vote-share model, Lewis-Beck and Tien’s “Jobs Model Forecast,” and Erikson and Wlezien’s “Leading Economic Indicators and Poll” forecast**), and a composite forecast created using the ensemble technique. The forecast of each model is plotted with its 67% and 90% confidence intervals against the eventual outcome, which is represented with a dotted line:

Silver’s model performs well in some elections, but it is very inaccurate in comparison to its peers in 1992 and 2008. With those exceptions, it does not appear to differ from other models dramatically, though its overall performance is worse on average than the comparison models. The ensemble forecast appears to perform quite well, producing predictions that are relatively close to the actual outcome.

The figure below, which is adapted from Table 4 in MWH, compares the accuracy of the models more precisely using mean absolute error, an intuitive (though imperfect) metric for comparing forecasting models that measures the average amount by which they mispredict the final outcome.

We can see that all of the models are relatively accurate on average. They mispredict the vote share for the incumbent party by an average of 1.7 to 3.4 percentage points — an impressive record given that most models include only two or three variables. By this metric, Silver’s forecasts are the least accurate in the group and the ensemble forecast is the most accurate. (See MWH for a discussion of the extent to which these models appropriately express uncertainty about their predictions.) Since some of these models — and implicitly the ensemble model that relies on them — have been around for twenty years, this result should not be especially surprising. The literature on presidential forecasting is relatively mature.
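The error metric and the one-step-ahead procedure are simple enough to sketch directly. The code below is an illustration under stated assumptions, not the MWH replication code: `one_step_ahead_forecasts` refits a model on all earlier elections before predicting each new one, and `fit_mean` / `predict_mean` are a deliberately naive stand-in model (predict the average of past outcomes) used only to show how the pieces fit together. All names are hypothetical.

```python
def mean_absolute_error(forecasts, outcomes):
    """Average absolute gap between forecasts and realized outcomes."""
    return sum(abs(f - y) for f, y in zip(forecasts, outcomes)) / len(outcomes)


def one_step_ahead_forecasts(history, fit, predict, min_train=3):
    """Forecast each election using only the elections that preceded it.

    history: chronological list of (features, outcome) pairs.
    fit: function taking a list of past (features, outcome) pairs.
    predict: function taking (fitted_model, features).
    Returns (forecasts, outcomes) for every election after the first min_train.
    """
    forecasts, outcomes = [], []
    for t in range(min_train, len(history)):
        model = fit(history[:t])  # the model never sees election t or later
        features, outcome = history[t]
        forecasts.append(predict(model, features))
        outcomes.append(outcome)
    return forecasts, outcomes


def fit_mean(past):
    """Naive stand-in model: remember the average past outcome."""
    return sum(outcome for _, outcome in past) / len(past)


def predict_mean(model, features):
    """The naive model ignores the features and returns the stored average."""
    return model
```

Any forecasting model that can be expressed as a `fit` / `predict` pair can be dropped into the same loop and scored with the same metric, which is what makes the models comparable.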

At this point, we should note two important but wonky caveats. First, we follow MWH in using the most recent version of each of the forecasting models from political science. In some cases, model specifications may have been revised to account for previous results, which could artificially improve their performance in one-step-ahead prediction tasks (see footnote 14 in MWH). Second, these models differ in the extent to which we would even expect them to make accurate forecasts far in advance of a presidential election. For instance, Campbell’s model includes the approval rating on the Labor Day before the election. Silver’s model, on the other hand, takes on the more ambitious challenge of using approval data from a year before the election (though it relies on GDP growth during the election year, which is of course not available in advance).

Ultimately, almost every analyst agrees at this point that it is still too soon to say with much confidence whether President Obama will win in November. In particular, there is still too much uncertainty about the state of the economy next year. However, both theory and data suggest that the conservatism of his opponent is likely to matter less than Silver’s model suggests.

Jacob Montgomery contributed to this piece.

[Cross-posted at Brendan-Nyhan.com]