Powden, then Sen. Jim Jeffords’s staff director on the Senate Health, Education, Labor and Pensions Committee, was part of a bipartisan group of congressional staffers readying the Bush plan for a vote by the full Senate. The House and Senate committees had already signed off on the plan and its core proposal to test every third- through eighth-grader in the public schools and then sanction their schools (and school systems and states) if their test scores failed to improve. It was a plan on which George W. Bush had run for president—his single best claim to being a “compassionate conservative.” The public was demanding school reform, and it was badly needed.

But Powden, an 18-year Hill veteran, doubted the plan’s ability to gauge schools’ performance correctly—the linchpin of the Bush plan. So, as an experiment, he applied the Bush model retroactively to test scores in Connecticut, North Carolina, and Texas, three states that had improved their scores significantly in recent years. He discovered that the vast majority of the states’ schools—schools with established track records in raising student achievement—would be labeled failures under the Bush system. When Powden presented his findings to White House officials and his fellow Senate staffers at a late-night meeting at the Dirksen Senate Office Building, “there was stunned silence,” recalls a participant. Powden, they realized, had just turned the cornerstone of their school reform package into dust.

White House education advisor Sandy Kress, the testing plan’s author, scrambled to draft a new plan that the Senate passed in June. But independent testing experts who had read the plan’s fine print pointed out that the new plan was no less flawed than the original. Kress himself would later call the new plan “Rube Goldbergesque.”

The White House and the staff of a joint House and Senate committee struggled throughout the summer to make the testing plan work. Bush political advisor Karl Rove was pressing for a bipartisan Rose Garden signing ceremony in September. But by the time Congress returned to work after Labor Day, the bill’s authors still hadn’t fixed the plan. Instead, new flaws had emerged.

Then came the events of September 11. Suddenly, nearly the entire domestic agendas of both parties—from Social Security reform to prescription drug coverage—were shelved. A nation at war, which could no longer afford partisan squabbling, submitted instead to a hasty bipartisanship. But, in late September, with both parties convinced that holding educators more accountable for their students’ performance would strengthen public education, the president ordered a full-court press to get the bill passed by year’s end.

So, rather than openly debate the bill’s many defects, Congress is now under intense pressure to pass something—anything—fast. And the likely result will be legislation that hurts the nation’s students more than it helps them; promotes lower rather than higher standards; misleads the public about school performance; pushes top teachers out of schools where they are most needed; and drives down the level of instruction in many classrooms.

There is a way out of this mess, however. The White House and Congress need to take a deep breath, grab a fresh sheet of paper, and sketch out a new accountability plan built around the one element many insiders privately admit is missing: a national test of reading and math. Such a test would have been politically impossible to pass just weeks ago. But the political winds have shifted so dramatically that if the president were to seize the moment, he could very likely make it happen.

If he doesn’t, and Congress passes a big, badly designed federal testing and sanctions system, it could cripple the entire standards movement in public education—a movement that has been building momentum and garnering results at state and local levels since 1989. That year, Charlottesville, Virginia, hosted a summit between President George H. W. Bush and the nation’s governors, led by then-Arkansas Governor Bill Clinton. Bush and the governors bucked the nation’s long-standing trend of local control in public schooling and established as a national “goal” that students in grades four, eight, and 12 demonstrate “competency” in challenging subjects such as English and math. Since then, ratcheting up standards has become a cornerstone of school reform at the state level. And greater accountability has been a principal strategy for reformers pushing state and local educators toward the new, higher standards.

The revised Bush education reform bill rightly builds on this new paradigm. Measuring student performance and holding schools accountable for results is indeed a key piece of the school reform puzzle, especially for disadvantaged children. And the federal government, which spends $18 billion annually on the nation’s public schools, has a right to demand results for its investment.

Talk to education-testing experts and a pretty clear consensus emerges about what a strong national accountability system would look like. It would include tests that gauge students’ grasp of higher-level skills and knowledge, not just “the basics.” The tests would require students to write, solve problems, and perform experiments, rather than over-relying on multiple-choice questions that often give a false sense of students’ abilities. (To illustrate the point: In one recent study, 80 percent of a national sample of eighth-graders could correctly identify the product of 9 x 9 when supplied with several multiple-choice answers, but only 40 percent were able to answer a word problem asking them to calculate the square footage of a 9′ x 9′ room.) The tests’ content would be agreed upon nationally. And the tests themselves would be national: Each public school student nationwide would take the tests for his or her grade level yearly, permitting year-to-year tracking of individual students’ progress. Schools would be judged both by how many of their students achieve standards and by how much their students’ performance improves over time.

Had such tests been the basis of Bush’s plan, Kress and his congressional counterparts wouldn’t be wrestling, as they are now, with a Gordian knot of methodological problems. But when the Bush legislation was first introduced in January, everyone knew national tests were a political non-starter. They are strongly opposed by both the right, which adamantly believes that education policy is a state function, and the left, whose civil rights groups see testing as discriminatory against minorities, who historically underperform on such tests. Even the president has often expressed strong and seemingly sincere philosophical opposition to a national test.

As a result, the Bush plan lets the states choose their own tests and set achievement standards, a strategy that practically guarantees mismatched tests, low standards, and scant hope for real accountability. Most of the flaws in the Bush testing plan stem from the simple fact that Congress and the White House want the benefits of national tests without actually having to mandate them.

Consider this: Both the House and Senate versions of the bill require that states track the year-to-year performance of every school. But in an effort to avoid being seen as dictating testing standards from Washington, the legislation also allows states to continue the common practice of giving different kinds of tests to students at different grade levels. Yet if scores from these different tests can’t be meaningfully compared to each other—and usually they can’t—then how can you accurately track a school’s year-to-year performance?

Similarly, there’s lots of encouraging language in both bills demanding that tests be aligned with “rigorous” state standards. But only the House version explicitly requires tests to measure student progress against state standards. The problem is that more than half the states that already test as much as the Bush plan requires use off-the-shelf brand tests, such as the Stanford 9, which aren’t based on defined academic standards. Instead, they compare student scores to a national sample of other students who take the test. Those other students might be well-educated or ignorant; they might know a lot of math, or very little—the tests don’t tell you one way or another. However, the incentive to administer such tests is “tremendous,” says Jennifer Vanick of Achieve Inc., a nonprofit created by governors and corporate leaders to push standards-based reform. The reason? They’re cheap. Tests like the Stanford 9 cost about $6 a student per test compared to upwards of $30 for high-quality tests that require writing and other tasks that push students beyond basic skills. But constructing a single national set of high-quality tests would be the least expensive alternative.

Tests like the Stanford 9 will encourage teachers to spend more classroom time teaching students how to score better. Teachers will “teach to the test” under standards-based accountability regimes, too, but with a better end result: They’ll help students score well on standards-based tests not by showing a few test-taking tricks but by helping them master the curriculum’s content.

Of course, the great question about standards is how rigorous they should be. Set them too high, and many good schools and students will be marked down as failures. Set them too low, and many under-performers will be allowed to skate by. Having the right standard level will be especially important under the Bush plan, because schools that fail to meet the standard could have their students sent to other public schools, their teachers reassigned or fired, or their doors closed.

So how do the House and Senate bills deal with standards? By punting the decision to the states. States would have the freedom to set the passing grades on the tests used to parcel out the plan’s rewards and sanctions. This is like having suspected criminals serve as their own judges and juries. What state facing federal sanctions is going to set standards high and increase its chances of getting whacked? “Inevitably, you would reward states with lower standards,” says Chester E. Finn, Jr., president of the Thomas B. Fordham Foundation, a conservative education think tank. What’s more, the Bush plan would encourage states that already have high standards to lower them. “The incentives,” says Finn, “are perverse.”

Another flaw in the Bush plan is its dependence upon the U.S. Department of Education to police the states’ testing efforts. Not long ago Republican administrations wanted to wipe out the department. But the Bush testing plan would require the department’s bureaucrats to ensure, among a host of other things, that states build strong tests and set high standards—something they’ve been trying to do with only modest success for years. Congress passed an accountability system far less sweeping than Bush’s in 1994. But since then only 17 states have put the required tests in place.

Many states, says Michael Cohen, an assistant secretary for elementary and secondary education in the Clinton administration, have largely ignored the prodding of federal bureaucrats, who lack the power to respond to the snubs. But the House and Senate testing plans give the education department authority to deliver only wrist slaps to recalcitrant states by reducing the federal money they receive to administer the testing requirements. “Who’s going to care if you cut the number of state bureaucrats?” asks Cohen.

Under a single national testing system, of course, it wouldn’t be necessary to cast the Department of Education in a role it’s not prepared to perform. Instead, a national test would be relatively easy to regulate. Out-and-out cheating would have to be policed, as it is with other national tests like the SAT. But states wouldn’t be able to flout federal regulations largely invisible to the press and the public; if they refused to give high-profile national tests, there’d be hell to pay in the court of public opinion.

The White House realizes that the states aren’t necessarily going to do the right thing under the Bush plan. And so, at the urging of the president’s advisors, the Senate bill would require that every year states give a second test—the National Assessment of Educational Progress (NAEP)—to samples of their students to deter states from introducing watered-down tests and standards.

In theory, this is a clever idea. Congress established NAEP in the late 1960s to identify national trends in student achievement in core subjects. It’s a strong test. It has high standards and isn’t overly dependent on multiple-choice questions. Requiring states to give students NAEP every year is a big psychological step toward national testing—especially because the idea was proposed by a conservative Republican president with the backing of leading Senate conservatives such as Judd Gregg and Jesse Helms. (Many House conservatives hate NAEP’s presence in the Bush testing plan. They demanded and won language in the House testing plan that lets states use tests other than NAEP, another big potential loophole.)

Unfortunately, the NAEP ploy won’t work, because the test can’t do what the White House wants it to do. It isn’t an exact enough measure of student achievement to make it meaningful in year-to-year calculations of schools’ eligibility for rewards and sanctions. “It’s simply not designed to be very precise,” explains Mark Musick, chairman of NAEP’s governing board. Finn, who sat on NAEP’s board in the late ’80s and ’90s, goes so far as to say that yearly NAEP scores are likely to contribute “exactly nothing” to the Bush accountability plan.

Yet another big defect of the legislation concerns the precise way in which student progress will be measured. Both the House and the Senate would require states to increase every year the percentage of students achieving “proficient”—passing—scores on their tests. Under this system of Adequate Yearly Progress (AYP), schools would have to get 100 percent of their third- through eighth-graders to the proficient level within 10 or 12 years. Rewards and sanctions would kick in when schools, school systems, and states hit or miss their AYP targets along the way.
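The arithmetic behind such targets can be sketched in a few lines of Python. The bills' exact formula is not spelled out here, so the sketch assumes the simplest case: equal annual steps from a school's starting proficiency rate to 100 percent. The function names, the 40 percent starting point, and the sample results are all hypothetical, chosen only for illustration.

```python
def ayp_targets(baseline_pct, years):
    """Assumed rule: equal annual steps from the baseline rate to 100 percent."""
    step = (100.0 - baseline_pct) / years
    return [round(baseline_pct + step * y, 1) for y in range(1, years + 1)]


def first_missed_year(actual_pcts, targets):
    """Return the first year (1-based) in which a school falls short, or None."""
    for year, (actual, target) in enumerate(zip(actual_pcts, targets), start=1):
        if actual < target:
            return year
    return None


# Hypothetical school: 40 percent proficient today, 12 years to reach 100.
targets = ayp_targets(40.0, 12)

# Hypothetical results: the school improves every single year.
results = [46.0, 52.0, 54.0, 61.0]
missed = first_missed_year(results, targets)
```

Under these assumptions the school must add five points every year for 12 years. In the made-up results above it gains ground each year, yet it still misses its third-year target of 55 percent by a single point.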

The goal is a good one: a simple-to-grasp performance yardstick of the nation’s vast public school system that requires real progress every year. To ensure that no major group of students is left behind, the plans take the further step of requiring that scores be broken down by subgroups of students: African Americans, Hispanics, females, the disabled, and those from low-income homes. Schools would have to make the same level of test-score progress in each category, or face sanctions.

But there’s a serious flaw in the strategy: Test scores, even at the best schools, never rise year in, year out, as the House and Senate plans require. Rather, they fluctuate like stock prices, revealing trends only over the longer term. As a result, under either bill, vast numbers of schools and school systems would be rewarded or punished wrongly. This is why Mark Powden found that so many schools would be misidentified as failing under the Bush plan. And if Powden’s congressional colleagues were stunned by those flawed results, imagine how parents and educators will react when they discover their schools were wrongly labeled as bad.

Test scores bounce around year to year because of a host of factors that have little or nothing to do with school quality, such as high student turnover, new teachers, or a bad flu season. Last spring, researchers Tom Kane at UCLA and Douglas Staiger at Dartmouth reported that 70 percent of the year-to-year change in average test scores in North Carolina’s elementary schools is caused by such external factors rather than actual change in student performance. Richard Hill, executive director of the Center for Assessment, a nonprofit that advises states on testing systems, argues that “any system that relies on a single year’s growth to measure school performance is driven largely by random error.”
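A toy simulation illustrates the point. It assumes a hypothetical school whose true performance rises by a steady amount each year, while each year's reported average is nudged up or down by random noise. The starting score, the size of the gain, and the noise level are invented for illustration and are not drawn from the Kane and Staiger data.

```python
import random


def years_without_gain(true_gain, noise_sd, years, seed=0):
    """Count the years in which the reported score fails to rise,
    even though the school's true score rises by true_gain every year."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    true_score = 50.0          # hypothetical starting average
    observed = [true_score + rng.gauss(0, noise_sd)]
    for _ in range(years):
        true_score += true_gain
        observed.append(true_score + rng.gauss(0, noise_sd))
    # Compare each reported score with the one from the year before.
    return sum(1 for prev, cur in zip(observed, observed[1:]) if cur <= prev)


# With no noise, a steadily improving school shows a gain every year.
print(years_without_gain(true_gain=1.0, noise_sd=0.0, years=10))

# With noise larger than the annual gain, some years can show no gain at all.
print(years_without_gain(true_gain=1.0, noise_sd=3.0, years=10))
```

The second figure depends on the random draw, which is the point: a school doing everything right can still post a "down" year.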

To make matters worse, year-to-year scores become even more unreliable when they are broken down by subgroups of students. That’s because the influence of external forces increases when sample size shrinks. Kane, Staiger, and colleague Jeffrey Geppert released an analysis in July revealing that random fluctuations in schools’ scores would cause 89 percent of North Carolina’s elementary schools to fail the House and Senate AYP standards in math. When the researchers included the congressional demands for subgroup score increases, 98 percent of North Carolina’s elementary schools failed the Senate’s AYP standard, and everyone failed the House’s.
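The sample-size effect follows from basic statistics, and a rough sketch makes it concrete. The code below treats each student's result as an independent pass-or-fail draw, which captures sampling noise only and is an assumption, not the researchers' model. It also assumes, for simplicity, that subgroups hit or miss their targets independently of one another. All of the numbers are hypothetical.

```python
import math


def standard_error_pct(pass_rate, n_students):
    """Standard error, in percentage points, of a passing rate measured
    on n_students (simple binomial sampling noise)."""
    return 100 * math.sqrt(pass_rate * (1 - pass_rate) / n_students)


def chance_all_pass(chance_each, n_groups):
    """Chance that every subgroup hits its target, assuming independence."""
    return chance_each ** n_groups


whole_school = standard_error_pct(0.5, 400)  # a school of 400 students
small_group = standard_error_pct(0.5, 25)    # a subgroup of 25 students
all_five = chance_all_pass(0.8, 5)           # five subgroups, 80 percent each
```

With a true passing rate of 50 percent, the school of 400 has a standard error of about 2.5 percentage points, while the subgroup of 25 has one of about 10. And if each of five subgroups had an 80 percent chance of hitting its target, the chance that all five would do so is only about 33 percent.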

The Bush plan for rewards and sanctions puts schools with large percentages of impoverished students at a particular disadvantage. Bush is right to encourage high standards for every student. Yet nearly four decades of research dating back to the landmark Coleman Report makes plain that such students, on average, don’t score as high as affluent students, regardless of how expertly they’re taught. What encourages the best teachers and principals in those schools to get out of bed in the morning is not some vision of perfection, but the belief that with hard work and determination they can make their schools better over time. It doesn’t make sense, as a result, for the Bush plan to sanction teachers making strong progress with disadvantaged students merely because their students don’t meet a state’s proficiency standard. Doing so would only drive good teachers out of bad schools.

There is, however, a way to fix the problem. The solution is to build into the Bush plan a system for tracking individual students’ achievement over several years and rewarding schools that produce higher-than-expected test scores given their students’ backgrounds. This so-called “value-added” strategy would require a large computer infrastructure to track students’ individual progress. But it would produce fairer and more exacting judgements of school quality, according to testing experts. To ensure that students aren’t subjected to what Bush calls the soft bigotry of low expectations, the Bush plan should use both value-added and absolute standards to judge schools. (For that matter, it should also use such non-testing indicators of school success as student and teacher attendance rates.)
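A minimal sketch shows the idea behind a value-added calculation. It is not the Dallas system or any state's actual model. It assumes a student's prior-year score is the only background factor, fits a straight line predicting this year's score from last year's, and credits each school with the average amount by which its students beat or miss that prediction. The schools, scores, and function names are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares line: returns (slope, intercept)."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx


def value_added(students):
    """students: list of (school, prior_score, current_score) records.
    Returns each school's average gap between actual and expected scores."""
    slope, intercept = fit_line([s[1] for s in students],
                                [s[2] for s in students])
    gaps = {}
    for school, prior, current in students:
        expected = intercept + slope * prior
        gaps.setdefault(school, []).append(current - expected)
    return {school: mean(g) for school, g in gaps.items()}


# Hypothetical data: School B posts higher scores, School A bigger gains.
records = [("A", 40, 50), ("A", 50, 60), ("B", 70, 72), ("B", 80, 82)]
scores = value_added(records)
```

On this made-up data School B has the higher absolute scores, but School A comes out ahead on value added, because its students improved more than their starting points predicted. A real system would need far more students and more background factors than this sketch uses.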

Sandy Kress, the president’s primary education advisor and author of the Bush accountability plan, knows the strengths of value-added school evaluations. It was a short-lived, value-added pilot project in the Dallas public schools in the mid-1980s—one of the nation’s first—which attracted Kress to rewards-based accountability in the first place. He discovered the then-defunct plan when he headed a Dallas school-reform panel in the early 1990s and resurrected it, making the evaluation system (which Dallas still uses) one of his commission’s key recommendations. “I’m very high on the value-added approach,” he told me recently. “It’s a lot fairer to schools” than the Texas state system, which, like the Bush plan, uses the single-point-in-time, no-accounting-for-family-background system of evaluating schools.

So why isn’t Kress pushing a value-added strategy on Capitol Hill? Because introducing such a strategy nationwide would amount to imposing a quasi-national testing system. Value-added evaluations require that students at every grade level take the same sorts of tests in every subject every year. “The country,” he says, “isn’t ready for that.” Congress, he argues, would refuse to dictate such testing requirements to states.

Taken together, the unintended consequences of the Bush testing plan are daunting, and its flaws could jeopardize the plan’s laudable goals, as well as those of the standards movement as a whole. In short, in its present form on Capitol Hill, the Bush plan is a gift to anti-testing, anti-accountability advocates.

The White House realizes this and has sought changes in the plan. Kress especially wants a system that identifies “only the absolute worst schools” for sanctions—the bottom 10 percent or so. He’s talking up a model with congressional staffers based on a system used in Texas, where schools were shielded from sanctions as long as a required percentage of their students in each subgroup scored at the proficient level each year, a bar that would be raised gradually, but not necessarily every year. Kress and others want to reduce the influence of such things as student mobility on AYP calculations by having states average two or three years’ worth of schools’ test scores.

But Kress’s 10 percent solution doesn’t address the Bush plan’s likely mislabeling of many schools and the damage that would do to its credibility. And averaging several years of scores doesn’t do much to push states away from easy tests and low expectations. What’s more, after heated protests by governors, Kress and Congress are also talking about giving states greater flexibility under the plan, which would only make matters worse.

The only true solution is national testing. It would not have been possible before, but it may be now, for several reasons. First, Congress and the White House have peeked into the abyss; they know they’re on the brink of creating an immensely complex and flawed testing system. Second, more and more conservative leaders acknowledge, at least privately, that the accountability movement is pushing the country towards national testing. Third, since September 11, the public’s stance toward national initiatives has changed profoundly. According to a Washington Post poll, 60 percent of the public now trusts the federal government to do the right thing, up from 30 percent in April 2000.

Consequently, Bush suddenly has the standing to push national testing. If he were to say to the Congress, “Take six months and build a credible national testing system,” it might well happen.

Were he to do so, he wouldn’t be in unknown territory. Back in the early ’90s, Finn, Secretary of Education Lamar Alexander, and other administration officials persuaded George W.’s father to take the unprecedented step of making national testing part of his school-reform agenda. States and localities weren’t moving on the Charlottesville goals, Finn recalls, and he and the others wanted to push them to act. They pushed for voluntary national tests. A decade later, it’s clear such tests should be mandatory.

Thomas Toch is a guest scholar at the Brookings Institution’s Brown Center on Education Policy.