
When Caryn Voskuil first reported for duty at Washington, D.C.’s Charles Hart Middle School in 2009, she was a twenty-two-year-old first-time teacher from Green Bay, Wisconsin. With her head of long, fiery red hair, she looked so out of place in the predominantly black institution that one of her students asked, in apparent seriousness, whether she was a leprechaun. A struggling school in one of the capital’s poorest neighborhoods, Hart had a reputation for violence, mismanagement, and low test scores: in 2008, only 17 percent of its students could read at grade level. But in Voskuil’s three years at the school, Hart has made a modest turnaround: its test scores have begun inching up. For her part in that improvement, Voskuil—now a seasoned hand in the classroom—has been deemed a “highly effective teacher” and awarded a substantial bonus. What’s puzzling about Voskuil, and about American education today, is how conflicted she feels about what it means to be “highly effective.”

Consider two different weeks in the life of Voskuil’s classroom: early this March, she and her sixth-grade English students were making their way through a slim novel called Seedfolks. The book tells the story of a young Vietnamese immigrant girl in Cleveland who furtively plants six lima beans one day in a trash-strewn city lot. At first her neighbors spy on her with unalloyed suspicion. But once they realize what she’s doing—that she isn’t stashing drugs, money, or a weapon—they begin to feel responsible for keeping her seedlings alive. By the end of the book, the rat-infested lot has become a community garden.

“The value of the way I am teaching this week,” said Voskuil, “is that kids are falling in love with books. They’re making important connections between ideas and having this emotional connection with the narrators. They also get the satisfaction of starting and finishing an entire book, which is something many of them have never done.”

And yet whenever Voskuil spends time teaching a novel—or grammar and syntax, for that matter—she says she feels a vague twinge. “I have the sense that this is not a meaningful use of my time,” she says. “I have that sense because it’s not going to be on the test.”

The test in question—a standardized exam called the DC Comprehensive Assessment System (DC CAS), made up almost entirely of multiple-choice questions—was coming up in early April. And so the week after she taught Seedfolks, Voskuil’s teaching style underwent a dramatic change. She began by physically rearranging the classroom: the students who had scored just below the “proficient” level on earlier tests were brought front and center; the students who had scored “below basic” were clustered in a small group near the window; and the few students who were already on record as proficient lined the perimeter, their desks facing the wall so they could work quietly on their own. Like a politician who spends all of her time in swing states, Voskuil wanted to concentrate on the students who were on the cusp.

Working within a tight agenda, with five-minute intervals marked by a stopwatch, Voskuil began drilling the group. Together, the students read aloud an eleven-paragraph text called “Penguins Are Funny Birds.” Then they answered multiple-choice questions such as “According to the article, how do penguins ‘fly through the water’? A.) They use their flippers to swim. B.) They dive from cliffs into the sea. C.) They are moved by ocean currents. D.) They glide across the ice on their bellies.”

And why, Voskuil asked, is it important to read the italicized introduction that always accompanies such text passages? Because it’s a summary, the group responded.

Finally, Voskuil handed out her students’ scores from previous exams—eliciting reactions that ranged from apathy to sighs to celebration—and asked them to write down their “score goals” for the upcoming test. Remember, she told the sixth graders, “we are working toward a 3/3 on our DC CAS writing rubric.”

Voskuil hates teaching like this. It’s not that she fails to see the point, exactly. She knows that all this narrow drilling has, in fact, helped elevate her students’ scores (though they are still very low) in the three years she has been teaching. And she recognizes the test’s value as a measure of some essential skills, and as a guarantor of her school’s and her own accountability. But the assessments confine and dumb down her teaching. “You are not even allowed to be a teacher when they are testing,” she says. “You are a drill sergeant.”

In other words, Caryn Voskuil hates what has come to be known pejoratively as “teaching to the test.” She’s not alone. This kind of instruction fundamentally degrades the whole project of teaching and learning, many believe. It inevitably subjugates higher-order thinking—the kind that comes with, say, learning about character development and narrative logic, stretching one’s vocabulary, and becoming familiar with common themes—to the coarse business of pattern recognition, mimicry, and rule following. This is a lament many Americans have come to take for granted in the years since the No Child Left Behind law (NCLB) was passed. Even reformers who believe in the broader project of standards and accountability seem to regard the matter of narrow-minded test prep as an embarrassing fly in the ointment.

Framing the problem of modern assessment this way makes perfect sense—until you consider that a couple of the most elite and highly regarded institutions of American secondary education involve a ton of what can only be described as teaching to the test. Advanced Placement and International Baccalaureate courses are essentially yearlong exercises in test prep. Yet you rarely hear anyone complain about them as such. Why?

The difference lies in the tests themselves, and the kind of preparation they demand. A run-of-the-mill standardized exam like the DC CAS is a test of “basic skills.” It asks students to do things like find the main idea of a text using a series of multiple-choice questions. An Advanced Placement test, by contrast, asks students to do things like analyze and interpret texts, construct logical explanations, and put facts in context, using a mix of multiple-choice, short-answer, and essay questions. All year, a student in AP American History is told what to expect on the final standardized exam: she knows she will need to become knowledgeable about a certain set of events spanning a certain period of time; she knows that memorizing a bunch of dates won’t really help her, and that being able to explain cause and effect will. It’s not that one model encourages “teaching to the test,” and the other doesn’t. It’s that one model causes shallow learning to crowd out the deep, and the other doesn’t.

In America, high-caliber tests like AP exams are usually the province of elite, high-achieving students on the college track. And so you might think that this two-tier assessment system is an inevitable result of inequality—that underprivileged kids wouldn’t have a prayer on these more demanding tests. Yet the industrialized nations that consistently outshine the United States on measures of educational achievement—countries like Singapore and Australia—have used such assessments for students across the educational and socioeconomic spectrum for years. Although some are multiple-choice tests, most are made up of open-ended questions that demand extensive writing, analysis, and demonstration of sound reasoning—like AP tests. “There is no country with a consistent record of superior education performance that embraces multiple-choice, machine-scored tests to a degree remotely approaching our national obsession with this testing methodology,” says Marc Tucker, the president of the National Center on Education and the Economy and an expert on testing. “They recognize that the only way to find out if a student can write a competent twenty-page history research paper is to ask that the student write one.” In other words, the kind of knowledge you can measure with a multiple-choice test is ultimately not the kind of knowledge that matters very much.

For that reason, experts have for years pleaded for the U.S. to adopt the kinds of tests that measure and advance higher-order skills for all students. You won’t be surprised to hear they’ve been frustrated. Part of the reason why they haven’t gotten their way is economic. Viewed in a certain light, “basic skills tests” are in fact just “cheaply measurable skills” tests. According to Tucker, superior assessments cost up to three times more than a typical state accountability test. Quite simply, scoring essay questions and short answers is expensive.

The other big problem is that the American testing market is fragmented. If there were some unified, standard curriculum across states—like an AP course in “What You Need to Know by the End of Third Grade”—then states would be able to pool their resources to pay for a worthwhile test they could all share, and test makers would be able to set up economies of scale, bringing prices down. The rest of the industrialized world operates much like this: countries examine their students to see how well they have mastered a certain standard national curriculum. For various political reasons, we do not have a standard national curriculum. And so we have tests like the DC CAS, which establish a de facto curriculum in schools like Hart—a curriculum of “basic skills.”

Having said all that, here’s some astonishing news: quietly, over the past few years, forty-five American states plus Washington, D.C., have been working to establish something called the common core standards in math and English. While not a unified national curriculum, the common core will lay down a set of high, unified standards—rubrics that define what students should know and be able to do by, say, the end of third grade. Those standards will be enough to defragment the American testing market. With them will come a set of completely new, interactive, computerized tests that promise to be much like what you’d find in Singapore or Australia or an AP classroom—exams that test higher-order thinking by asking students to show, in a variety of different ways, whether they have mastered a set of working concepts. If this sounds like the kind of thing that might actually debut around the time we all drive electric cars, think again: these new assessments will start field testing next year, and are due to land in most American classrooms in 2014.

Most of what you know about school testing is about to change. That much is relatively certain. What remains to be seen is whether that change will be so dramatic that it overloads the current system.

American schoolchildren have been taking achievement tests for decades. In the 1950s, they used their well-sharpened number 2 pencils on something called the Iowa Test of Basic Skills, which is still in use and is almost exclusively multiple choice. Tests of this period were of the low-stakes variety—indeed, they usually weren’t required at all—and they were “norm referenced,” meaning that students were rated as they compared to each other. (Nancy was in the 90th percentile, Susie in the 70th, and so on.) When the Russians launched the Sputnik satellite in 1957, U.S. schools came under pressure to up their game. The Elementary and Secondary Education Act of 1965 (ESEA), the precursor to No Child Left Behind, focused federal funding on poor schools with low-achieving students. Meanwhile, there was a growing feeling among the public that all students should be striving for well-defined learning goals and be tested on that basis. Some of this demand for data on students’ achievement was met by the National Assessment of Educational Progress, popularly known as the Nation’s Report Card, which was first administered in 1969. The NAEP measured just a sampling of students, and it didn’t break out state results as it does now, but it marked a trend toward using tests to monitor performance.

Worries about the caliber of the nation’s schools cropped up again in the mid-1970s when the College Board revealed that average SAT scores of American students had been falling since the mid-1960s. The public started to demand proof that schools were doing their jobs, and the states responded by requiring students to take minimal competency tests in order to graduate from high school. These so-called exit exams set several important precedents: they started a trend toward more accountability; they led to more statewide testing; and they began a shift away from measuring students’ performance relative to each other and toward a new regime of measuring how well students individually met strict standards. In psychometric terms, norm-referenced tests were giving way to “criterion-referenced” tests.

Yet, for political reasons—mostly in the form of resistance by local school boards, teachers’ unions, and parents—the bar for passing these exit exams was almost universally low. According to Eric Hanushek, a senior fellow at the Hoover Institution at Stanford University, no state before 1990 administered an exit exam that even reached the ninth-grade level.

The ineffectiveness of these tests became obvious in 1983 with the publication of A Nation at Risk, the landmark federal study that warned of a rising tide of mediocrity in the nation’s schools. A number of states responded to the report by pushing for higher standards and mandating tests. Then, in the 1990s, President Bill Clinton nudged the movement along further with legislation that gave grants to state and local governments to set new standards and create tests to measure how well students were meeting them. Most states took advantage of the grants, but the legislation provided no mechanisms to punish schools that failed to make progress. To the extent that there was accountability, it was unevenly adopted by the states.

That changed in 2001 when President George W. Bush signed the No Child Left Behind law, under which, for the first time, the federal government itself was demanding that school districts be held responsible for the performance of their students. The tests that gauged this performance measured minimum competency, and the law required states to show that their students were making yearly progress toward the goal of becoming proficient. The new regime also led to a fundamental change—many would say for the worse—in the relationship between testing and instruction. Whereas the original goal of achievement tests was to improve instruction by providing educators with useful information, says Daniel Koretz, an assessment expert with the Harvard Graduate School of Education, the new goal was to improve instruction by holding someone accountable for results. Koretz calls this shift “the single most important change in testing in the past half century.”

NCLB has had some successes. Because it requires states to break down data among racial and other demographic groups, it has identified significant achievement gaps, and in most states those gaps are narrowing. It deserves credit for inducing broad gains in achievement in the key subjects of reading and math (even as it crowds out other subjects), and it has encouraged teachers to use data to shape instruction. But more than ten years after the law’s passage, 50 percent of schools are not making the Adequate Yearly Progress required by the law.

The greatest drawback of NCLB, meanwhile, is the one that so unnerves Caryn Voskuil: the tests it spawned ask students to restate and recall facts rather than to analyze and interpret them. It turns out that this is largely a legacy problem: because the stakes of standardized tests in America before NCLB were historically very low, states had no interest in paying much for them. And when those stakes got higher and states did need measures of accountability, they simply used or replicated the cheap tests they already had. They did so, in part, because each state had its own standards, and thus needed its own tests. That fragmented demand, along with the need for lightning-fast scoring, led to a shortage of experts to build the tests, as well as downward pressure on the profit margins of testing companies. The troubles in the industry, according to Thomas Toch, a senior fellow with the Carnegie Foundation for the Advancement of Teaching, created a strong incentive for states and testing contractors to write tests that measure largely low-level skills.

When President Obama took office in 2009, he inherited all the flaws of NCLB and standardized testing. But just as frustration with the law was reaching its height, he was also handed an opportunity: the common core movement, an initiative of state governors and the heads of large school systems, was guiding the states toward uniform academic standards, thus solving one of the biggest obstacles to improving tests and raising achievement. Using the vast pool of money established by the 2009 federal stimulus package, Obama prodded the movement along. Specifically, he allocated $330 million for the states to design a state-of-the-art, nationwide test. In laying out this challenge, the administration established the following guidelines: the tests should be aligned to the new high standards; they should measure deeper learning; they should be computerized; and they should be capable of being used to evaluate not just students but educators. Oh—and they would have to be up and running in classrooms by 2014.

The states banded together to embrace the challenge, and eventually they winnowed themselves into two pioneering R&D teams (with, alas, anesthetically long names): the Partnership for Assessment of Readiness for College and Careers (PARCC) and the Smarter Balanced Assessment Consortium (SBAC). Each team, or consortium, is producing its own tests, but their scoring systems will be comparable—as those of the ACT and the SAT are to each other—so there will essentially be one national benchmark of readiness for college and careers. Over the past several months, these networks of state assessment directors, teachers, college administrators, content experts, and psychometricians have been racking up frequent-flyer miles and phone minutes, hashing out all the intricacies of twenty-first-century assessment. It’s not exactly astronauts and rocket scientists in The Right Stuff—but their efforts may ultimately have as much or more of an impact on the country.

Because the consortia are still letting out contracts, they won’t have test prototypes until this summer. But already the outlines of the two projects are taking shape. Both groups are designing interactive computerized tests that will have far more essays and open-response questions, more practical math exercises, and more word problems than current models. They will both use more nonfiction and informational text in addition to literary text. Both also call for fast machine scoring. The groups have similar goals for the long term, but PARCC, whose assessments won’t even be fully computerized until 2016, is less ambitious and more practical in the short term.

Since they are required by NCLB, most of the tests offered by both consortia will be “summative,” meaning that they summarize the development of learners at a particular time. PARCC, which represents a collaboration among twenty-one states, is focusing on these kinds of tests, which states will use to hold educators accountable and to judge students’ readiness for college. It will have two assessments, and in a big departure from current practice, one will include performance tasks, such as asking a student to analyze a text using evidence to support claims or having him apply math skills and processes to solve real-world problems. At the end of the year, it will combine these into one summative score. PARCC will also have a speaking and listening test graded by a teacher.

The SBAC, a collaboration among twenty-six states, will create summative tests as well, but it will also develop “formative” assessments—tests that are used to gauge student progress in midstream and help teachers make course corrections. A formative test, which takes place during a sequence of instruction, can consist of anything from calling on a student in class to giving him a math quiz or assigning him a lab report. In each case, the teacher uses the resulting information to adjust her instruction. Designers of the next-generation tests believe that some standardized assessments can be formative, as well.

One problem with today’s standardized tests is that they are virtually useless when given to children who are not performing at grade level. The sixth-grade DC CAS, for instance, doesn’t tell Voskuil much about her many students who are barely reading at the third-grade level—it just says that they’re failing. (By the same token, many students who are way ahead of the curve simply register as “proficient.”) The SBAC addresses this challenge with a new kind of test: one whose questions change based on individual student performance. If the student does well, the questions get progressively harder; if he does poorly, they get easier. These so-called computer adaptive tests are more costly to create, but the beauty of them is that they can pinpoint where students really are in their abilities. The SBAC will also offer two optional interim assessments, which will ask students to perform such tasks as making an oral presentation or writing a long article. These exercises, which would take students one or two class periods to complete, will require students to use other materials or work with other people.
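For the technically curious, the core logic of an adaptive test can be sketched in a few lines of Python. Everything in the sketch below—the item bank, the ten-point difficulty scale, the one-step adjustment rule—is a hypothetical simplification for illustration, not the SBAC’s actual engine.

```python
import random

# Hypothetical item bank: difficulty levels 1 (easiest) through 10 (hardest),
# each holding a pool of question IDs. Purely illustrative.
ITEM_BANK = {level: [f"item-{level}-{i}" for i in range(1, 21)]
             for level in range(1, 11)}

def run_adaptive_test(get_response, num_items=20, start_level=5):
    """Administer items one at a time, stepping up a difficulty level after a
    correct answer and down a level after an incorrect one.

    get_response(item_id) should return True when the student answers correctly.
    Returns a list of (item_id, level, correct) records; a real system would
    feed these into a statistical model to estimate the student's ability.
    """
    level = start_level
    history = []
    for _ in range(num_items):
        item = random.choice(ITEM_BANK[level])
        correct = get_response(item)
        history.append((item, level, correct))
        # Move up or down one level, staying within the bank's range.
        level = min(10, level + 1) if correct else max(1, level - 1)
    return history

# Example: simulate a student who reliably handles items up to level 6.
if __name__ == "__main__":
    simulated_student = lambda item_id: int(item_id.split("-")[1]) <= 6
    results = run_adaptive_test(simulated_student)
    print("Difficulty level of the final item:", results[-1][1])
```

A production engine would rely on item response theory rather than a one-step ladder, but the effect the SBAC is after is the same: the test homes in on the level where a student’s answers begin to break down, instead of reporting only “failing” or “proficient.”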

Another problem with most current testing regimes is that they consist almost entirely of big tests administered at the end of the year. By the time a teacher learns that her students were having trouble with double-digit multiplication, the kids are already off to summer camp. Thus the new common core system will include more frequent assessments, which will measure skills that have recently been taught, allowing teachers to make mid-course corrections. Assessment, says Margaret Heritage, a testing expert with the University of California, Los Angeles, “needs to be a moving picture, a video stream.”

While it is still too early to describe any of these common core tests in detail, some testing companies have developed prototypes using the same kind of interactive assessment models that the two R&D teams are talking about. One of these prototypes is being developed by the Educational Testing Service (ETS) and piloted in a number of schools. Watching a student use this prototype offers a more concrete glimpse of what the near future of testing might look like.

In a ninth-grade classroom in North Brunswick, New Jersey, a student logs on to a computer. As if viewing an online slide show, he clicks on an aerial photograph of drought-stricken Lake Mead. A pop-up box tells him that his task is to determine what water conservation measures are necessary. Photographs and sketches depict a spillway, a river, a dam, and the lake that the dam has created. Next to these illustrations is a sketch of a sink with a faucet and a stopper. The prototype then asks the student to draw analogies between the pictures—between the stopper and the dam, the faucet and the river, the sink and the lake.

After showing the capacity of the sink in gallons, the prototype asks the student to perform a number of calculations (onscreen and using a built-in calculator) that determine the water’s flow rate and speed, then to plot them on graphs using the mouse and cursor. It even asks the student to explain some of his choices in writing: for instance, how can you tell from the graph that the slope is three gallons per minute? What is remarkable about this test—aside from the fact that all these calculations actually feed into a simulation of water flowing, just so, into the sink—is how much time it devotes to one subject. It goes deep, in other words, and presents the kind of problem a student might see real value in solving.
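That slope question is, at bottom, a simple rate calculation. As a rough illustration—with invented readings rather than the prototype’s actual data—the fill rate falls out of any two points on the graph:

```python
# Hypothetical readings from the sink-filling graph: (minutes elapsed, gallons in sink).
# These numbers are invented to illustrate the calculation, not taken from the ETS prototype.
readings = [(0, 0), (2, 6), (4, 12), (6, 18)]

# Slope = change in gallons divided by change in minutes between two points on the line.
(t1, g1), (t2, g2) = readings[0], readings[-1]
flow_rate = (g2 - g1) / (t2 - t1)

print(f"Flow rate: {flow_rate} gallons per minute")  # prints 3.0
```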

The ETS prototype’s writing exam gets at the same kind of deeper learning. On this test, students consider whether junk food should be sold in schools. They must do some quick research using materials supplied by the test, summarize and analyze arguments from those materials, and evaluate their logic. The test even does some teaching along the way, reminding students what the qualities of a good summary are and defining certain words as the cursor rolls over them. The test provides writing samples, such as letters to the editor written by a principal and a student. Do these samples display the qualities of a good summary? The test asks the student to explain why. Is the writer’s logic sound? The student must prove he knows the answer to that question too. Does certain evidence support or weaken a position? The test taker checks off which.

According to ETS researchers, exercises like these are effective at both assessing and encouraging deeper learning. Teachers seem to agree. “The test improves motivation because students make the connection between the assessment and the classroom,” says Amy Rafano, an English teacher in North Brunswick. “The scaffolding is right in the test. Rather than having students just write an essay, the task encourages them to read source materials and adjust [their thinking] while writing. They have to understand where information comes from. This is real-world problem solving, and it gives the students a sense of why these skills are important.” At the very least, exercises like these mark a distinct departure from the generic prompts that serve as essay items on many current tests.

If experts agree on the need for radically different tests, they also agree on how difficult it’s going to be to implement them, especially under the timetable and cost constraints dictated by the Obama administration. Just designing the tests themselves is a monumental job: the writing exercise on the ETS prototype in North Brunswick, for instance, took a team of developers several weeks to create. PARCC and the SBAC must craft hundreds of similar exercises while also making sure they work well together.

Designers of the new tests must also decide how the items should be weighted. Should syntax be counted more than punctuation? Should multiplying fractions be stressed more than graphing linear functions? In addition, educators agree on the need for more open-ended questions, such as those being tested in North Brunswick, but open-ended tests have drawbacks of their own. They are less reliable than multiple-choice exams (an acceptable response can take several different forms, whereas there is only one correct response to a multiple-choice question), and they are “memorable”—meaning they can’t be reused very often if the test is to have any level of security. Most important, scoring open-ended tests is more difficult and time-consuming than scoring fill-in-the-bubble tests. To ensure consistency among the raters, each item must be reviewed many times over. Scoring a short open response that consists of a sentence or two might take a minute—compared to a fraction of a second for a machine-scored multiple-choice item—and scoring an essay could take an hour.

It is a given that the new assessments will be administered on computers. This assumes two things: that students are comfortable working digitally, and that school districts have the necessary technological capacity. The first is probably a safe assumption; the second less so. Ask any state assessment director what he worries about most, and the answer is almost always some variation on “bandwidth.” In an informal survey taken by the common core R&D teams, more than half the states are already reporting significant concerns about capacity, including the number of computers available, their configurations, and their power and speed. This poses a dilemma: requiring too much technology may present insurmountable challenges for states, while requiring too little may limit innovation. Right now, the test makers are forced to essentially guess what the state of technology will be in 2014. An assessment director in Virginia, a state that already uses computer testing—but has not signed on to the common core—told attendees at a recent conference that when a rural school in his state charged all of its laptops one night, it overloaded the building’s circuits and shut off the facility’s heat.

Technological capacity can also narrow or enlarge what educators call the “testing window”—the amount of time they need to schedule for administering exams. The new tests will already require more time than existing assessments, but if districts don’t have enough computers for everybody to take the tests in the same week, they will have to enlarge the window even more, spreading testing over many weeks. In that case, students at the back end will enjoy an advantage because they will have had more time to learn the material being tested.

While the new assessments will undoubtedly be harder to score than the current fill-in-the-bubble ones, that doesn’t necessarily mean that the essays will be scored by humans. People, as you may have heard from your robot friends, need to be recruited and trained; they are subjective; and, worst of all, they are slow. PARCC, for one, says it will bypass these fallible creatures as often as possible: it wants items scored very quickly by computers to maximize the opportunity for the results to be put to good instructional use.

Because of recent advances in artificial intelligence, according to a 2010 report by the ETS, Pearson, and the College Board, machines can score writing as reliably as real people. That is, studies have found high levels of agreement with actual humans when those humans are in agreement with each other. (Given how often humans disagree, even the ETS concedes this is at best a qualified accomplishment.) Machines can score aspects of grammar, usage, spelling, and the like, meaning that they are decent judges of what academics call the rules of “text production.” Some programs, according to the ETS, can even evaluate semantics and aspects of organization and flow. But machines are still lousy at assessing some pretty big stuff: the logic of an argument, for instance, and the extent to which concepts are accurately or reasonably described.

By way of making assurances, the ETS says that machines can identify “unique” and “more creative” writing and then refer those essays to humans. Still, the new tests will be assessing writing in the context of science, history, and other substantive subjects, so machines must somehow figure out how to score them for both writing and content. Likewise, machines struggle to score items that call for short constructed responses—for instance, an item that asks the student to identify the contrasting goals of the antagonist and the protagonist in a reading passage. A machine can handle this challenge, but only when the answer is fairly circumscribed. The more ways a concept can be described, the harder it is for the machine to judge whether the answer is right. (For now, both consortia are calling for computer scoring to the greatest extent possible, with a sampling of responses scored by humans for quality control.)

The risk of all this, of course, is that in pursuit of a cheaper, more efficient means of scoring, the test makers will assign essays that are inherently easier to score, thus undermining one of the common core’s central goals, which is to encourage the sort of synthesizing, analyzing, and conceptualizing that only the human brain can assess. Flawed and inconsistent though they may be, humans can at least render an accurate judgment on a piece of writing that rises above the rules of “text production.” Maybe this is why all those high-achieving countries that use essay-type tests to measure higher-order skills use real people to score those tests. “Machine-scored tests are cheap, constitute a very efficient and accurate way to measure the acquisition of most basic skills, and can produce almost instant results,” says Marc Tucker. “But they have a way to go before they will give either e. e. cummings or James Joyce a good grade.”

There might be one other non-robotic way to bring down the cost of scoring: assign the task to local teachers instead of test-company employees. According to the Stanford Center for Opportunity Policy in Education, the very act of scoring a high-quality assessment provides teachers with rich opportunities for learning about their students’ abilities and about how to adjust instruction. So teachers could score assessments as part of their professional development—in which case their services would come “free.” Teachers, however, might find fault with this accounting method.

There’s no doubt that the joint common core effort provides opportunities for significant economies of scale: individual states can now have far better assessments than any one of them could afford to create on its own. But the fact remains that quality costs. The federal stimulus funding covers the creation of the initial assessments, but the overall cost of administering the tests dwarfs the cost of creating them. In addition, the stimulus money runs out in 2014, which is only the first assessment year. The Pioneer Institute, a right-leaning Boston-based think tank that has been critical of the common core standards, has put the total cost of assessment over the next seven years at $7 billion.

Whether that number proves accurate or not, it’s clear that the new testing regime represents a huge investment that most states haven’t yet figured out how to pay for. The current average cost per student of a standardized state test is about $19.93, with densely populated states paying far less and sparsely populated states paying far more. The SBAC estimates a per-student cost of $19.81 for the new summative tests and $7.50 for its optional benchmark assessments. But the Pioneer Institute says in a recent report that those numbers are unrealistically low given the consortium’s ambitious goals. PARCC, which has scaled back its original plans, projects combined costs for the two summative tests of $22 per pupil.

How the new tests will affect the states’ already depleted coffers depends on the state. A state like Georgia, which now spends about $10 per student on testing, will likely have to ante up more money. Maryland’s costs should come down. Florida will have to scrap an assessment it just revised for the 2010-11 school year to align with more rigorous state standards. Whatever their situation, the states have some careful fiscal planning to do.

But let’s put things in perspective. Critics of testing habitually protest its cost, implying that the millions spent on assessment would be better put toward smaller class sizes, expanded library hours, or the restoration of art and gym. But despite testing’s huge and growing role in education, the U.S. now devotes less than a quarter of a percent of per-pupil spending to assessments. That’s less than the cost of buying each of America’s students a new textbook.
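The arithmetic behind that quarter-of-a-percent figure is easy to check. Taking the roughly $20-per-student test cost cited above and an assumed ballpark of about $11,000 in annual per-pupil spending (an illustrative national average, not a precise statistic), the assessment share comes out well under 0.25 percent:

```python
# Back-of-the-envelope check. The spending figure is an assumed national
# ballpark for illustration; actual figures vary widely by state.
test_cost_per_student = 20.0      # approximate cost of a state test, per the figures above
per_pupil_spending = 11_000.0     # assumed average annual per-pupil spending

share = test_cost_per_student / per_pupil_spending
print(f"Assessment share of per-pupil spending: {share:.2%}")  # roughly 0.18%
```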

The American education system is at a major crossroads, one that few Americans are aware of. The new assessments—the product of a huge investment of time, knowledge, and talent—are only two years away from being put in place, and they’re desperately needed. It’s too early to know whether they will work as advertised, and even if they do, the danger is that states will quickly revert to their old habits of doing assessment on the cheap. But if we do this right, we could finally provide educators like Caryn Voskuil with one of the tools they need most: a test worth teaching to.


Susan Headden, a Pulitzer Prize-winning journalist, is a senior writer/editor at Education Sector, a Washington, D.C., think tank.