Balance Is Key to Success of the New Evaluation System

After almost a year of legal wrangling, the New York State Education Department and the state teachers’ union finally came to an agreement over how to evaluate the state’s public school teachers.

Under the new system, teachers’ job performance will be assessed in part on their students’ standardized test scores and in part on rigorous classroom observations.

Both components provide substantial local flexibility, acknowledging that a “one-size-fits-all” approach cannot succeed in a diverse state like New York. This flexibility, along with the policy’s balanced use of multiple measures, treats educators like professionals who understand teaching and know their schools and communities best.

States and districts with more draconian policies (Florida, Tennessee and the Washington, D.C., school district, for instance) have much to learn from this example.

Yet the fierce battle over weighting -- whether state test results should account for 20 percent, 40 percent or an even greater share of teachers’ evaluations -- reflects a basic lack of understanding of our ability to use test scores to evaluate teachers.

At one extreme, hard-line reformers fail to see why more weight shouldn’t be placed on clearly objective “bottom line” measures of job performance. At the other, teachers and their unions shudder at the idea of their professional work being judged by the outcome of a single test.

The reality, of course, lies somewhere in between.

Test-based “value-added” measures -- like those planned for New York teachers (and those soon to be released to the general public in New York City) -- are not, in fact, objective measures of job performance. Rather, they are student test scores that have been statistically adjusted to infer teacher effectiveness.

Analysts rely on a statistical model to predict how similarly positioned students would perform under other teachers (say, the average teacher in the district), and then compare that prediction with how students actually fared under their own teacher. The gap between the two is attributed to the teacher.
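
To make the predict-then-compare logic concrete, here is a deliberately simplified sketch in Python. It is not the model New York uses; it assumes the only predictor is a student's prior-year score, fits a single straight line across the whole district, and treats a teacher's value-added as the average gap between actual and predicted scores. The function names and the six-student data set are invented for illustration.

```python
from collections import defaultdict
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def value_added(records):
    """records: list of (teacher, prior_score, current_score) tuples.

    Returns each teacher's mean residual: how far, on average, their
    students landed above or below the district-wide prediction.
    """
    priors = [prior for _, prior, _ in records]
    currents = [current for _, _, current in records]
    intercept, slope = fit_line(priors, currents)

    residuals = defaultdict(list)
    for teacher, prior, current in records:
        predicted = intercept + slope * prior  # the "average teacher" outcome
        residuals[teacher].append(current - predicted)
    return {teacher: mean(gaps) for teacher, gaps in residuals.items()}


# Hypothetical district with two teachers and three students each.
records = [
    ("A", 60, 68), ("A", 70, 77), ("A", 80, 88),
    ("B", 60, 62), ("B", 70, 73), ("B", 80, 82),
]
estimates = value_added(records)
print(estimates)
```

In this toy data, teacher A's students beat the prediction and teacher B's fall short by the same amount. Real models adjust for many more student characteristics, which is exactly where the statistical judgment calls come in.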

The goal of value-added is to separate the teacher’s contribution to achievement from the myriad other factors and year-to-year noise that affect test scores. After all, though teachers play a large role in student achievement, a child’s performance on a standardized test is the product of countless factors, past and present, inside and outside the classroom.

Isolating the teacher’s effect from these other influences is a tall order. (Think of using hospital patient records to infer the job performance of an attending physician.)

After several years of data have accumulated, value-added measures can potentially provide useful information to teachers and instructional leaders. They can offer an early warning signal for teachers who are lagging behind, highlight subject areas in need of improvement, and provide an opportunity to recognize and reward good teaching.

For new and inexperienced teachers, however, they provide relatively little information about teaching effectiveness. Estimates of value-added for these teachers can (and do) come with a large margin of error.
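
A rough illustration of why more years of data matter: the margin of error on an average shrinks roughly with the square root of the number of students behind it. The sketch below assumes each student's gap between actual and predicted score is an independent draw and uses a normal approximation for a 95 percent interval; both are simplifying assumptions, and the numbers are made up.

```python
from math import sqrt
from statistics import mean, stdev


def estimate_with_margin(residuals, z=1.96):
    """Return (mean residual, approximate 95% margin of error)."""
    standard_error = stdev(residuals) / sqrt(len(residuals))
    return mean(residuals), z * standard_error


# Hypothetical gaps between actual and predicted scores for one small class.
one_year = [4, -6, 9, -3, 6]
# The same pattern repeated over three years stands in for a veteran teacher.
three_years = one_year * 3

new_estimate, new_margin = estimate_with_margin(one_year)
vet_estimate, vet_margin = estimate_with_margin(three_years)

print(f"one year:    {new_estimate:+.1f} plus or minus {new_margin:.1f}")
print(f"three years: {vet_estimate:+.1f} plus or minus {vet_margin:.1f}")
```

Both teachers have the same average, but with one year of data the interval comfortably includes zero, so the estimate cannot distinguish an above-average teacher from an average one.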

Separating out the individual effects of collaborating teachers (or those teaching different subjects in high school) is particularly difficult. And attaching high stakes to a measure over which teachers feel they have little control raises the likelihood of cheating and teaching to the test, ultimately undermining that test’s usefulness for evaluating teachers or students.

In their agreement, the state Education Department and teachers’ union struck a delicate balance between a useful -- but very imprecise -- measure of student progress under a given teacher, and respect for the professional judgment of educators.

State educators should resist the temptation to place more emphasis on tests that are not up to the task.