The Viability and Fairness of Value-Added Models for STEM Teachers

Howard Wainer

Is judging teachers based on students' test scores fair?

Some ideas only make sense if you say them fast. The currently trendy notion of assessing and comparing teachers based on the change in the test scores of their students is one of these.

The basic idea seems eminently reasonable -- students take a test at the beginning and at the end of an academic year, and the teacher is rated on the average gain (suitably adjusted) in those scores; this is the teacher's 'value-added model' (VAM) score. Teachers with high VAM scores receive high ratings and are duly rewarded. Teachers with low VAM scores are slated for mentoring and, if no serious improvement occurs, reassignment or dismissal.
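To make the arithmetic concrete, here is a minimal sketch in Python of the naive computation just described. The teacher names and score records are hypothetical, and operational VAM systems adjust gains for student covariates and measurement error; this shows only the raw mean-gain core of the idea.

```python
from collections import defaultdict

# Hypothetical records: (teacher, pre-test score, post-test score).
# Real VAM systems adjust these gains; this sketch uses raw gains only.
records = [
    ("Ms. Jones", 20, 30),
    ("Ms. Jones", 45, 52),
    ("Mr. Smith", 90, 100),
    ("Mr. Smith", 85, 88),
]

gains = defaultdict(list)
for teacher, pre, post in records:
    gains[teacher].append(post - pre)

# A teacher's naive VAM score is the average gain of his or her students.
vam = {teacher: sum(g) / len(g) for teacher, g in gains.items()}

for teacher, score in sorted(vam.items(), key=lambda kv: -kv[1]):
    print(f"{teacher}: mean gain = {score:.1f}")
```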

So what's the problem? Actually, there are a number of knotty technical and logical issues that have yet to be solved. An especially tough one concerns measuring the extent of the causal connection between the teacher and the score change. This issue haunts all procedures whose principal goal is to assign credit or blame. Why would we believe the causal implication of "Freddy gained ten points because he had Ms. Jones" when we wouldn't believe "Freddy grew four inches because he had Ms. Jones"? Obviously, changes in student performance have many plausible causes, not all of them under the control of the teacher.

But the issue that is especially relevant to STEM courses relates to the load that VAM places on tests and the psychometric theory that scores them. Since the focus is on the gain score, to compare teachers we must believe that a gain from a score of, say, 20 to 30 has the same meaning as a gain from 90 to 100. While this could be close to plausible with very well-designed tests for the same subject and grade, who would believe it if one student took an arithmetic test and another a test in pre-calculus? Some believe that this problem of test scoring can be resolved if, instead of test points, we use the change in the student's rank among fellow students -- this is the basis of the so-called "Colorado Growth Model" that is currently favored in Colorado, Hawaii and New Jersey, among other states. But a student's rank depends on the other students who take that exam. Students who take STEM courses tend to be of higher caliber than students who don't, and hence they are competing in a faster field. This obviously disadvantages teachers of STEM courses.
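A small, hypothetical illustration of the rank problem: the same raw score earns a very different percentile depending on the cohort taking the exam, so a growth model built on ranks penalizes teachers whose students compete in a stronger field. The cohorts below are invented for illustration.

```python
def percentile_rank(score, cohort):
    """Percent of the cohort scoring strictly below the given score."""
    below = sum(1 for s in cohort if s < score)
    return 100.0 * below / len(cohort)

# Hypothetical cohorts: a general-track exam, and a STEM-track exam
# whose takers are, on average, stronger students.
general_cohort = [40, 50, 55, 60, 65, 70, 75, 80]
stem_cohort    = [65, 70, 75, 80, 85, 90, 92, 95]

score = 75
print(percentile_rank(score, general_cohort))  # 75.0 -- near the top
print(percentile_rank(score, stem_cohort))     # 25.0 -- near the bottom
```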

But if comparing teachers and students in courses on the same subject (e.g., mathematics) is problematic, how can we make cross-course comparisons? Is your ten-point gain in French equal to my ten-point gain in physics? Was Einstein a better physicist than Mozart was a musician?

It is this cross-course comparison problem that is also likely to cause difficulties for STEM teachers. Again, how can we compare the test score gains for exams in STEM courses with score gains in less demanding courses? If we compared physicians by their patients' cure rates, who would decide to go into oncology or gerontology?

Of course, such problems would be eased if comparisons were made only among teachers who taught the same course. But since most schools have only a single teacher for advanced technical courses, such an approach is of too narrow utility to be adopted.

What should be done? It remains possible that the problems I describe might prove too small in practice to outweigh the prospective usefulness of the methodology. I am not sanguine, but we have too little data at this moment to rule out this Pollyannaish possibility completely. So let me propose an experiment. Currently a number of states are running pilot studies of the efficacy of VAM, but, to my knowledge, none of these pilot studies has a fully comparable control group against which to measure success. Moreover, full implementation is usually planned to occur before any data generated by the pilot program have been gathered and analyzed. Of course, once the program is implemented statewide there will be no districts left from which to constitute a control group. This is reminiscent of the familiar line from many westerns: "First he'll get a fair trial, and then we'll hang him."

We should slow down. A useful pilot test would pair districts in which such an evaluation method has been implemented with matched districts that lack it. We could then compare the two groups on a full set of outcomes, including student performance, faculty satisfaction, and cost. Then, after we have the evidence, we can decide on VAM's efficacy.

It is important to remember Mae West's wise observation that "anything worth doing is worth doing slowly."

Howard Wainer is Distinguished Research Scientist at the National Board of Medical Examiners. This essay was abstracted from his most recent book, Uneducated Guesses: Using Evidence to Uncover Misguided Education Policies.

 
