Evaluations of teachers based on their students’ year-to-year growth on standardized tests are not all that stable. Teacher effectiveness scores in one year have a correlation of only about .30 to .40 with scores in the following year (the positive number means they move in tandem, and a perfect correlation of 1.0 would mean they moved in complete unison). That’s not very good, right?
The answer to that question depends on what you compare it to. A new paper from the Brookings Institution points out that the stability of value-added scores for teachers is not out of line with that of performance measures for other complex jobs. Home sales for realtors, investment returns on mutual funds, utility company worker productivity, and the output of sewing machine operators all had year-to-year correlations between .33 and .40. For professional baseball players, the season-to-season correlation in batting average is .36. The SAT has a .35 correlation with college freshman GPA. In other words, value-added measures for teachers are right in line with the measures used in other professions.
What about the metrics we already use to evaluate teacher quality, things like academic credentials and years of experience? To answer this question, the authors consider the case of a district making layoff decisions, in this case New York City. If New York decided it needed to lay off five percent of its workforce, and it did so entirely on the basis of seniority, it would lose some ineffective teachers and some effective teachers (we see this happen in the real world when teacher-of-the-year winners are selected for layoffs based on their seniority). To show what this looks like in terms of effectiveness, the authors adopted a graph from an earlier report from the Urban Institute. The black line is the overall district-wide spread in teacher effectiveness. The red line represents the teachers who would be laid off if experience were the only factor, and the blue line shows which teachers would be laid off if the decision were based on effectiveness alone.
It’s fair to point out that the blue line in this example would represent the value-added scores of teachers in the prior year. Those scores are not perfect indicators of a teacher’s effectiveness. Some of those teachers could go on to very good careers, and making layoff decisions based on prior-year scores will mean some teachers are laid off unfairly. But our current system of using experience as the sole determinant of layoff decisions brings unfair decisions of its own. Value-added should be put in the proper “compared to what?” context. In that light, value-added is the worst form of teacher evaluation, but it’s better than everything else.
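For readers who want to see what a year-to-year correlation of about .35 implies for the layoff comparison, here is a minimal simulation sketch. It is not the Brookings or Urban Institute analysis. Every name and number in it is an assumption made for illustration: scores are modeled as a stable "true" effect plus independent yearly noise, the noise level is chosen so the year-to-year correlation lands near .35, and seniority is assumed to be unrelated to effectiveness, which is a deliberate simplification.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate_layoffs(n_teachers=20000, stability=0.35, layoff_share=0.05, seed=0):
    """Compare two hypothetical layoff rules on simulated teachers.

    Assumptions (illustrative only, not taken from the paper):
    - observed score = true effect + independent yearly noise
    - true-effect variance = stability, noise variance = 1 - stability,
      so two years of scores correlate at roughly `stability`
    - seniority is drawn independently of effectiveness
    """
    rng = random.Random(seed)
    true_sd = stability ** 0.5
    noise_sd = (1 - stability) ** 0.5

    true_effect = [rng.gauss(0, true_sd) for _ in range(n_teachers)]
    year1 = [t + rng.gauss(0, noise_sd) for t in true_effect]
    year2 = [t + rng.gauss(0, noise_sd) for t in true_effect]
    seniority = [rng.random() for _ in range(n_teachers)]

    n_cut = round(n_teachers * layoff_share)
    ids = range(n_teachers)
    # Rule 1: lay off the teachers with the lowest prior-year scores.
    cut_by_score = sorted(ids, key=lambda i: year1[i])[:n_cut]
    # Rule 2: lay off the least senior teachers.
    cut_by_seniority = sorted(ids, key=lambda i: seniority[i])[:n_cut]

    return {
        "year_to_year_r": pearson(year1, year2),
        "laid_off_by_score": statistics.fmean(true_effect[i] for i in cut_by_score),
        "laid_off_by_seniority": statistics.fmean(true_effect[i] for i in cut_by_seniority),
    }


if __name__ == "__main__":
    for name, value in simulate_layoffs().items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions, the group cut by prior-year score should have a clearly below-average true effect even though the score is noisy, while the group cut by seniority should sit near the district average. If seniority were modeled as correlated with effectiveness, the gap between the two rules would shrink.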
{ 2 comments }
Fully agreed on the “compared to what?” point. There’s way too much dismissal of value-added based on imprecision that is shared by all measures, and that is frustrating to me.
But your illustration of this principle is a little circular. You are saying (and I agree) that quality metrics should be assessed in comparison with their alternatives. But how do we carry out that assessment?
For example, how do we assess the validity/reliability of experience? In your post, you (or Brookings) illustrate how we might assess it in terms of value-added – i.e., seniority-based layoffs would dismiss a wide distribution of teachers, many of whom have VAM estimates above average (not counting error). This strikes me as more of an argument for value-added than a meaningful comparison of different metrics, since (obvious point here) any measure that was assessed in terms of value-added could never compare favorably to value-added itself.
Think about the converse situation, in which someone criticized a policy of making layoffs based purely on value-added scores by showing how these layoffs resulted in a large number of experienced teachers being dismissed. You wouldn’t be impressed. But, in both cases, one measure is essentially being posited as the desired outcome, and this, rather than a meaningful contrast of quality measures, is driving the conclusion.
So, I agree on the comparative imperative point, but, at least for the moment, it’s tough to carry out these bivariate comparisons without imposing our “preferences” for other measures. That the two are not strongly correlated does not necessarily mean that they don’t both signal quality.
One alternative (per the Brookings paper) is to compare the stability of each teacher measure to those used in other professions. But this really just shows that measures are similarly imperfect in other contexts, rather than giving us a way to compare different measures within education. One big problem is that value-added is fundamentally different from the other common measures you mention: unlike experience and education, true teacher effects are not directly observable (principal observations would be a better comparison). In the end, we may have to use composite measures, and assume that the various components all signal quality in some way.
Regardless, completely dismissing value-added solely because of its imprecision is a weak argument. The reason it occurs is, in large part, a communication/trust issue.
I am so disappointed by the intellectual dishonesty of the Brookings paper. I’ve always admired Whitehurst even when I disagree.
Firstly, using a meta-analysis was indefensible because he lumped together sectors and methods that might be comparable to teaching with sectors that can’t be compared to teaching. For instance, the real estate and investment sectors were using monetary changes to predict other monetary changes. That is not remotely comparable to using test score changes to predict learning changes. This issue will grow when VAMs from elementary data are used to create targets for the very different world of high school. Each time, a number is used as a rough surrogate for reality, and a new level of distance is added.
Education, like health care, adds a whole new set of variables. But the health care examples he cited were using flawed data for REPORTING, not firing, a distinction he’d previously made. After all, what I’ve been proposing is using testing for a Consumer Reports-style report, which is exactly what they did in health care.
In baseball, averages are used by coaches who also see the actual swing. Coaches know the difference between swinging at a 100 mph fastball vs. an 80 mph one, and they can see how the batter does against a sharp curve vs. a lousy one. That’s why good coaches and good scouts look at FUNDAMENTALS. In other words, effective coaches and scouts use input accountability, not output accountability, to build teams.
All that study really did was ask whether using VAMs for teachers would be more unfair than using similar data for other jobs. It did not address whether it would be stupider to do so. What would be stupid would be to use batting stats to encourage contact hitters to swing for the fences or vice versa. Using VAMs would do just that, forcing educators to take their eyes off the ball, which is teaching and learning, and do test prep, which is educational malpractice.
Getting back to his fairness focus: he’s saying that next year’s numbers might not be more unfair than last year’s numbers. He never asks whether the first year’s numbers are accurate or not. Whitehurst, of all people, should know how variable test quality is across systems throughout the nation.
Here’s the issue. If you use data-DRIVEN evaluations, where test scores from VAMs indict teachers as ineffective, then you completely poison the well. If you want data-INFORMED evaluations, where equally flawed data is used to supplement or complement evaluations, that’s different.
I wonder why Whitehurst has gotten so punitive and argumentative. I fear that he’s representative of what’s happening with “reform.” Reformers, who were educational novices, entered the field wanting to help kids. Now, out of frustration, they are obsessed with defeating enemies. Whitehurst did not author a social scientific paper. He authored a morality play.