Stories from the NY Times, Mother Jones, and the Washington Post bemoaned the flat National Assessment of Educational Progress (NAEP) reading scores released Wednesday. Jay Mathews called it the epitaph of the No Child Left Behind era. The results aren’t quite so simple.
See, NAEP is different from most standardized tests. It takes a sample of the current student population in every state, so this year’s sample of kids is compared to the sample from the last time the test was administered. The sample automatically tracks changing demographics, so as America has gotten less white, so has NAEP. In statistical terms this creates something called Simpson’s Paradox, which makes trend lines seem worse than they really are because of a hidden variable, in this case, race (Matthew Yglesias touched on this point yesterday).
To show how this affects NAEP scores, here are the long-term trend NAEP results for fourth-grade reading from 1975 to 2008. (I’m using the long-term trend version of NAEP because it’s been largely unchanged since its first administrations in the 1970s. It’s had one significant format change, in 2004, when NAEP administered both the new and old formats; hence the dotted and solid lines in all of the following graphs.) As the chart below shows, average fourth-grade reading scores have risen only modestly, from 210 in 1975 to 220 in 2008.

This is basically the same thing that showed up in yesterday’s results. There have been some small gains over time, but year-to-year progress has been slight.
But that’s not the whole story. See, these overall trend lines are a sampling of America. As we’ve become more diverse, NAEP has changed its sampling ratios to reflect our changing society. This chart shows the percentage of students drawn from each racial/ethnic category over time. In 1975, NAEP test-takers were 80 percent white. By 2008, only 56 percent were. The share of black students was three percentage points higher in 2008 than in 1975, and the Hispanic share had quadrupled from five to 20 percent.

So, because NAEP has gradually included more black and Hispanic students, and black and Hispanic students score lower, on average, than white students, the total score doesn’t reflect the true gains made by each group. The chart below shows scores taken from the same testing years, this time disaggregated by race.
Each group has actually made greater gains over time than the overall total. White students’ scores increased 11 points, one more than the national average. Black students scored 23 points higher, and Hispanic students were scoring 24 points higher in 2008 than they were in 1975, despite quadrupling as a share of test-takers. In other words, the white-black and white-Hispanic gaps are closing and every group is scoring higher, but the national score is showing more modest improvements because of demographic changes.
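To make the arithmetic concrete, here is a rough sketch in Python of how a weighted average can rise by less than any single group does when the mix of test-takers shifts. The white and Hispanic shares and the three per-group gains are the ones quoted above. Everything else (the 1975 baseline scores, the black share, and the catch-all "other" group) is a hypothetical placeholder chosen only so the shares add up to 100 percent. These are not actual NAEP figures.

```python
# Illustrative sketch only. Values marked "hypothetical" or "assumed" are
# placeholders, NOT NAEP data. The white/Hispanic shares and the white,
# black, and Hispanic gains come from the post itself.

def weighted_mean(shares, scores):
    """Overall score = sum over groups of (share of test-takers * group average)."""
    return sum(shares[g] * scores[g] for g in shares)

# Share of test-takers by group (black and "other" shares are assumed).
shares_1975 = {"white": 0.80, "black": 0.13, "hispanic": 0.05, "other": 0.02}
shares_2008 = {"white": 0.56, "black": 0.16, "hispanic": 0.20, "other": 0.08}

# Hypothetical 1975 group averages.
scores_1975 = {"white": 217, "black": 181, "hispanic": 183, "other": 210}

# Per-group gains, 1975 to 2008 (the "other" gain is assumed).
gains = {"white": 11, "black": 23, "hispanic": 24, "other": 11}
scores_2008 = {g: scores_1975[g] + gains[g] for g in scores_1975}

overall_1975 = weighted_mean(shares_1975, scores_1975)
overall_2008 = weighted_mean(shares_2008, scores_2008)
overall_gain = overall_2008 - overall_1975

# Split the overall change into two pieces that sum to it exactly:
#   within:      each group's gain, weighted by its 2008 share
#   composition: the change in each group's share, times its 1975 score
within = sum(shares_2008[g] * gains[g] for g in gains)
composition = sum((shares_2008[g] - shares_1975[g]) * scores_1975[g] for g in gains)

print(f"overall gain:        {overall_gain:+.1f}")
print(f"  within-group part: {within:+.1f}")
print(f"  composition part:  {composition:+.1f}")
print(f"smallest group gain: {min(gains.values()):+d}")
```

With these placeholder numbers, every group gains at least 11 points, yet the overall gain comes out smaller than that. The composition term is negative because the sample shifts toward groups with lower starting scores, and that offsets part of the within-group improvement.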

This is an important distinction to make, because it means the test score results are not just a matter of classroom teaching and learning (to be completely clear, I don’t think NAEP results can be easily attributed to national education policies like NCLB either). The students themselves have changed in important ways, and to break even or to make small achievement gains as society becomes more diverse is an accomplishment worth celebrating. At the very least it’s worth understanding.
For more background reading on NAEP, try Education Sector’s NAEP Explainer.






[...] Willingham sorts through NAEP data and points out that Chad Alderman is right that 4th grade reading scores look much better when the data are disaggregated by race. But the [...]
[...] ~ April 6th, 2010 in Teaching | by Adam Ozimek Chad Aldeman at the excellent Quick and the Ed explains how improvements in the NAEP reading scores are masked by looking at overall scores because of [...]
Joe: The original post answers your question. I wrote, “I’m using the long-term trend version of NAEP, because it’s been largely unchanged since its first administrations in the 1970s. It’s had one significant format change, in 2004, when NAEP administered both the new and old formats. Hence the dotted and solid lines in all of the following graphs.”
John and Samantha: Both good points that are related to each other. I suspect any demographic changes are going to have hidden and important effects.
I don’t understand these graphs. Why is the line between 2004 and 2008 solid, while the other lines are dashed? Why are there two data points for each ethnicity in 2004, but only one in all the other years? Why is the lower of the two data points used for the left side of the solid line? This makes it appear (perhaps falsely) like there was a nice steep improvement from 2004-8, when using the upper of the two points (the one contiguous with the rest of the data) shows much less of an improvement.
[...] from the 2009 National Assessment of Educational Progress, Education Sector’s Chad Alderman offers a different perspective. He notes that if you break down the results — and realize that the [...]
So, can we expect a similar explanation for D.C.? Their NAEP sample dropped from 91% Black in 1992 to 80% Black today, and their sample is down to 73% eligible for free and reduced lunch. And since D.C.’s exclusion rate for special ed was double the national average and nearly triple the pre-NCLB period, I wonder if that’s a factor.
My state has seen a dramatic drop in NAEP scores for 8th grade reading, despite our soaring NCLB rates. And the biggest share of our state’s lowest performing students are in our district (with 91% eligible for free and reduced lunch). Our superintendent would think he’d died and gone to heaven if he faced evaluation on a demographic sample like D.C.’s.
Chad,
Thanks for the insight and clear explanation. In looking at the 2004 and 2008 data in particular, do students with disabilities also play a role? On the 2008 LTT NAEP (Reading at Age 9), SD accounted for 9% of the assessed population and had an average scale score of 182. That’s certainly a lot lower than any of the subgroups by race, but I don’t know if the percent included is large enough to significantly impact the aggregate score. I’d love to hear your thoughts.
Thanks,
Samantha
thank you mans
Is this really a paradox? That is, the relationship isn’t really reversed, it’s just muted. It seems more like an ecological fallacy than anything, since they’re essentially presenting aggregate data and drawing dire conclusions about “students” and NCLB.
Also, if the demographic shifts were such a confound, wouldn’t the aggregate trend be reversed, since some of the groups that are receiving greater representation in the aggregate results are growing at a faster rate than the group with declining representation? My understanding of Simpson’s paradox is that the small/large n sizes of constituent groups cause the aggregate data to imply conclusions directly at odds with those derived from the between-group analysis.