Racial differences in the long-term trend NAEP scores (1975/78-2012)

I analyze the LTT NAEP achievement scores, a public data set available at NCES. In general, the minority-majority d gaps show a secular decline, for both math and reading tests, at all ages of assessment (9, 13, 17) and at all percentile levels. Some exceptions are noteworthy. There is no secular gain at age 17 among whites, and no meaningful decline in the black-white difference for the NAEP math at ages 13 and 17. Within each year of assessment, there is no evidence for the hypothesis that the racial gaps (notably, the black-white gap) widen with age after children enter school. There was simply no trend at all.

Review of earlier studies

Herrnstein & Murray (1994, p. 291) provide their estimates of the NAEP trend, from the 1970s to 1990. The mean scores were around 250, and the d gaps were calculated with an assumed SD of 50. The tests cover science, math and reading, all given at ages 9, 13 and 17. The overall reduction in the black-white gap is 0.28 d. There was no indication that the gap reduction was smaller for any component or at any given age.

Hauser & Huang (1996) attempt to replicate Herrnstein & Murray, but they find a gap reduction of 0.39. Their comment reads as follows: “The difference is mainly due to the incorporation of variation by age in the larger overall value, whereas the black-white comparisons should have been conditioned on age, just as Herrnstein and Murray attempted to condition on age in their regression analyses of the effects of the AFQT. The effect of choosing too large a standard deviation was to understate both the initial black-white differences and the changes in test scores over time in standard deviation units.” (p. 17). In a subsequent analysis, they graph the black-white trend by cohort. In general, most of the decline in the black-white gap occurs for cohorts born between the end of the 1960s and the early 1970s, at all ages (9, 13, 17). They also graph the black scores by percentile levels (5, 10, 25, 50, 75, 90, 95). The reading scores, from 1971 to 1992, improved during the 1970s for ages 9 and 13 but only during the 1980s for age 17. Concerning science scores, from 1977 to 1992, black scores improve in 1982 and again in 1986 (at all percentiles) but stop improving thereafter. The math scores were the only ones showing larger gains at the lowest percentile levels. And they find no consistent trend showing that the black-white achievement gap widens with age.

Hedges & Nowell (1998, Figure 5-2) have also examined the NAEP score gains by percentile levels for blacks and whites between 1975/82 and 1992/94. Except for science, the scores in math (for blacks and whites) and reading (for blacks) improve much more at the lowest percentiles. Reading did not improve for whites. The black scores started to drop somewhat after 1988.

Grissmer et al. (1998) compute NAEP score gains after adjusting for (percent) changes in family background (measured in 1970, 1980, 1990). The gains occur mostly during the 1970s for all ages and were more pronounced for math. Their main goal was to explain the black score changes. School desegregation plans were very extensive in the US South between 1968 and 1972, and the black gains were the most pronounced in the South. However, the black scores also improve in other regions not subjected to desegregation, such as the Northeast (Figures 6-9 & 6-10). Furthermore, Figure 6-6 shows that the score gains were greater in regions more subjected to desegregation plans. However, the authors are more enthusiastic than warranted. When looking at Figure 6-10 (below) we can see that the trend and gaps for the Southeast vs other regions at age 17 are very similar. Certainly, the West and Midwest also desegregated their schools, but the process was very slow. Thus, the pattern and timing of the desegregation process for the “other regions” do not match the dotted line in Figure 6-10 for age 17. Jensen (1973, pp. 101-102, 257-258) himself provides evidence that racial segregation has little (if any) explanatory power for the black-white difference in IQ or achievement.

Regardless, Grissmer et al. remark that the trend toward smaller class sizes (i.e., lower pupil/teacher ratio) between the 1960s and 1980s, for all age groups, plays a substantial role in the NAEP black gains over time (about 0.30 or 0.40 SD; see unstandardized coefficients in regressions presented in Table 6.5), but none at all among whites at ages 13 and 17. Teachers’ experience and education do not explain the relative black gain, and the changes in teachers’ characteristics over time simply do not match the black-white achievement changes over time. These authors also attempt to explain the stagnation (or drop) in black scores after 1988, especially for reading scores. They speculate that the rise in violence since 1985 could have produced adverse changes in black neighborhoods. They cite a study by Jeff Grogger (1997) showing that math scores at 12th grade (holding constant 10th grade scores) were lower when the principals themselves reported a more violent climate in their schools.

The Black-White Test Score Gap (Jencks & Phillips, 1998) Figure 6-10

Gottfredson (2003, table 4) also examined the NAEP trend. There was a trend toward declining racial gaps at all ages (9, 13, 17) for both math and reading for the period between 1971 and 1999. Among the 17-year-olds, the black-white ds for the 1970s, 1980s and 1990s were 1.17, 0.84, 0.73 for reading, 1.07, 0.96, 0.87 for math, and 1.23, 1.13, 1.08 for science. The respective ds for the hispanic-white gaps are 0.92, 0.63, 0.63 for reading, 0.85, 0.81, 0.73 for math, and 0.79, 0.91, 0.82 for science. Gottfredson concludes that the NAEP convergence is still consistent with a prediction that the black-white and hispanic-white gaps are about 1.2 and 0.90 SDs.

Sackett & Shen (2010, table 12.4) report that, between 1975/78 and 2004, the black-white ds have narrowed in reading at both grades (4 & 8), while in math they have narrowed for grade 4 but not grade 8. The hispanic-white ds have narrowed over time for all grades. On the other hand, these authors (2010, table 12.2) also indicate the absence of narrowing racial gaps (black-white or hispanic-white) in SAT-Math, SAT-Verbal and SAT-Writing, as well as in the ACT subtests (and composite). More than 1 million students take each of these two tests annually.

Rindermann & Thompson (2013) show, with respect to 17-year-olds, that the black NAEP scores improved (by somewhat less than 10 “IQ” points) only during the 1980s, and this period alone seems to account for the entire black-white convergence between 1971 and 2008. The hispanic scores also improved during this period, but by much less (about 5 “IQ” points). The period 1988-2008 shows great stability in scores in all groups. The trend is depicted below. But they argue that the score gains could have been higher had school attendance not risen (from around 89% in 1975 to 95% in 2009 for the US).

Ability rise in NAEP and narrowing ethnic gaps (Rindermann, Thompson, 2013) Figure 3

The modest, or absent, black score improvement after the 1980s has attracted several comments. Rindermann & Thompson (2013) speculate that the end of the black-white convergence coincided with the start of the crack epidemic around 1985, a period in which the real wages of unskilled workers (among whom blacks are over-represented) were also declining and the number of single-mother families skyrocketed.

Method & Results

The NAEP is usually said to be one of the most representative and unbiased samples. The number of subjects per assessment year is more than 26000 for either the math or the reading test. The public data is available at NCES. Examples of questions and items are available here. Importantly, NAEP only reports student group results if the student sample size is 62 or more. An illustration of how to use the NCES program is available here. Or here, alternatively. The variables used are the NAEP math and reading (science was not available to me), at ages 9, 13, 17, corresponding respectively to grades 4, 8, 12. I use the variable “National” (for all schools, public+private) and the variable race (4- and 6-categories). The scores (and other numbers) for the entire sample can be obtained by selecting the variable “All students”.

I use 4 alternative methods to compute the ethnic/racial d gaps. 1) divide the mean difference by the N-weighted average SD of the two groups. 2) divide the mean difference by the simple average SD of the two groups. 3) divide the mean difference by the SD of the reference group (i.e., whites). 4) divide the mean difference by the SD of the entire sample. The choice of method hardly matters in general, as the four produce no meaningful difference. Concerning the size of the SD over time, it diminishes for math (all groups, all ages) but not for reading.
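
The four denominators can be sketched as follows. This is only an illustration under stated assumptions: the function name `d_gaps`, the SDs and the sample shares are hypothetical placeholders (only the two means, 306 and 268, are the 1978 age-17 math values quoted later in this post), and method 1 is read literally as an N-weighted average of the two SDs rather than a pooled variance.

```python
def d_gaps(mean_ref, sd_ref, n_ref, mean_focal, sd_focal, n_focal, sd_total):
    """Return the standardized gap (reference minus focal) under the 4 methods."""
    diff = mean_ref - mean_focal
    # Method 1: N-weighted average of the two group SDs (weights can be counts or shares).
    sd_n_weighted = (n_ref * sd_ref + n_focal * sd_focal) / (n_ref + n_focal)
    # Method 2: simple (unweighted) average of the two group SDs.
    sd_simple = (sd_ref + sd_focal) / 2
    return {
        1: diff / sd_n_weighted,
        2: diff / sd_simple,
        3: diff / sd_ref,    # Method 3: SD of the reference group (whites)
        4: diff / sd_total,  # Method 4: SD of the entire sample
    }

# Hypothetical SDs and shares; the means are the 1978 age-17 math scores.
example = d_gaps(mean_ref=306, sd_ref=30.0, n_ref=0.80,
                 mean_focal=268, sd_focal=32.0, n_focal=0.15,
                 sd_total=35.0)
for method, d in example.items():
    print(f"method {method}: d = {d:.2f}")
```

When the total-sample SD exceeds both group SDs, as assumed here, method 4 returns the smallest d of the four.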

The reason for presenting alternative methods is the changing US racial composition. While blacks remain nearly constant at 15% of the entire NAEP sample for all years and all ages, the percentage of whites fell sharply, from 80% to 55%, while the hispanic and asian percentages went up. Method 1 consists of N-weight averaging the SDs, so given the difference in SDs across groups and the changing percentages (the N) over time, it is prudent to also use an SD that is more invariant over time. For the present analysis, however, I will focus on method 4. The ds are a little bit smaller because the SD estimate is larger when using the population SD.
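
A small sketch of why the method 1 denominator can drift with composition alone. The shares (80% to 55% for whites, 15% for blacks) are those given above; the two SDs are hypothetical and deliberately held constant, and `n_weighted_sd` is just an illustrative helper name.

```python
def n_weighted_sd(sd_a, share_a, sd_b, share_b):
    """N-weighted average of two group SDs, with shares used as weights."""
    return (share_a * sd_a + share_b * sd_b) / (share_a + share_b)

SD_WHITE, SD_BLACK = 30.0, 32.0  # hypothetical SDs, identical in both periods

early = n_weighted_sd(SD_WHITE, 0.80, SD_BLACK, 0.15)  # whites at 80% of the sample
late = n_weighted_sd(SD_WHITE, 0.55, SD_BLACK, 0.15)   # whites at 55% of the sample

print(f"early denominator: {early:.2f}")
print(f"late denominator:  {late:.2f}")
```

With these placeholder values the denominator moves only slightly, but it moves even though neither group's SD has changed, which is the reason for also reporting a method that does not depend on group shares.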

The spreadsheet for the entire analysis is available here.

1. NAEP LTT (1975/78-2012)

Score gains

The black, hispanic and asian scores have been on the rise over time, at all ages, and all percentile levels. The exception is the white sample. There is no gain in NAEP reading (age 17) and only a very slight increase in NAEP math (age 17) and NAEP reading (age 13). And, at age 17, whites improve their score on math, but only at the lowest percentile levels.

Interestingly, when looking at the NAEP scale scores, we can see that math shows the biggest Flynn gains for all years (although, as noted above, there is hardly any change among 17-year-olds). This is entirely consistent with Rindermann & Thompson (2013, Table 1). And yet, as we will see below, the black-white d gaps have shown the biggest declines for the NAEP reading. This is clear evidence that the Flynn gain has no relationship with the black-white cognitive changes (Ang et al., 2010).

The absence of secular gains among 17-year-olds is all the more striking. IQs have risen over time, and although there are still some conflicting views (Lynn, 2013) as to whether the Flynn gain diminishes with age, the IQ gains among adults have been quite substantial. This indicates that achievement and IQ tests, although highly correlated, seem to have different properties. See Flynn (2010, p. 365). Another set of data reported by Jensen (1998, p. 322) concerns the SAT (verbal+math), for which there was a large secular decline, one that was stronger than the IQ gain. For details, see Murray & Herrnstein (1992). But unless this pattern is replicated for other achievement tests, one can still argue it was an outlier. In general, the explanation of this divergence could be that achievement tests have a stronger crystallized component. It is well known (Flynn, 1987) that the IQ gains were greater on fluid tests. This is not the entire story because, as noted above, the secular (achievement) gains were not observed at age 17. More importantly, Flynn (1987) has estimated an IQ gain per decade of about 3.74 points for verbal tests and 5.88 points for fluid tests. Rindermann & Thompson (2013) report a gain of only 1 IQ point per decade for the NAEP.

Now, we move to the racial differences. Remember, first, that gap changes and score changes are not necessarily coherent with each other. As Murray (2005) cogently noted, black-white test score gaps have been reduced more at the higher levels but the black score gains were stronger at the lower levels. This is not a contradiction and, clearly, these two things need to be distinguished. Gap is gap. Score change is score change.

Black d gaps

The BW ds show a pattern of declining gaps for reading at all ages and all percentile levels. One notable detail concerns the NAEP math for 13-year-olds. Regardless of the method, there is only a very slight decline in ds. Also, the ds do not show a declining pattern among 17-year-olds. This is curious because blacks improved their scores, in both the math and reading NAEP. Whites improved their scores on the NAEP math, but only modestly (306 to 314) compared to blacks (268 to 288) between 1978 and 2012, even though all of the gap narrowing occurred between 1978 and 1990. So what’s the explanation for the absence of narrowing? Is it due to the method of calculating the ds? As mentioned above, the 4 alternative methods I use produce similar results. My failed attempt to replicate Gottfredson (2003, table 4) is all the more annoying. Gottfredson’s data stop at 1999. Even when I look only at the years until 1999, there is no trend toward diminishing ds. She divided the difference in means by the SD of the total sample, and I did the same. But in her computations, the ds for the 1970s, 1980s and 1990s were 1.07, 0.96 and 0.87 for the 17-year-olds who took the math test.
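
One way to frame the puzzle uses only the age-17 math means quoted above. The sketch computes the raw point gap in 1978 and 2012 and the ratio by which the SD in the denominator would have to shrink for d to stay flat. Whether the SD actually shrank by that much is not checked here; that would require the SD values from the spreadsheet.

```python
# Age-17 NAEP math means quoted in the text.
scores = {
    1978: {"white": 306, "black": 268},
    2012: {"white": 314, "black": 288},
}

# Raw (unstandardized) black-white gap in scale points.
raw_gap = {year: s["white"] - s["black"] for year, s in scores.items()}

# d = raw_gap / SD, so d stays constant only if the SD shrinks by this ratio.
sd_ratio_needed = raw_gap[2012] / raw_gap[1978]

print(raw_gap)                    # {1978: 38, 2012: 26}
print(round(sd_ratio_needed, 3))  # 0.684
```

Since the math SD is reported to diminish over time, part of the raw narrowing may be absorbed by a smaller denominator, but the size of that effect remains an open question in this sketch.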

But whatever the case, if my values can be taken at face value, we will see that, as the children age (9->13->17), the improvement in black (math) scores relative to white (math) scores diminishes toward the null.

One very popular hypothesis is the cumulative deficit hypothesis, which Jensen (1973, pp. 97-102, 1974, 1977) dealt with a long time ago. We can address this question by examining the black-white gap within each year. This is possible provided that we look at the reading and math tests separately, because the two are not necessarily administered in the same year. There is no increasing gap for reading but a small one (generally around 0.15 SD) for math. Previous studies have also examined this issue. Phillips et al. (1998, pp. 234-238) provide evidence that the black-white test score gaps do not widen after children enter school. Sackett & Shen (2010, p. 339) report black-white gaps that are similar for preschool (d=0.92) and elementary school (d=0.90). And this, again, is coherent with the idea of no increase in test score (and IQ) gaps with age. At least, in cross-sectional data analysis.
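
The within-year comparison can be sketched as follows. The d values and year labels below are made up for illustration (they are not taken from the spreadsheet), and `gap_change_with_age` is a hypothetical helper. The comparison is cross-sectional: different children are tested at each age within a given year.

```python
def gap_change_with_age(d_by_age):
    """Difference between the d gap at the oldest and the youngest age tested."""
    youngest, oldest = min(d_by_age), max(d_by_age)
    return d_by_age[oldest] - d_by_age[youngest]

# Hypothetical d gaps by age for two assessment years of the same test.
d_by_year = {
    "year A": {9: 0.85, 13: 0.92, 17: 1.00},  # gap widens with age
    "year B": {9: 0.80, 13: 0.80, 17: 0.80},  # no change with age
}

for year, d_by_age in d_by_year.items():
    change = gap_change_with_age(d_by_age)
    print(f"{year}: change in d from age 9 to 17 = {change:+.2f}")
```

A positive value within a year is what the cumulative deficit hypothesis would predict; values near zero indicate no widening.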

Hispanic d gaps

There is a large decline in ds over time for all ages and for both math and reading, with one (curious) exception. The NAEP math at age 9 shows no clear trend; from the 1980s to the 1990s the ds increase, but they diminish again in the 2000s. Nonetheless, the ds in the 2000s are (somewhat) smaller than in the 1980s.

Looking at age 13, it is not obvious which one, reading vs math, has shown the biggest decline in ds. But for the other ages, the ds for reading have been reduced much more than math. Concerning the black-hispanic gaps, there is no tendency toward greater (or smaller) score divergence over time.

For the cumulative deficit hypothesis, once again, there is no clear evidence that the gap widens with age (within each year). The only piece of evidence for widening gap concerns the math test for the most recent years (2004 to 2012). For reading, there is a slight tendency of decline in gaps over ages.

Asian d gaps

The ds for asian samples vary a lot. They are sometimes negative, sometimes near zero, and the sign of the ds depends on whether we look at the reading or the math test. At age 17, for reading, the asians score 0.20 below whites, but in recent years the d gap is around 0.00, while for math they score 0.20 above whites, although I don’t see any trend toward a growing NAEP gap. At age 13, for reading, the asians score sometimes 0.10-0.30 above whites and sometimes below whites (by the same magnitude), while for math they score around 0.22 above whites. The ds vary a great deal, but there is a trend toward a growing NAEP gap between asians and whites, to the advantage of asians. At age 9, for reading, the ds are not coherent in either size or sign: asians start higher than whites, then lower, then higher. For math, there is perhaps a very slight tendency toward a growing asian advantage, and the ds in general are around 0.25. In general, there is a tendency toward an asian advantage in the most recent years, for which the samples are likely to be larger than in earlier years. For the period considered (1975/1978-2012) the percentage of asians in the NAEP sample jumped from 1% to 6%. Along with the hispanics (who jumped from 5% to 22%), the asians are among the fastest-growing groups (in share) in the US population.

Concerning the racial cognitive change with age (within year), the pattern is very curious. The most recent and the earliest years show an asian advantage in both reading and math, but this advantage fades out with age (9->13->17), whereas in the middle years the reading (math) score is lower (equivalent) for asians, with no clear trend with age. Once again, if the estimates in recent years are more reliable due to larger sample sizes, it would mean that asians lose some of the initial advantage that was noticeable at early ages. We should note, however, that the data are only cross-sectional, and thus not suitable for detecting this kind of effect.

2. NAEP Main (1990/92-2013)

The results are also available in the spreadsheet, above. The NAEP Main has data for every 2 years (grades 4 and 8), while the LTT has data for every 4 years (ages 9, 13, 17). See here for the slight differences between the two surveys. The variables are almost the same. I chose composite scale for measure, national for jurisdiction, and race/ethnicity used to report trends, school-reported (SDRACE).

The black-white gap has narrowed by 0.15 SD in math for grades 4 and 8; the narrowing in reading was comparable for grade 4 but weak for grade 8. The hispanic-white gap has narrowed for math by about 0.15 SD only if we include 2011 and 2013. Otherwise, no trend is visible. In general the trend has an inverted U-shape (smaller in the early 1990s and 2010s but larger in the middle). The gap narrowing for reading is modest in grade 4 but not for grade 8. For the asian-white gap, there is a clear tendency toward an asian gain over whites for math (going up from 0.10 to 0.25 SD), whereas the asian gain over whites for reading is very slight (perhaps 0.10 SD).

3. HSTS (1990-2009)

The High School Transcript Study shows the trend in GPA scores (see spreadsheet), with a curious increasing gap for blacks and hispanics relative to whites. There were 4 data points: 1990, 2000, 2005, 2009. The black-white d gaps go up from 0.50 to 0.70 while the hispanic-white d gaps go up from 0.20 to 0.70 (and the asian advantage over whites remained constant at around 0.25). However, one limitation of GPA could be that the GPA gap is affected by differential course selection over time (Arcidiacono et al., 2012).

Vanishing gain in later age

Rindermann & Thompson (2013) offer several hypotheses as to why the gains may have stopped for older youths. The secular gains may have simply sped up child development. Or, the genetic effects strengthen from childhood to adulthood. The effect of preschool reforms fades out. There is stronger negative peer pressure against learning in adolescence today than before. Instructional quality in secondary schools improves less (or deteriorates) relative to primary schools.

Indeed, improvements at early ages do not necessarily generalize to later ages, and Grissmer et al. (1998, pp. 188-189) find this pattern consistent with the general picture that education programs do not improve IQ in the long run. But Grissmer et al. also explain that it is possible to change the gap at later ages while leaving the gap at early ages constant over time. For example, the high school curriculum could have become tougher over time while the primary school curriculum was left unchanged. It is as if the authors wanted to say that interventions failed to improve IQ in adulthood because the curriculum should have been made highly intensive at both earlier and later ages. However, this is not what IQ is purported to measure. A long time ago, Jensen (1973, pp. 79-95) explained that intelligence is the ability to understand the material that has been taught. If cognitive gains can be sustained only through persistent rehearsal of the acquired knowledge, the gains cannot be g-loaded. A learning effect must be maintained without rehearsal to be considered a real IQ gain. That is, the IQ gain must involve a (far) transfer effect (Jensen, 1998, p. 334). And, to be causal, early preschool programs must provide the necessary resources to equalize opportunities and chances. But when such measures leave the adult gap unchanged, that is, when early background and (cognitive) characteristics are held constant, a difference in cognitive growth must involve additional, non-cognitive factors unrelated to school (Phillips et al., 1998, p. 256). Jensen (1973) would have predicted that equalization of opportunities does not imply equalization of learning. While it is true that a scholastic test is not equivalent to an IQ test, partialling out the former must also partial out most of the effect of IQ. Given the large cumulative deficit effect found in Phillips et al. in math, reading and vocabulary (0.02, 0.06 and 0.05 SDs per year), it seems unlikely that IQ alone can account for the cumulative deficit in achievement scores.

Detection of gap*age interaction

Concerning the question of gap*age interaction, it is far from being settled. Jensen (1974, pp. 1005-1006) mentioned that cross-sectional studies are subject to external factors varying over time and correlated with age, which will in the end affect the later outcome, and sometimes create a spurious interaction of cognitive gap with age. A longitudinal study is the proper approach to examine such effects. When Phillips et al. (1998, pp. 244-256) use longitudinal data, they show the black-white score gap widening, a pattern not found when they use cross-sectional data. Longitudinal studies on IQ (Jensen, 1973, pp. 97-102, 1974, 1977; Farkas & Beron, 2004) apparently show no such cumulative black decrement, except when the environment was suspected to be relatively deprived (e.g., the U.S. South).

Achievement/IQ gap prediction

Even if we could accept the idea of gap reduction between races, the lower d values for recent years do not depart seriously from the prediction one can make based on the IQ difference. Jensen (1998, p. 273) describes how we can calculate such a regression prediction:

What exactly does a validity coefficient, rxc, tell us in quantitative terms? There are several possible answers, [2] but probably the simplest is this: When test scores are expressed in standardized form (zx) and the criterion measures are expressed in standardized form (zc), then, if an individual’s standardized score on a test is zxi, the individual’s predicted performance on the criterion is zci = rxc x zxi. For example, if a person scores two standard deviations above the group mean on the test (i.e., zxi = +2) and the test’s validity coefficient is +.50 (i.e., rxc = +.50), the person’s predicted performance on the criterion would be one standard deviation above the group mean (i.e., zci = rxc x zxi = +.50 x 2 = +1).
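
The rule in the passage above is a single multiplication. A minimal sketch, with `predicted_criterion_z` as a hypothetical function name, reproducing Jensen's own example:

```python
def predicted_criterion_z(validity, z_test):
    """Predicted standardized criterion score: z_c = r_xc * z_x."""
    return validity * z_test

# Jensen's example: a test score 2 SD above the mean, validity of +.50.
print(predicted_criterion_z(0.50, 2.0))  # 1.0
```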

A similar approach can be used to form a composite from multiple predictors, given their individual ds and the average intercorrelation among all of the predictors (Sackett & Ellingson, 1997, pp. 713, 718).
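
A sketch of the composite calculation, assuming a unit-weighted sum of k standardized predictors. For such a sum, the mean difference is the sum of the individual ds and the variance is k + k(k-1) times the average intercorrelation. This is my rendering of the idea, not a transcription of the formula in the cited paper, and the input values below are hypothetical.

```python
import math

def composite_d(ds, mean_intercorrelation):
    """d of a unit-weighted composite of standardized predictors.

    ds: individual standardized group differences, one per predictor.
    mean_intercorrelation: average correlation among the predictors.
    """
    k = len(ds)
    composite_sd = math.sqrt(k + k * (k - 1) * mean_intercorrelation)
    return sum(ds) / composite_sd

# Hypothetical inputs: two predictors with d = 1.0 each, correlated at 0.5.
print(round(composite_d([1.0, 1.0], 0.5), 3))  # 1.155
# A predictor with d = 1.0 combined with an uncorrelated one showing no gap.
print(round(composite_d([1.0, 0.0], 0.0), 3))  # 0.707
```

Under these assumptions, combining imperfectly correlated predictors that each show a gap yields a composite d larger than the individual ds, while adding a predictor with no gap dilutes it.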

If we want to apply it to predict score gains and/or score gaps over time, suppose the hispanic-white IQ gap is at best 0.85 SD (Roth et al., 2001, Table 7) and the correlation (corrected for unreliability) between NAEP and IQ is 0.70 (Gottfredson, 2003, Table 3): the predicted NAEP gap would be 0.85*0.70=0.595. That is at the upper end of the 0.50-0.60 observed for the most recent years. With regard to the black-white NAEP gap, the most recent years give a d of about 0.80-0.85 (math) and 0.60-0.65 (reading). Given a black-white IQ gap of 1.1 SD (Dickens & Flynn, 2006a, 2006b) we can predict a NAEP difference of about 1.1*0.70=0.77. The NAEP gap is close to this prediction when averaging the two tests. These minimum ds are what we would expect if we assume the IQ gaps (uncorrected for reliability) are about 1.1 and 0.85 for blacks and hispanics. If the prediction is validated, we can assume the IQ gap has not been reduced yet. However, we could go further. If we accept a value of 0.70 for the heritability of g at adulthood, and assume that the between-group heritabilities are of the same magnitude, the predicted NAEP ds would be 1.1*0.70*0.70=0.54 for blacks and 0.85*0.70*0.70=0.42 for hispanics. The observed gaps are far from being that small. One caveat regarding those predictions concerns my discussion about the misuse of correlations. Even if two variables are highly correlated, they need not show similar mean score trajectories over time. A high correlation is even compatible with improvement on one test and decline on the other. This is because correlation deals only with the rank-ordering of scores, not with the magnitude of scores and/or differences.
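
The arithmetic in this paragraph can be reproduced in a few lines. All inputs are the values cited above; the variable names are mine.

```python
NAEP_IQ_CORRELATION = 0.70  # corrected correlation cited from Gottfredson
HERITABILITY = 0.70         # assumed adult heritability of g, as in the text

iq_gap = {"black-white": 1.10, "hispanic-white": 0.85}

# Predicted NAEP gap from the IQ gap alone: d_IQ * r.
predicted = {g: gap * NAEP_IQ_CORRELATION for g, gap in iq_gap.items()}

# Prediction further scaled by heritability: d_IQ * r * h2.
predicted_h = {g: d * HERITABILITY for g, d in predicted.items()}

for group in iq_gap:
    print(f"{group}: {predicted[group]:.3f} (r only), "
          f"{predicted_h[group]:.3f} (r and heritability)")
```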


Gottfredson, L. S. (2005). Implications of cognitive differences for schooling within diverse societies. Comprehensive handbook of multicultural school psychology, 517-554.
Grissmer, D., Flanagan, A., & Williamson, S. (1998). Why Did the Black-White Score Gap Narrow in the 1970s and 1980s? In C. Jencks & M. Phillips (Eds.), The Black-White Test Score Gap (pp. 182-228). Washington, DC: Brookings Institution Press.
Hauser, R. M., & Huang, M. H. (1996). Trends in Black-White Test Score Differentials. Institute for Research on Poverty. Discussion Paper no. 1110-96.
Hedges, L. V., & Nowell, A. (1998). Black-White Test Score Convergence Since 1965. In C. Jencks & M. Phillips (Eds.), The Black-White Test Score Gap (pp. 149-181). Washington, DC: Brookings Institution Press.
Herrnstein, R. J., & Murray, C. (1994). The Bell Curve: Intelligence and Class Structure in American Life. New York: Free Press.
Rindermann, H., & Thompson, J. (2013). Ability rise in NAEP and narrowing ethnic gaps? Intelligence, 41(6), 821-831.
Sackett, P. R., & Shen, W. (2010). Subgroup differences on cognitive tests in contexts other than personnel selection. In J. Outtz (Ed.), Adverse impact: Implications for organizational staffing and high stakes selection (pp. 323-346). New York, NY: Routledge.
