The identification of recent anomalies in Texas’s STAAR tests has led many in the education and policy communities to demand that the Texas Education Agency (TEA) and its test contractor do a better job building the tests, the reading tests in particular. Critics cite evidence that the reading tests may target above-grade-level reading, which, if true, would put TEA out of compliance with state law.
This moment when STAAR is making headlines offers an excellent opportunity to make clear what this particular type of test can and cannot do, and the degree to which Texas has asked it to carry a weight for which it was never designed.
First, the STAAR test is designed to be predictive: scores can be expected to be consistent over time. When a change occurs, we can infer that something happened and explore the change for causes. Only after a cause is known could we pass judgment, since before that point we have nothing to judge. This holds for favorable and unfavorable judgments alike, at every test score.
Creating that consistency requires the test maker to limit what gets tested. Many items are field-tested, and only those that fall within a narrow statistical window make it onto the test. Any predictive test therefore samples only a small subset of a content area, but when that small subset correlates with the larger domain, it can be useful: researchers can use the small slice of the domain visible in a test score to make inferences about the larger domain.
Second, each STAAR score is only an estimate and must be treated as such. Because of how the tests are designed, the estimates won’t vary much if a student tests multiple times, but they will vary, and for a few students they will vary a great deal. Because the tests occur at a single moment in time and attempt to reflect the efforts of an entire year, the scores cannot be expected to be entirely accurate in every case.
Third, a variety of interpretive lenses can be applied to such tests. A normative lens compares test takers to other test takers. A criterion-referenced lens compares test takers to a score to which additional meaning has been assigned. The criterion is itself an estimate, as two different standard-setting groups will likely place it in different spots. Criterion scores need to be validated, or the interpretations risk being inappropriate.
Fourth, even if STAAR is a technically sound test, that does not mean it can be used however anyone likes. A technically sound test is one built in careful accordance with established best practices; its estimates should be useful, provided each use accounts for the test’s limitations.
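The point above that every score is an estimate can be made concrete with a small sketch. The numbers here are hypothetical, not STAAR’s actual figures: a standard error of measurement (SEM) of 8 scale-score points is assumed purely for illustration.

```python
# Illustrative sketch only: a test score is an estimate, so it carries a
# standard error of measurement (SEM). The SEM of 8 points and the score
# of 1500 are hypothetical values, not TEA's published figures.

def score_band(observed: float, sem: float, z: float = 1.96) -> tuple[float, float]:
    """Return the (low, high) band the 'true' score likely falls in,
    using a normal approximation at the given z (1.96 ~= 95%)."""
    return observed - z * sem, observed + z * sem

low, high = score_band(observed=1500, sem=8)
print(f"Observed 1500 -> true score plausibly between {low:.0f} and {high:.0f}")
```

On these assumed numbers, a single observed score is consistent with a band more than 30 scale points wide, which is why treating one score as an exact measure of a child invites error.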
Compelling evidence suggests the STAAR tests miss the mark in several ways.
High-stakes decisions, especially those with negative consequences, should be viewed with suspicion when they rest on imprecise estimates, since they risk being wrong. Consider how wrong your own judgments would be if you made them without clear evidence, and how detrimental they would be to those being judged. Children will land at every possible test score for a variety of reasons. Once those reasons are known, some may warrant judgment, but some most certainly will not.
Evidence from several readability studies, as well as data from other tests, indicates that the STAAR reading passages are disproportionately above grade level. If the passages are confirmed to be above grade level, that would undercut the test’s ability to support grade-level inferences. Solving this issue would not resolve the others, but it could help clarify whether Texas has been mis-estimating, and in particular underestimating, reading scores.
Common sense would suggest three actions:
First, analyze the STAAR reading tests to determine the degree to which the requirement for grade-level inferences (readability) may have been violated. If it was, TEA may have forced students to take tests for which they were not prepared. The agency regularly cites the decline in NAEP reading scores; if TEA placed negative judgments where they were not warranted, that in itself could have made schools less effective and contributed to that decline.
Second, place a two-year moratorium on any judgments made from STAAR, including the assignment of school grades. School grades depend almost entirely on whether students pass STAAR, which in turn rests on a set of estimates far less precise than most people think. Add in the questions raised about reading levels, and a careful look seems more than justified. To satisfy federal requirements, students will still need to test and struggling schools will still need to be identified, but that can be done with multiple measures and observations, and the process can be made positive rather than punitive.
Third, use that two-year moratorium to recognize that Texas’s test-addicted accountability systems are incapable of achieving their policy goals, not because tests are bad, but because they were not designed to do what has been asked of them. Texas deserves an accountability system that places student need at the center of the work. What we have in front of us is the first opportunity in a great while to do just that.
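The readability analysis proposed in the first recommendation could begin with automated formulas before moving to expert review. A minimal sketch, assuming the Flesch-Kincaid grade-level formula and a crude vowel-group syllable counter; a real audit would use several validated measures and leveled reference passages, not this toy heuristic:

```python
# Minimal sketch of one readability check (Flesch-Kincaid grade level).
# The syllable counter is a rough vowel-group heuristic, and the sample
# passage is invented; neither represents an actual STAAR passage or
# TEA's methodology.
import re

def count_syllables(word: str) -> int:
    # Count runs of vowels (including y) as a rough syllable proxy.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade(text: str) -> float:
    """Flesch-Kincaid grade level: 0.39*(words/sentence) + 11.8*(syllables/word) - 15.59."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59

passage = "The cat sat on the mat. It was warm."
print(f"Estimated grade level: {fk_grade(passage):.1f}")
```

Run against every operational passage, even a rough check like this would show whether passage difficulty clusters at, above, or below the tested grade, which is the empirical question the first recommendation asks TEA to answer.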