Sunday, August 7, 2016

It's not the science that is junk, it's the measures, Part II

So a day or so after my last post, It's not the science that is junk, it's the measures, I came across this interview of Jesse Rothstein by Rachel Cohen in the American Prospect. There's lots of good stuff in there and it's worth reading. I don't mean to take away from the import of Jesse Rothstein's work (I'm a big fan of his, and of Rachel Cohen's), but a piece of it kind of demonstrates what I was trying to get at in my last post.

Talking about VAM, Rothstein said,
It’s very controversial and I’ve argued that one of the flaws of it is that even though VAM shows the average growth of a teacher’s student, that’s not the same thing as showing a teacher’s effect, because teachers teach very different groups of students. 
If I’m a teacher who is known to be really good with students with attention-deficit disorder, and all those kids get put in my class, they don’t, on average, gain as much as other students, and I look less effective. But that might be because I was systematically given the kids who wouldn’t gain very much.
So, yes, this is a very good point: there is a difference between showing "the average growth of a teacher's student" and showing "a teacher's effect." And yes, according to test scores and how well students perform on them, teachers can look more effective or less effective, regardless of how good they are at teaching.
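To make the assignment problem concrete, here is a toy simulation of my own (not Rothstein's or Chetty's actual model, and the numbers are invented): two teachers with the same true effect, where one is systematically assigned students who would gain less with any teacher. A naive average-gain measure makes that teacher look worse.

```python
import random

random.seed(0)

def observed_gains(n_students, expected_student_gain, true_teacher_effect):
    """Observed test-score gain = what the student would gain anyway
    + the teacher's true contribution + noise."""
    return [expected_student_gain + true_teacher_effect + random.gauss(0, 2)
            for _ in range(n_students)]

# Both teachers have the SAME true effect (+3 points), but Teacher B is
# systematically assigned students who tend to gain less no matter who
# teaches them (expected gain of 5 points vs. 10 points).
gains_a = observed_gains(200, expected_student_gain=10, true_teacher_effect=3)
gains_b = observed_gains(200, expected_student_gain=5, true_teacher_effect=3)

avg_a = sum(gains_a) / len(gains_a)  # close to 13
avg_b = sum(gains_b) / len(gains_b)  # close to 8

print(f"Teacher A average gain: {avg_a:.1f}")
print(f"Teacher B average gain: {avg_b:.1f}")
# Equal true teaching effects, unequal "measured effectiveness": the gap
# reflects who was assigned to whom, not teaching quality.
```

The whole point of a value-added model is to adjust away that sorting; Rothstein's argument is about how well the adjustment actually works when assignment isn't random.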

Then, when she asks if he is skeptical of VAM, he says,
I think the metrics are not as good as the plaintiffs made them out to be. There are bias issues, among others. One big issue is that evaluating teachers based on value-added encourages teachers to teach to the state test. 
During the Vergara trials you testified against some of Harvard economist Raj Chetty's VAM research, and the two of you have been going back and forth ever since. Can you describe what you two are arguing about?  
Raj’s testimony at the trial was very focused on his work regarding teacher VAM. After the trial, I really dug in to understand his work, and I probed into some of his assumptions, and found that they didn’t really hold up. So while he was arguing that VAM showed unbiased results, and VAM results tell you a lot about a teacher’s long-term outcomes, I concluded that what his approach really showed was that value-added scores are moderately biased, and that they don’t really tell us one way or another about a teacher’s long-term outcomes.
If you look at this response and then go back to the previous one I pulled out, you see that Rothstein is referencing "growth" and then "bias": that certain types of students won't "gain as much as other students," that the value-added scores are "moderately biased," and that they don't tell us much about a teacher's "long-term outcomes."

Nowhere in there is there a repudiation of the measures, of the tests themselves, or even a question about their validity. His responses seem to assume that these tests can show growth and gains in learning, and that determining a teacher's effectiveness according to test scores is unfair only because some students won't perform as well on them. Nowhere does he question whether the tests themselves reflect real learning, good teaching, or quality education.

And the critique about bias and assumptions has to do with the model, not with what is being fed into the model, i.e., test scores. Arguments about the strength of statistical models are worth having, but those arguments should start with probing what's being fed into them.

If someone like Jesse Rothstein isn't questioning that, then test-based accountability isn't going away anytime soon. It will forever be a matter of tinkering with models.

Wednesday, August 3, 2016

It's not the science that is junk, it's the measures

So I recently had occasion to read a whole bunch of studies on charter schools and one type I read was about their effectiveness. I read the CREDO studies and I read critiques of the CREDO studies and I read meta-analyses and I read smaller studies.

Anyway, I want to go back to something I used to say, and that I have heard others who are similarly skeptical of Big Ed Reform say, and that is the notion of "junk science." A lot of us have called VAM and other studies of educational effectiveness "junk science." I know I did, indignantly. But you know what? I didn't really know what I was saying. (This is one reason I went back to get my PhD, so I would have more understanding of these kinds of things.)

As I was reading all of these studies on the effectiveness of charter schools, I remembered this post by Matt DiCarlo on the Shanker Blog from over three years ago. I remembered that reading it gave me pause about calling what I did "junk science," and I stopped doing so, but even then, I couldn't fully relate to what he was saying:
Now, I personally am not opposed to using these estimates in evaluations and other personnel policies, but I certainly understand opponents’ skepticism. For one thing, there are some states and districts in which design and implementation has been somewhat careless, and, in these situations, I very much share the skepticism. Moreover, the common argument that evaluations, in order to be "meaningful," must consist of value-added measures in a heavily-weighted role (e.g., 45-50 percent) is, in my view, unsupportable. 
All that said, calling value-added “junk science” completely obscures the important issues. The real questions here are less about the merits of the models per se than how they're being used. 
If value-added is “junk science” regardless of how it's employed, then a fairly large chunk of social scientific research is “junk science." If that’s your opinion, then okay – you’re entitled to it – but it’s not very compelling, at least in my (admittedly biased) view.
I am still no statistics expert and I never will be, but I have a much greater appreciation for what these models and analyses can tell us and what they don't tell us and what their limitations are. And these researchers conducting these studies, they may have different ways of conducting the studies and different opinions regarding which factors should be included and which shouldn't, but they know what they're doing, most of them at least, and they go to great pains to be thorough and thoughtful about their design and methodology and to explain the models they're using and to account for the results that these models produce. So the problem is not with the science.

DiCarlo says the problem is in how the models are being used. Yes. But another problem, as far as I could glean, is with the measures they're using. "Student learning” and “student achievement” have come to be represented by test scores. That is not my currency of educational quality, but it is the current currency in educational research and policy. I think many of these tests are of dubious quality, and I don't think they provide a true measure of what students have actually learned or of the quality of their educational experience. Richer, deeper, more authentic student learning in charter schools, and in schools in general, could be measured if we thought creatively and holistically about it. But we're not doing that, and we're not incentivized to do that. So much of the money for educational research, and so much of the recognition, goes to researchers who use these test scores as measures. Because there's not much else. Even researchers who don't agree that they are good measures will say as much in one paragraph and then cite them as evidence of effectiveness, or lack thereof, in the next.

To me, it's kind of like chicken nuggets and milkshakes. McFastFood place has a sound process for making chicken nuggets and milkshakes, but once all is said and done, how much actual quality chicken meat and milk come out of the other side? How much actual nutrition? How much actual, recognizable learning and educational quality gets funneled through these tests and comes out of the other side of these statistical analyses that use test scores as measures?

I doubt much.