The Research about Student Evaluations of Teachers


We’ve discussed student evaluations of teachers here before, focusing on the various problems associated with them. Yet the picture may be more complicated. Elizabeth Barre, assistant director of Rice University’s Center for Teaching Excellence, recently posted about her “deep dive” into the voluminous research about student evaluations—research which is typically left undisturbed by news reporters writing about the “latest studies.” Here is what she reports as “the six most surprising insights I took away from the formal research literature on student evaluations”:

  1. Yes, there are studies that have shown no correlation (or even inverse correlations) between the results of student evaluations and student learning. Yet, there are just as many, and in fact many more, that show just the opposite.
  2. As with all social science, this research question is incredibly complex. And insofar as the research literature reflects this complexity, there are few straightforward answers to any questions. If you read anything that suggests otherwise (in either direction), be suspicious.
  3. Despite this complexity, there is wide agreement that a number of independent factors, easily but rarely controlled for, will bias the numerical results of an evaluation. These include, but are not limited to, student motivation, student effort, class size, and discipline (note that gender, grades, and workload are NOT included in this list).
  4. Even when we control for these known biases, the relationship between scores and student learning is not 1 to 1. Most studies have found correlations of around .5. This is a relatively strong positive correlation in the social sciences, but it is important to understand that it means there are still many factors influencing the outcome that we don’t yet understand. Put differently, student evaluations of teaching effectiveness are a useful, but ultimately imperfect, measure of teaching effectiveness.
  5. Despite this recognition, we have not yet been able to find an alternative measure of teaching effectiveness that correlates as strongly with student learning. In other words, they may be imperfect measures, but they are also our best measures.
  6. Finally, if scholars of evaluations agree on anything, they agree that however useful student evaluations might be, they will be made more useful when used in conjunction with other measures of teaching effectiveness.

The whole post is here. Following up on point #6, it would be interesting to hear what other measures of teaching effectiveness philosophy professors (and students) would like to see deployed.
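To make point 4 above concrete, here is a minimal, entirely made-up simulation (not drawn from any of the studies Barre cites) of what a .5 correlation amounts to: evaluations would account for only about a quarter of the variance in learning (r² = .25), leaving the rest to factors the measure doesn’t capture.

```python
import random

random.seed(1)

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# Simulated "learning" and "eval" scores constructed so that their true
# correlation is 0.5 (standard trick: mix a shared component with
# independent noise in a 0.5 : sqrt(1 - 0.25) ratio).
n, r = 10_000, 0.5
learning = [random.gauss(0, 1) for _ in range(n)]
evals = [r * l + (1 - r * r) ** 0.5 * random.gauss(0, 1) for l in learning]

r_hat = pearson(evals, learning)
print(f"correlation ≈ {r_hat:.2f}, variance explained ≈ {r_hat**2:.2f}")
```

With ten thousand simulated students the sample correlation lands very close to .5, and its square near .25: a genuinely useful signal, but one that leaves roughly three quarters of the variation in learning unexplained, which is Barre’s point.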

 

Anon
6 years ago

I think the real lesson here is that social science is not a science or, to put it more nicely, a very, very soft science. It should never be used as the primary foundation for serious policy.

I note, for example, how this argument moves from
1. the studies show both sides are right
2. be very suspicious and cautious
3. business as usual, teaching evaluations good!

John
6 years ago

If only some physicist could take on an empirical theory of teaching evaluations, am I right?

Marcus Arvan
6 years ago

If it really is true that, controlling for biases, most studies show a .5 correlation between student evaluations and learning, that is bad news for the “student evaluations are meaningless” view. A .5 correlation is considered a very strong effect in the social sciences. Of course, this is a big “if.” Is Barre’s analysis of the literature correct?

HT
6 years ago

Contradictory findings and unclarity are present in many (if not all) sciences, at some point or another, and a cautious approach is often warranted. So, jumping from the message in this one blog entry all the way to immediately discarding social sciences proper is a pretty big jump indeed.

So, the steps in your reasoning would be appreciated. For example, would you consider the medical sciences (e.g., biopharmaceutical sciences, epidemiology, biomedical sciences) also not to be proper sciences? If (any of those) medical sciences are not proper sciences, why not? Should we now not use them as ‘the primary foundation for serious policy’ in the medical and health domains (e.g., new cancer treatments, vaccination policies)? On the other hand, if you do consider them proper sciences, what’s the difference between them and the social sciences, which you do not consider proper?

What sets the “hard” sciences apart from the “soft” ones? When does one fall into either category? What can be done to improve the status of social sciences, if they are (as you claim) not proper sciences/soft?


Betsy Barre
6 years ago

It is a big “if,” so you’re right to question it. My post wasn’t an attempt to provide a systematic defense of the validity of evaluations (though I think that can be done). If I were doing that, I would have given you many more citations. This claim is the result found in numerous meta-analyses of various studies performed in the 80s. My favorite of those studies was performed by Cohen in 1981. I like this one because it included 41 “multi-section validity studies.” That is, it was a meta-analysis of 41 studies where students were assigned to different sections of the same course (often randomly) and then given common exams. In my mind, these types of studies are the gold standard methodologically speaking. And you can see the distribution of correlations in those specific studies here: https://twitter.com/elizabethabarre/status/619549458330554368 (it’s a screenshot from a talk I gave at Rice).

There are, however, two possible issues with this result.

First, these results are old. That isn’t necessarily a problem in itself; physicists rely on old results all the time, particularly once something has been found as consistently as these meta-analyses show. And because there haven’t been follow-up meta-analyses using the same methods that undermine this result, I’m pretty confident that it still holds. That said, one could (and some have) make the argument that, unlike physical laws, which don’t change very dramatically, today’s students are very different from students in the 70s. So perhaps the result holds for students from the 70s but not for students today. I’m skeptical of such generational arguments, but I admit it’s a possibility. I’d just like to see more studies (with the same methodology) to convince me it’s true.

Second, almost all of these studies compared the teaching effectiveness of faculty who lectured to their results on an exam that tested basic recall. One can reasonably ask whether teaching evals would correlate as strongly for those of us in philosophy who don’t lecture and whose learning outcomes are not primarily about recall. There have been recent studies on these classes, but again, not many comprehensive meta-analyses. So it’s hard to know.

The overall point I wanted to make, though, is that this is super complex. And the reporting on the subject doesn’t mirror that complexity.

Anon
6 years ago

“Despite this complexity, there is wide agreement that a number of independent factors, easily but rarely controlled for, will bias the numerical results of an evaluation. These include, but are not limited to, student motivation, student effort, class size, and discipline”

Haven’t the recent popular media articles been complaining about student evaluations as they are actually used–i.e., without such controls? The lesson here seems to be, not that the recent complaints are invalid, but that *there is a better way to use student evaluations*.

Anon
6 years ago

Interesting. Do you think these studies are better than studies that look for correlations with performance in subsequent courses (e.g., correlations between evals in Calc 1 and performance in Calc 2)?

Betsy Barre
6 years ago

Right. So this method is the newest rage, although it’s almost exclusively used by economists (not that this is *necessarily* a problem :)). It was the method that was used in the famous/controversial study from 2012 (?) that showed adjuncts at Northwestern were better teachers than their tenure-track faculty. But it is not actually measuring learning in terms of future *performance.* It’s measuring it in terms of future *grades.* And given what we know about the unreliability of grades as an indicator of performance, that distinction is important.

I admit to being intrigued by the method, and it is almost certainly better than relying upon current grades as a measure of learning. As I understand it, the idea is that if you use grades in future courses as an indicator, you’ll avoid some of the problems of current grades because 1) original faculty members don’t have control over those future grades and 2) students are more likely to be randomly distributed across easy and hard graders for the next class.

This is cool, and given that it’s harder and harder to recreate the kind of multi-section studies highlighted above (because we are–with good reason–moving away from massive intro courses taught in different sections with common exams), it may be our best option. But I’m still more likely to trust objective measures of performance (i.e., exams), rather than grades … *even if* it’s a future exam. I guess what I’d really like to see are more studies comparing these two methods (so far, most of the argument has been that this future grade method is far better than current grade studies, which is, well … obvious).

It’s also worth noting that a few of the most recent studies that show an inverse correlation between evals and “learning” use this method. But there have only been a handful of these (in part because it’s such a tricky method to pull off), compared to the hundreds of other studies.

So, for me, the jury is still out. But if I have to place bets, I’m still going with the .5 correlation.
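Barre’s worry about current grades can be sketched with a toy model (all numbers and assumptions invented for illustration, not taken from any study in the thread): suppose evaluations pick up both real instructor quality and grading leniency, that current grades also reflect that same instructor’s leniency, but that grades in the *next* course do not, because the next course has its own, independently lenient or harsh grader. Then the eval-to-current-grade correlation is inflated relative to the eval-to-future-grade correlation:

```python
import random

random.seed(2)

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

n = 5_000
quality = [random.gauss(0, 1) for _ in range(n)]   # instructor's real effect on learning
leniency = [random.gauss(0, 1) for _ in range(n)]  # instructor's grading leniency

# Assumption: students rate both good teachers AND lenient graders more highly.
evals = [q + 0.5 * l + random.gauss(0, 1) for q, l in zip(quality, leniency)]

# Current grade reflects learning plus the SAME instructor's leniency;
# future grade reflects learning plus a DIFFERENT grader's (independent) leniency.
current_grade = [q + l + random.gauss(0, 1) for q, l in zip(quality, leniency)]
future_grade = [q + random.gauss(0, 1) + random.gauss(0, 1) for q in quality]

print("evals vs current grades:", round(pearson(evals, current_grade), 2))
print("evals vs future grades: ", round(pearson(evals, future_grade), 2))
```

In this model the first correlation overstates the eval–learning link because leniency sits on both sides of it; the second removes that shared term, which is the intuition behind the future-grades design (at the cost, as Barre notes, of inheriting the general noisiness of grades).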

Betsy Barre
6 years ago

^That should be: *even if* it’s a future GRADE (not exam). Sorry!

Anon
6 years ago

Thanks for the reply! I think future exams would be better than future grades (in appropriate courses), provided that the exams cover *new material and new approaches* not covered in the original course, and provided that the content of the future exams not be known to the instructor of the original course. The main advantage that using subsequent *anything* has over using end-of-course exams is that it helps you to measure a student’s general ability with a certain subject matter–including, e.g., his/her ability to process *new* material in that general subject matter, and his/her ability to understand *new* ways of approaching that subject matter.

JH
6 years ago

I would recommend reading Linda Nilson’s paper “Time to Raise Questions About Student Ratings” for a good rebuttal to some of Barre’s claims in the comment section. Nilson points out that most of the research that favors student evaluations was conducted in the 1970s and 80s and presents evidence that students have changed since then–students are now more consumerist and have more instrumental attitudes toward higher education (students have probably always been like this, but they are now relatively more instrumentalist about these things). She also argues that little recent research on student evaluations finds a strong positive correlation between learning and evaluations. So, student evaluations are more biased than they were in the past. She notes that a recent meta-analysis “could not locate a single study documenting a positive relationship between student learning and student ratings that was published after 1990.” You can read part of her essay here.

Greg
6 years ago

I was going to mention Linda Nilson’s work as well, but JH beat me to it. And if you want to read a shorter, 3-paragraph version of her view, instead of the whole paper, you can look at the post she made here:
https://listserv.nd.edu/cgi-bin/wa?A2=ind1306&L=POD&F=&S=&P=34370
(And there is a small bibliography there, too.)

Jonathan Livengood
6 years ago

Just two quick thoughts.

First, it’s not at all clear that using Pearson’s r as a measure of correlation is reliable when looking at student evaluations, since student evaluation scores are probably not measured on an interval scale. Some kind of rank-based measure of correlation would be better. (For similar reasons, some statisticians have complained about the use of simple averages in assessing the quality of instructors. See: https://www.stat.berkeley.edu/~stark/Preprints/evaluations14.pdf) So, I’m suspicious of the 0.5 claim.

Second, if we have a reliable measure of learning, then we ought to use that, instead of student evaluations. But suppose the situation is like this: we have a reliable measure of learning for *some* kinds of course but not for others; and we observe a relationship between evaluations and learning for those courses where we have a reliable measure of learning. The temptation is to infer that the relationship also holds for those courses where we don’t have a good measure of learning. If so, then we could use student evaluations to measure learning indirectly. But I’m not sure the inference is warranted. We already know that there is a serious, important difference between the two kinds of course, since we only have a reliable measure of learning for one of them.
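Livengood’s first point can be illustrated with a small, hypothetical example (the scores below are invented): Pearson’s r depends on the numbers assigned to the labels, so a monotone re-coding of an ordinal 1–5 scale changes r, while a rank-based measure such as Spearman’s rho is unaffected.

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def ranks(x):
    """Ranks (1-based), with ties given their average rank."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0.0] * len(x)
    i = 0
    while i < len(x):
        j = i
        while j + 1 < len(x) and x[order[j + 1]] == x[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical 1-5 eval scores and common-exam scores for ten sections
evals = [2, 3, 3, 4, 4, 4, 5, 5, 3, 4]
exams = [61, 70, 68, 75, 72, 80, 88, 85, 66, 78]

# Monotone re-coding of the same ordinal labels (only 5 -> 9); the ordering
# of sections is unchanged, so the data "say" the same thing ordinally.
recoded = [9 if e == 5 else e for e in evals]

print("Pearson: ", round(pearson(evals, exams), 3), "vs", round(pearson(recoded, exams), 3))
print("Spearman:", round(spearman(evals, exams), 3), "vs", round(spearman(recoded, exams), 3))
```

The two Pearson values disagree even though nothing about the ordinal information changed, while the two Spearman values are identical, which is why a rank-based measure is the safer choice if the 1–5 steps can’t be assumed equal.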

Neil
6 years ago

There’s another reason to be suspicious of old social science. In psychology, effect sizes for longstanding paradigms are typically much smaller today than previously. The most likely explanation is that research is far more rigorous in a variety of ways. If this is true in other social sciences too, we should take research this old with a whole sack of salt.

John Turri
6 years ago

You think that’s a lesson that can be learned from a blog post?

John Turri
6 years ago

Sorry, I intended my question to appear as a reply to Anon 10:07 am. Allow me to rephrase:

Anon 10:37,
You think that’s a lesson that can be learned from a blog post?

Scu
6 years ago

I just wanted to highlight a point Professor Barre makes in the video linked in her blog post: there does seem to be some relationship between gender and student evals when the professor’s gender differs from that of their students. So, a female professor in a classroom with a lot of male students (which might happen more frequently in philosophy) could have lower evaluations.

And if you are still reading and responding, Professor Barre, correct me if I misunderstood that. Also, I was curious about any data about bias with race and student evaluations. Thanks for doing this work.

philosophami
6 years ago

The assumption in (1) is that more studies means “more better” evidence. Although the article hints at it (in point 2), it might be worth pointing out what is obvious: quality of evidence (size of study, controls, etc.) should be weighed much more heavily than quantity. Lots of crappy studies don’t trump one high-quality study. In order to properly evaluate the respective evidence supporting either position, we’d have to know the quality of the studies included for analysis.

Betsy Barre
6 years ago

Hi, all. I’ve been trying to sit out for a bit to avoid over-dominating this discussion.

Thanks to all of you for your comments. At the end of the day, the primary point of my post was to encourage precisely this sort of conversation. I’ve become convinced that there is at least room for reasonable disagreement here, but too often we talk about this issue as if it’s obvious (whether we are cheerleaders for evaluations or think they are worthless). And that’s what I want to change.

As for the question asked about gender, yes, you’re reading/hearing me right on that point. Gender is still being studied quite a bit, though (i.e., there is even more controversy about gender than about general validity). One thing to be said about the mountains of literature on student evaluations is that almost none of it covers the qualitative comments. And it seems to me that the qualitative comments are where most of the gender problems creep in, at least judging from the Rate My Professor visualization and my own experience. In fact, someone did a study of the numerical rankings in RMP and found no discernible difference in the numbers on the basis of gender, which confirms my sense that the quantitative and qualitative data somehow differ. (For more on my completely unscientific hypothesis about this, check out Matt Reed’s column in Inside Higher Ed from this morning.) There was also a two-part meta-analysis on gender completed by Feldman that showed, if anything, women have slightly higher scores. But that study was done a while ago, so the same concerns raised by Nilson (if you find them compelling) would apply here.

As for race, sadly, there haven’t been many studies yet. So the jury is still out there!

Dept Chair
6 years ago

#20– the qualitative/quantitative point is really interesting. At my institution we did some statistical work on our evaluations, separating by (claimed) gender of student and professor, among other things. People walked into the meeting convinced there would be big gender biases, but the difference in median overall rating was very tiny, something like .01 on a 5 point scale. Differences in written comments would help explain the discrepancy.

Anyway, thanks for doing this work. It’s fascinating and important.

Kathleen Lowrey
6 years ago

I’m not a statistician so I don’t know how to phrase this question effectively, but something I’ve always wondered about student evaluations is the distribution of results. I am persuaded that “all your student evaluations are terrible” means that you are, probably, a terrible teacher and “all of your student evaluations are glowing” means that you are, probably, a pretty great teacher. But how does everybody else fall, and how should we interpret that? What I mean is, let’s say that there were a perfectly even fractioning of results (on a scale of 1 – 5, each step receives about 20% of profs). I doubt that is the case, and would be very puzzled if it were, though if it did happen that way I think it would make the way evaluations are often used in practice (as effective rankings of relative merit) more logical.
What I guess happens is that there are a small number of 1s and 2s, a small number of 5s, and a lot of 3s and 4s (from everything I have read, student evaluations skew rather generously overall). So if you drill down only into the 3s and 4s, how do you robustly distinguish among complex effects, all of which we know play a role? Gender, race, student preparedness and interest, class size, but also contextual expectations: “this professor more or less seemed fine, and I think you are supposed to give more or less fine professors either a 3 or a 4.” Finely calibrating out all of those, how would it be possible to then suss out the portion that correlates (or does not) with “actual teacher effectiveness at teaching”?

If the vast majority of professors are 3s or 4s, the factors that make them 3s or 4s are so multiply determined that studies might be able to make a case for almost anything about them. Which is bad when these are used to straightforwardly rank teaching, as they often are: to say a 4 teacher is a more effective teacher than a 3 teacher, which really might not be true at all, when almost everyone clusters at 3 and 4 for a huge number of overlapping reasons. Have any studies, though, looked only at the extremes? This seems like it would be far more useful, both to suss out “what to do in the classroom” and “what not to do in the classroom,” and also to really find evidence of bias and detect correlation (or not) to learning.
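The clustering worry can be given a rough shape in simulation (a toy model with made-up distributions and thresholds, not data from any study): if most instructors’ underlying impressions get lumped into the 3 and 4 buckets of a 1–5 scale, the correlation between the reported ratings and learning comes out weaker than the correlation the underlying continuous impressions would show.

```python
import random

random.seed(3)

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def to_five_point(z):
    """Squash a continuous impression onto 1-5, with most mass at 3 and 4
    (thresholds are invented for illustration)."""
    if z < -2.0:
        return 1
    if z < -1.0:
        return 2
    if z < 0.5:
        return 3
    if z < 2.0:
        return 4
    return 5

n = 20_000
effectiveness = [random.gauss(0, 1) for _ in range(n)]
learning = [e + random.gauss(0, 1) for e in effectiveness]
impression = [e + random.gauss(0, 1) for e in effectiveness]  # continuous judgment
ratings = [to_five_point(z) for z in impression]

share_3_or_4 = sum(r in (3, 4) for r in ratings) / n
print(f"share of 3s and 4s: {share_3_or_4:.0%}")
print("continuous impression vs learning:", round(pearson(impression, learning), 2))
print("1-5 rating vs learning:           ", round(pearson(ratings, learning), 2))
```

The coarse 1–5 version of the very same judgments correlates less strongly with learning than the continuous version does, which is one concrete way the “everyone is a 3 or a 4” compression can blur the differences evaluations are supposed to detect.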

Kathleen Lowrey
6 years ago

One more question I don’t know how to phrase well, but which occurs to me. Teaching evaluations are treated as measures of something like tallness (my height does not affect your height), such that student reports are always independent of one another and absolute (how tall or short was this guy? pretty tall, pretty short, very tall, very short, etc.). But what if university teaching is more like an ecosystem, where there are lots of ambient influences and expectations? I don’t mean just “the pinnacle course is taught by a guy with a pipe and elbow patches” or “by a zany mad male scientist” or “by a firebrand feminist during my rebellious years,” though that will play a role. I also mean students seeing their evaluations not as independent but more like “I am saving my 1s and 5s for special occasions and I expect mostly to experience 3s and 4s,” such that no matter what you do about “student engagement” and “off with bad proffies’ heads” and “centres for the improvement of teaching and learning,” the distribution of results is going to be the same until the end of time?