In Defense of Benchmarking (guest post)
There’s a reason for instructors to meet with their teaching assistants to grade some sample assignments together, but it’s not what you think.
In the following guest post*, Julia Smith (University of Toronto) explains.
In Defense of Benchmarking
by Julia Smith
At my institution ‘benchmarking’ is a common practice. For large courses that have more than one teaching assistant, the course instructor will often plan a benchmarking session for each writing assignment, to be held before the TAs start grading. The TAs and course instructor will get together for an hour or so, read through a few of the students’ essays, discuss the merits and shortcomings of each one, and decide on the grade, or grade range, appropriate for the essay. In the course of my time as a teaching assistant and course instructor, I’ve heard people express the view that benchmarking is a waste of time, and I’ve also participated in benchmarking sessions in which the session was treated as an important and valuable exercise. I believe that benchmarking sessions are valuable, but not for the reasons usually given to justify the practice.
The standard justification for benchmarking is that the time spent collectively grading papers will help ‘calibrate’ the respective grading scales of the graders. The thought is that prior to benchmarking, some graders will be disposed to be more exacting; others will be disposed to be more lenient. Benchmarking fixes this: assigning grades together encourages the harsh graders and the easy graders to meet in the middle.
This justification for benchmarking presupposes two things. First, it assumes that the benchmarking session will in fact be an effective way to better align graders’ standards. Second, it assumes that, in a large class with multiple graders, it is a desirable goal to eliminate any discrepancies between the standards of evaluation employed by different TAs. Both assumptions are doubtful.
First: does benchmarking calibrate the graders? It’s far from clear that discussing two or three papers, read hastily—which is all that there is time for in a typical benchmarking session—will move TAs’ grading standards into greater alignment. Since benchmarking occurs before those participating in the benchmarking session have read any of the students’ submissions for the assignment, graders in a benchmarking session won’t be able to compare the papers chosen for the benchmarking session to other submissions. As all experienced graders know, the range in quality of student submissions for any given assignment can make a difference to the standard of evaluation that one uses. This means that the grading that is done in the benchmarking session is done in a vacuum; graders lack an important piece of contextual knowledge that would help them to determine what grade the papers under discussion ought to receive. After the benchmarking session, graders might reasonably determine that the grades decided on within the session were inaccurate in light of the information provided by reading a larger sample of student submissions—a fact that is, in my experience, often explicitly acknowledged by instructors in benchmarking sessions! It’s hard to see how deciding on grades in this context—where each grader might reasonably make their own changes to the grades that were decided on in the session—could effectively bring the respective graders’ standards into alignment.
Suppose, though, that relevant information were not limited in this way; suppose that each grader had read through all the student submissions prior to benchmarking. In that case, could the benchmarking session serve its intended purpose of calibration? I think it’s still unlikely. Graders’ standards of evaluation are functions of many factors. There are many good-making features of philosophical writing at the undergraduate level—accurate exposition, strong argumentation, good organization, clarity, creativity, originality, evidence of broad philosophical knowledge, following the assignment instructions, proper citation practices, etc.—and different graders might reasonably have different weightings of the relative importance of these features. While discussing a few student papers might raise (and dispatch) a few questions about how to weight some good-making features relative to one another, it’s not the case that discussing two or three papers in the span of an hour will be sufficient to answer all the questions about relative weightings that would need to be answered in order for different graders to converge on the same standard of evaluation.
I’ve given a couple reasons to think that benchmarking sessions don’t fulfill their stated purpose of calibrating the graders’ standards (or that they do so only poorly). Still, perhaps a benchmarking session that calibrates the graders’ standards imperfectly is preferable to no benchmarking session at all. Some progress towards calibration is surely better than none. This brings us to the second assumption behind the standard justification for benchmarking: that in a class with multiple graders, it’s desirable to eliminate discrepancies between graders’ standards. This assumption is presumably motivated by considerations of fairness. If graders employ different standards, then some students enrolled in the course will get lower grades than others simply because of who happened to grade their paper. If benchmarking prevents this from happening, then benchmarking corrects injustices in the grading process, and is for that reason desirable.
Not everyone finds this reasoning persuasive, because not everyone thinks it’s unfair to have TAs employ different grading standards. It is pedagogically important that what students hear from the course instructor is consistent with what they hear from their teaching assistant; when instructors and TAs (appear to) give conflicting information or advice, the experience can be extremely frustrating and disorienting for students. TAs should do their best to implement any grading guidelines or rubrics that are provided by the course instructor and to amplify any information from the instructor about how to complete assignments. But in addition to doing these things, good TAs will often provide writing instruction and guidance that goes beyond what the students hear in lecture. Indeed, in the various TA training events I’ve attended at my institution and within my department, the unique role tutorials play in providing students with discipline-specific writing instruction is often emphasized. While we can expect that every TA will be reiterating in tutorial the assignment instructions provided by the instructor, different TAs may well emphasize different things when it comes to the additional writing support they provide for their students. One TA might have spent a lot of time having their students practice writing clear and simple prose, while another TA might have emphasized the importance of developing a philosophical dialectic in one’s essay. When it comes time for these TAs to grade the submissions they’ve received from their students, what could be more appropriate than to allow them to emphasize in their feedback to students the specific elements of philosophical writing that they’ve been working on in tutorial? This will mean that different TAs will weight different good-making features of philosophical writing differently, but that’s okay: there’s no single right way to write a philosophy paper, or to evaluate one. 
Far from being unfair, allowing different TAs to implement different standards of evaluation is a good practice, so long as the TA’s grading standards do not conflict with expectations laid out by the course instructor, and so long as they are not unfair in other ways. This reasoning suggests that it’s misguided to spend time trying to promote the goal of eliminating discrepancies in graders’ standards.
So, if benchmarking is not valuable because it helps calibrate graders’ standards, why is it valuable? In my opinion, it’s because of its role in distributing cognitive labour: benchmarking speeds up the process of identifying argumentative trends that graders will encounter when grading and gives them a forum to talk over the merits of these argumentative moves.
The usual format for an undergraduate philosophy paper is to have students choose to write on one of several prompts, or to complete a writing assignment with a set ‘scaffolding’. Among the student essays submitted in response to these kinds of assignments, it’s typical to see trends in the kinds of argumentative moves students make. Perhaps many of the submissions will discuss an objection that was of much interest in lecture. Perhaps many students will, on their own, come up with a particular objection that is informed by their background knowledge or common cultural assumptions (e.g. the ever-popular appeal to the subjectivity of taste, morals, or truth). Benchmarking is useful because it divides the cognitive labour of discerning what the common student strategies and pitfalls will be with respect to a particular assignment and provides a forum for discussing the philosophical merits of various specific moves that TAs will encounter as they grade.
One might object that my argument for benchmarking is condescending to TAs. TAs are competent to assess the merits of various argumentative moves on their own; they don’t need extra help thinking through which arguments are good and which ones aren’t. This objection both underestimates the value of collaboration and fails to consider the constraints TAs have on their time. My view is that benchmarking is valuable not only for (less experienced) TAs, but for (more experienced) instructors. Everyone, no matter how competent or experienced, can come to better grasp and appreciate the merits of various argumentative moves through discussion. Part of the value of benchmarking is also that it saves time. In an ideal world, TAs would have ample time to read the set texts at their leisure. They would have time to map out the promising objections and possible replies, and they would have ample time to consider the merits of various argumentative moves. In reality, TAs often don’t have the luxury of extra time to devote to these tasks (i.e. they are not paid for this work), so an hour-long benchmarking session in which other minds help with this labour improves efficiency.
If this is right, then benchmarking is valuable, but not for the reason usually given.
I’m a big fan of benchmarking (alongside other moderating activities). From personal experience, one real advantage of this that is missed applies to TAs who, whilst experienced and capable philosophers, are new to a particular institution. A lot of institutions do have subtly (and even not so subtly) different priorities when marking, making the point OP makes about how we weigh different essay virtues even more important.
This is especially important when it comes to international TAs who did their undergraduate or MA study in a different country.
Using my own example of moving from the UK to Europe, I found that students are expected to write quite a bit more (both in terms of numbers of papers and in terms of length of each paper). Consequently, whereas brevity is reasonably highly valued in the UK (at least in my experience, though typically as a vehicle for clarity), it is not quite so essential where I am now (though clearly not unimportant). Conversely, a great deal more breadth (though not necessarily depth) is expected of students, sooner (again, at least in my limited experience). I know that benchmarking for a couple of courses in the first semester I taught helped me make the adjustments to a new institutional culture!
I had a similar experience moving from Canada to North America.
I thought of Canada as part of North America … the top part … Am I wrong?
I was poking fun at the OP: “Using my own example of moving from the UK to Europe…”
Thanks, Gareth. I agree that it is important for TAs to understand the priorities of the institution or academic culture within which they are working. But I think that communicating these norms would be more efficiently done in a general training session for new TAs. My post is meant to defend the value of holding benchmarking sessions for particular assignments *in addition to* whatever general training the TAs receive at the start of the academic year.
Thanks, Julia! This is an interesting justification. I wonder how it bears on the ways instructors should identify the assignments to be used in benchmarking sessions. When I’ve held them in the past, I’ve skimmed just enough submissions to identify one that looked good, one that looked bad, and perhaps a third that looked middling. It was a pretty quick and painless selection process. But if the justification for benchmarking is instead to enable graders to identify common student arguments and pitfalls, perhaps I should have been giving the whole batch of submissions, or at least a significantly larger chunk of it, a preliminary read, in order to discern which arguments/pitfalls really are the common ones. But in the courses where benchmarking seems most valuable (ones large enough to have multiple graders), that would make it more practically onerous to conduct. Any thoughts about how to do this efficiently, but in a way that preserves the point of the exercise?
Thanks for your question, Griffin! The task of selecting papers for the benchmarking session will, I think, be easier the more experienced an instructor one is. If you’re teaching the same course for a second (or third, or fourth) time and use the same or similar essay prompts, you’ll probably have some sense of what kinds of argumentative moves the students are likely to make before looking at any papers. When teaching a course for the first time, I suspect that the benchmarking session will be more valuable when the instructor invests more time in skimming a larger batch of papers and selecting ones that are representative of argumentative trends.