In Defense of Benchmarking (guest post)

At my institution ‘benchmarking’ is a common practice. For large courses that have more than one teaching assistant, the course instructor will often plan a benchmarking session for each writing assignment, to be held before the TAs start grading. The TAs and course instructor will get together for an hour or so, read through a few of the students’ essays, discuss the merits and shortcomings of each one, and decide on the grade, or grade range, appropriate for the essay. In the course of my time as a teaching assistant and course instructor, I’ve heard people express the view that benchmarking is a waste of time, and I’ve also participated in benchmarking sessions in which the session was treated as an important and valuable exercise. I believe that benchmarking sessions are valuable, but not for the reasons usually given to justify the practice.
If this is right, then benchmarking is valuable, but not for the reason usually given.

Not everyone finds this reasoning persuasive, because not everyone thinks it’s unfair to have TAs employ different grading standards. It is pedagogically important that what students hear from the course instructor is consistent with what they hear from their teaching assistant; when instructors and TAs (appear to) give conflicting information or advice, the experience can be extremely frustrating and disorienting for students. TAs should do their best to implement any grading guidelines or rubrics that are provided by the course instructor and to amplify any information from the instructor about how to complete assignments. But in addition to doing these things, good TAs will often provide writing instruction and guidance that goes beyond what the students hear in lecture. Indeed, in the various TA training events I’ve attended at my institution and within my department, the unique role tutorials play in providing students with discipline-specific writing instruction is often emphasized. While we can expect that every TA will be reiterating in tutorial the assignment instructions provided by the instructor, different TAs may well emphasize different things when it comes to the additional writing support they provide for their students. One TA might have spent a lot of time having their students practice writing clear and simple prose, while another TA might have emphasized the importance of developing a philosophical dialectic in one’s essay. When it comes time for these TAs to grade the submissions they’ve received from their students, what could be more appropriate than to allow them to emphasize in their feedback to students the specific elements of philosophical writing that they’ve been working on in tutorial? This will mean that different TAs will weight different good-making features of philosophical writing differently, but that’s okay: there’s no single right way to write a philosophy paper, or to evaluate one. 
Far from being unfair, allowing different TAs to implement different standards of evaluation is a good practice, so long as the TA’s grading standards do not conflict with expectations laid out by the course instructor, and so long as they are not unfair in other ways. This reasoning suggests that it’s misguided to spend time trying to promote the goal of eliminating discrepancies in graders’ standards.

In Defense of Benchmarking
by Julia Smith

The usual format for an undergraduate philosophy paper is to have students choose to write on one of several prompts, or to complete a writing assignment with a set ‘scaffolding’. Among the student essays submitted in response to these kinds of assignments, it’s typical to see trends in the kinds of argumentative moves students make. Perhaps many of the submissions will discuss an objection that was of much interest in lecture. Perhaps many students will, on their own, come up with a particular objection that is informed by their background knowledge or common cultural assumptions (e.g. the ever-popular appeal to the subjectivity of taste, morals, or truth). Benchmarking is useful because it divides the cognitive labour of discerning what the common student strategies and pitfalls will be with respect to a particular assignment and provides a forum for discussing the philosophical merits of various specific moves that TAs will encounter as they grade.
This justification for benchmarking presupposes two things. First, it assumes that the benchmarking session will in fact be an effective way to better align graders’ standards. Second, it assumes that, in a large class with multiple graders, it is a desirable goal to eliminate any discrepancies between the standards of evaluation employed by different TAs. Both assumptions are doubtful.
Suppose, though, that relevant information were not limited in this way; suppose that each grader had read through all the student submissions prior to benchmarking. In that case, could the benchmarking session serve its intended purpose of calibration? I think it’s still unlikely. Graders’ standards of evaluation are functions of many factors. There are many good-making features of philosophical writing at the undergraduate level—accurate exposition, strong argumentation, good organization, clarity, creativity, originality, evidence of broad philosophical knowledge, following the assignment instructions, proper citation practices, etc.—and different graders might reasonably have different weightings of the relative importance of these features. While discussing a few student papers might raise (and dispatch) a few questions about how to weight some good-making features relative to one another, it’s not the case that discussing two or three papers in the span of an hour will be sufficient to answer all the questions about relative weightings that would need to be answered in order for different graders to converge on the same standard of evaluation.
[Donald Judd, chairs]

First: does benchmarking calibrate the graders? It’s far from clear that discussing two or three papers, read hastily—which is all that there is time for in a typical benchmarking session—will move TAs’ grading standards into greater alignment. Since benchmarking occurs before those participating in the benchmarking session have read any of the students’ submissions for the assignment, graders in a benchmarking session won’t be able to compare the papers chosen for the benchmarking session to other submissions. As all experienced graders know, the range in quality of student submissions for any given assignment can make a difference to the standard of evaluation that one uses. This means that the grading that is done in the benchmarking session is done in a vacuum; graders lack an important piece of contextual knowledge that would help them to determine what grade the papers under discussion ought to receive. After the benchmarking session, graders might reasonably determine that the grades decided on within the session were inaccurate in light of the information provided by reading a larger sample of student submissions—a fact that is, in my experience, often explicitly acknowledged by instructors in benchmarking sessions! It’s hard to see how deciding on grades in this context—where each grader might reasonably make their own changes to the grades that were decided on in the session—could effectively bring the respective graders’ standards into alignment.
I’ve given a couple of reasons to think that benchmarking sessions don’t fulfill their stated purpose of calibrating the graders’ standards (or that they do so only poorly). Still, perhaps a benchmarking session that calibrates the graders’ standards imperfectly is preferable to no benchmarking session at all. Some progress towards calibration is surely better than none. This brings us to the second assumption behind the standard justification for benchmarking: that in a class with multiple graders, it’s desirable to eliminate discrepancies between graders’ standards. This assumption is presumably motivated by considerations of fairness. If graders employ different standards, then some students enrolled in the course will get lower grades than others simply because of who happened to grade their paper. If benchmarking prevents this from happening, then benchmarking corrects injustices in the grading process, and is for that reason desirable.
In the following guest post, Julia Smith (University of Toronto) explains.
The standard justification for benchmarking is that the time spent collectively grading papers will serve to help ‘calibrate’ the respective grading scales of the graders. The thought is that prior to benchmarking, some graders will be disposed to be more exacting; others will be disposed to be more lenient. Benchmarking fixes this: assigning grades together encourages the harsh graders and the easy graders to meet in the middle.
There’s a reason for instructors to meet with their teaching assistants to grade some sample assignments together, but it’s not what you think.
So, if benchmarking is not valuable because it helps calibrate graders’ standards, why is it valuable? In my opinion, it’s because of its role in distributing cognitive labour: benchmarking speeds up the process of identifying argumentative trends that graders will encounter when grading and gives them a forum to talk over the merits of these argumentative moves.
One might object that my argument for benchmarking is condescending to TAs. TAs are competent to assess the merits of various argumentative moves on their own; they don’t need extra help thinking through which arguments are good and which ones aren’t. This objection both underestimates the value of collaboration and fails to consider the constraints on TAs’ time. My view is that benchmarking is valuable not only for (less experienced) TAs, but also for (more experienced) instructors. Everyone, no matter how competent or experienced, can come to better grasp and appreciate the merits of various argumentative moves through discussion. Part of the value of benchmarking is also that it saves time. In an ideal world, TAs would have ample time to read the set texts at their leisure. They would have time to map out the promising objections and possible replies, and to consider the merits of various argumentative moves. In reality, TAs often don’t have the luxury of extra time to devote to these tasks (i.e. they are not paid for this work), so an hour-long benchmarking session in which other minds help with this labour improves efficiency.