There’s a reason for instructors to meet with their teaching assistants to grade some sample assignments together, but it’s not what you think.

In the following guest post*, Julia Smith (University of Toronto) explains.

[Donald Judd, chairs]
In Defense of Benchmarking
by Julia Smith
At my institution ‘benchmarking’ is a common practice. For large courses with more than one teaching assistant, the course instructor will often plan a benchmarking session for each writing assignment, to be held before the TAs start grading. The TAs and course instructor get together for an hour or so, read through a few of the students’ essays, discuss the merits and shortcomings of each one, and decide on the grade, or grade range, appropriate for the essay. In my time as a teaching assistant and course instructor, I’ve heard people express the view that benchmarking is a waste of time, and I’ve also participated in benchmarking sessions that were treated as an important and valuable exercise. I believe that benchmarking sessions are valuable, but not for the reasons usually given to justify the practice.

The standard justification for benchmarking is that the time spent collectively grading papers helps ‘calibrate’ the graders’ respective grading scales. The thought is that prior to benchmarking, some graders will be disposed to be more exacting, others more lenient. Benchmarking fixes this: assigning grades together encourages the harsh graders and the easy graders to meet in the middle.
This justification for benchmarking presupposes two things. First, it assumes that the benchmarking session will in fact be an effective way to better align graders’ standards. Second, it assumes that, in a large class with multiple graders, it is a desirable goal to eliminate any discrepancies between the standards of evaluation employed by different TAs. Both assumptions are doubtful.
First: does benchmarking calibrate the graders? It’s far from clear that discussing two or three papers, read hastily—which is all there is time for in a typical benchmarking session—will move TAs’ grading standards into greater alignment. Since benchmarking occurs before participants have read any of the students’ submissions for the assignment, graders won’t be able to compare the papers chosen for the session to the other submissions. As all experienced graders know, the range in quality of student submissions for any given assignment can make a difference to the standard of evaluation one uses. This means that the grading done in a benchmarking session is done in a vacuum; graders lack an important piece of contextual knowledge that would help them determine what grade the papers under discussion ought to receive. After the session, graders might reasonably conclude that the grades decided on within it were inaccurate in light of the information provided by reading a larger sample of student submissions—a fact that is, in my experience, often explicitly acknowledged by instructors during benchmarking sessions! It’s hard to see how deciding on grades in this context, where each grader might reasonably make their own changes to the grades agreed on in the session, could effectively bring the respective graders’ standards into alignment.
Suppose, though, that relevant information were not limited in this way; suppose that each grader had read through all the student submissions prior to benchmarking. In that case, could the benchmarking session serve its intended purpose of calibration? I think it’s still unlikely. Graders’ standards of evaluation are functions of many factors. There are many good-making features of philosophical writing at the undergraduate level—accurate exposition, strong argumentation, good organization, clarity, creativity, originality, evidence of broad philosophical knowledge, following the assignment instructions, proper citation practices, etc.—and different graders might reasonably weight the relative importance of these features differently. While discussing a few student papers might raise (and dispatch) a few questions about how to weight some good-making features relative to one another, discussing two or three papers in the span of an hour will not be sufficient to answer all the questions about relative weightings that would need to be answered for different graders to converge on the same standard of evaluation.
I’ve given a couple of reasons to think that benchmarking sessions don’t fulfill their stated purpose of calibrating the graders’ standards (or that they do so only poorly). Still, perhaps a benchmarking session that calibrates the graders’ standards imperfectly is preferable to no benchmarking session at all; some progress towards calibration is surely better than none. This brings us to the second assumption behind the standard justification for benchmarking: that in a class with multiple graders, it’s desirable to eliminate discrepancies between graders’ standards. This assumption is presumably motivated by considerations of fairness. If graders employ different standards, then some students enrolled in the course will get lower grades than others simply because of who happened to grade their paper. If benchmarking prevents this from happening, then benchmarking corrects injustices in the grading process, and is for that reason desirable. Whether benchmarking can actually deliver this kind of fairness, however, is doubtful for the reasons already given: a session that does little to calibrate graders’ standards can do little to eliminate the discrepancies between them.
So, if benchmarking is not valuable because it helps calibrate graders’ standards, why is it valuable? In my opinion, it’s because of its role in distributing cognitive labour: benchmarking speeds up the process of identifying the argumentative trends that graders will encounter and gives them a forum to talk over the merits of these moves.
The usual format for an undergraduate philosophy paper is to have students choose to write on one of several prompts, or to complete a writing assignment with a set ‘scaffolding’. Among the student essays submitted in response to these kinds of assignments, it’s typical to see trends in the kinds of argumentative moves students make. Perhaps many of the submissions will discuss an objection that received much attention in lecture. Perhaps many students will, on their own, come up with a particular objection informed by their background knowledge or common cultural assumptions (e.g. the ever-popular appeal to the subjectivity of taste, morals, or truth). Benchmarking is useful because it divides the cognitive labour of discerning what the common student strategies and pitfalls will be for a particular assignment, and provides a forum for discussing the philosophical merits of the specific moves that TAs will encounter as they grade.
One might object that my argument for benchmarking is condescending to TAs: TAs are competent to assess the merits of various argumentative moves on their own; they don’t need extra help thinking through which arguments are good and which ones aren’t. This objection both underestimates the value of collaboration and fails to consider the constraints on TAs’ time. My view is that benchmarking is valuable not only for (less experienced) TAs, but also for (more experienced) instructors. Everyone, no matter how competent or experienced, can come to better grasp and appreciate the merits of various argumentative moves through discussion. Part of the value of benchmarking is also that it saves time. In an ideal world, TAs would have ample time to read the set texts at their leisure, to map out the promising objections and possible replies, and to consider the merits of various argumentative moves. In reality, TAs often don’t have the luxury of extra time to devote to these tasks (i.e. they are not paid for this work), so an hour-long benchmarking session in which other minds help with this labour improves efficiency.
If this is right, then benchmarking is valuable, but not for the reason usually given.