Why a Crowd-Sourced Peer-Review System Would Be Good for Philosophy (guest post)

Would “an online, crowd-sourced peer-review system” work better than traditional peer review as a “quality control device” in philosophy? In a paper forthcoming in The British Journal for the Philosophy of Science, three philosophers, Marcus Arvan (Tampa), Liam Kofi Bright (LSE), and Remco Heesen (Western Australia), argue for a positive answer to this question.
In the following guest post,* they lay out some of the main considerations in favor of the idea, which they discuss more fully in the paper itself, and address some objections to it.


Why a Crowd-Sourced Peer-Review System Would Be Good for Philosophy
by Marcus Arvan, Liam Kofi Bright, and Remco Heesen

Peer review is often thought to be an important form of quality control on academic research. But, assuming it is, what is the best form of peer review for this purpose? It appears to be widely assumed that peer review at academic journals is the best method. For example, hiring and tenure committees evaluate candidates on the basis of their publication record. But is peer review at journals really the best method for evaluating quality? We argue not. Using the Condorcet Jury Theorem, we contend that an online, crowd-sourced peer-review system similar to what currently prevails in math and physics is likely to perform better as a quality control device than traditional peer review.

We first argue that, if any form of peer review is to have any success at quality control, two conditions need to be satisfied. First, researchers in a given field must be competent at evaluating the quality of research. Second, for a given paper there must be some intersubjective agreement (however broad or vague) on what constitutes quality appropriate for that paper. If either of these assumptions were false, then no system of peer review could perform the form of quality control commonly attributed to it.

Based on these assumptions, we construct a series of arguments that a crowd-sourced approach is likely to evaluate the quality of academic research more reliably than traditional peer review. Our arguments are based on the Condorcet Jury Theorem, the famous mathematical finding that, provided individual evaluators judge independently and are each more likely than not to be right, a larger group of evaluators is far more likely to evaluate a proposition correctly than a smaller one. To see how, consider a jury of 100 people tasked with voting on whether p is true. Suppose that the average likelihood that any individual member will judge p rightly is slightly better than chance, or .51. The most likely outcome is that 51 members of the jury vote correctly and 49 do not. This means that it takes only one or two additional errant votes to tip the scales toward the majority judgment failing to evaluate p correctly—the incorrect verdict wins an outright majority with a probability of roughly .38. Now consider a jury of 100,000. If the average jury member’s accuracy remains .51, then the most likely result is 51,000 jury members voting correctly and 49,000 incorrectly. For the majority judgment to err, roughly 1,000 additional voters must err—which occurs with a probability of only about one in ten billion. In short, the Condorcet theorem demonstrates that a larger group of evaluators is more likely to evaluate something correctly than a smaller one.
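These figures can be checked directly against the binomial distribution. The short calculation below is an illustrative sketch rather than anything from the paper itself; it uses SciPy and counts the majority as erring only when the incorrect verdict wins an outright majority (strictly more than half of the votes).

```python
# Illustrative check of the Condorcet figures quoted above (a sketch, not from the paper).
from scipy.stats import binom

def probability_majority_errs(n_voters: int, p_correct: float) -> float:
    """Probability that strictly fewer than half of the voters judge correctly,
    i.e. that the incorrect verdict wins an outright majority."""
    # Correct votes ~ Binomial(n_voters, p_correct); the majority errs when the
    # number of correct votes is at most floor((n_voters - 1) / 2).
    return binom.cdf((n_voters - 1) // 2, n_voters, p_correct)

print(probability_majority_errs(100, 0.51))      # roughly 0.38
print(probability_majority_errs(100_000, 0.51))  # roughly 1e-10, about one in ten billion
```

Treating a tied vote as a failure of the majority as well would raise the first figure somewhat, but the qualitative point, that the chance of a majority error collapses as the group grows, is unaffected.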
Next, we assume that a crowd-sourced peer-review system could be expected to have a higher average number of reviewers per paper than traditional peer review. This is plausible because the number of reviewers who evaluate a given paper in journal review is minuscule: papers submitted to journals are standardly evaluated by an editor or two at the ‘desk-reject’ stage, and, if they pass this stage, they are normally sent to only one to three reviewers. We expect that an online, crowd-sourced system would involve many more people reviewing papers, particularly if a crowd-sourced peer-review website (built on top of preprint servers like arXiv or PhilPapers) incentivized reviewing.
We then provide three arguments using this theorem that a crowd-sourced peer-review system is likely to result in more reliable group judgments of paper quality than journal review. We argue that this follows irrespective of whether the crowd-sourced system involves (1) binary judgments (i.e., paper X is good/not good), (2) reviewer scores (i.e., evaluating papers on some scale, e.g., 1-100), or (3) qualitative reasons given by reviewers. Since peer review at journals standardly utilizes one or more of these measures of quality—as reviewers may be asked to render an overall judgment on a paper (accept/reject), rate a paper numerically, or write qualitative reviewer reports—it follows that a crowd-sourced peer-review system is likely to evaluate paper quality better than journal review.
Finally, we address a variety of objections, including logistical concerns about how an online, crowd-sourced system would work. First, we argue that ‘review bombing’ and trolling could be addressed in several ways, ranging from technological solutions (such as statistical software to detect and flag correlated votes) to human-based ones, including (but not limited to) initially anonymizing papers for some period of time, allowing reviewers or moderators to flag suspicious reviews, and distinguishing two types of reviewers with separate reviewer scores: expert reviewers and general reviewers. Second, to the common objection that journals are likely to select more reliable reviewers than a crowd-based system would have—since journals (particularly selective ones) may be likely to select the most highly established experts in a field as reviewers—we argue that a variety of findings cast doubt on this. Empirical studies of peer review indicate that interrater reliability among journal reviewers is barely better than chance and, moreover, that journal review is disproportionately conservative, preferring ‘safe’ papers over more ambitious ones. We suggest a variety of reasons for this: journals have incentives to avoid false positives (publishing bad papers); reviewers and editors have incentives to reject papers, given that a journal can accept only a few papers; well-established researchers have reasons to be biased in favor of the status quo; and small groups of reviewers who publish in the same area and attend conferences together may be liable to groupthink. These speculations are backed up by numerous examples in a variety of fields—including philosophy, psychology, and economics—of influential or otherwise prestigious papers (including Nobel Prize-winning economics papers) being systematically rejected by journals. We argue that, whatever biases exist in a crowd-sourced model, they are likely to be distributed more randomly. Hence, the combined judgment of crowd-sourced reviewers will be more reliable on average, not less.
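To make the first of these responses concrete, here is a minimal sketch of how statistical software might flag suspiciously correlated votes for human moderators. It is our illustration rather than a specification from the paper; the reviewer names, scores, and thresholds are hypothetical.

```python
# Illustrative sketch only: crudely flag pairs of reviewers whose scores move
# together suspiciously closely. All data below are hypothetical.
import numpy as np
import pandas as pd

# Rows: reviewers; columns: papers; values: scores on a 1-100 scale (NaN = did not review).
scores = pd.DataFrame(
    {
        "paper_1": [80, 78, 35, np.nan],
        "paper_2": [25, 27, 70, 60],
        "paper_3": [90, 88, 40, 55],
        "paper_4": [15, 18, 65, np.nan],
    },
    index=["rev_A", "rev_B", "rev_C", "rev_D"],
)

MIN_SHARED_PAPERS = 3   # only compare reviewers with enough papers in common
FLAG_THRESHOLD = 0.95   # near-perfect agreement is worth a human look

# Pairwise Pearson correlation between reviewers (transposing makes reviewers the
# columns; pandas drops missing values pairwise and respects min_periods).
corr = scores.T.corr(min_periods=MIN_SHARED_PAPERS)

flagged = [
    (a, b, round(float(corr.loc[a, b]), 3))
    for i, a in enumerate(corr.index)
    for b in corr.index[i + 1:]
    if pd.notna(corr.loc[a, b]) and corr.loc[a, b] >= FLAG_THRESHOLD
]
print(flagged)  # e.g. [('rev_A', 'rev_B', 1.0)]
```

Flagged pairs would not be penalized automatically; they would simply be queued for the kind of moderator attention described above.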
If we are correct, should peer review at journals disappear? We are agnostic about this (at least as a group), as the disciplines of math and physics already combine crowd-sourced peer review with journal review. Given that some are likely to remain skeptical of online reviews, we suspect that a Rottentomatoes-like crowd-sourced peer-review site—perhaps housed at PhilPapers or here—might complement rather than supplant peer-reviewed journals, in broadly the way that math and physics currently do: a ‘best of both worlds’ approach. Indeed, it would be interesting to compare how the two systems work concurrently.
Would a crowd-based peer-review system like the one we propose actually work in practice? Would enough people partake in it? Would reviews be thoughtful and evidence-based (reflecting reviewer competence) or incompetent? Could logistical problems (such as the kinds of things that have plagued Rottentomatoes.com) be overcome? We argue that the answers to these questions cannot be settled a priori, but that there are a number of reasons to be optimistic. Finally, we offer suggestions for how to ‘beta’ (and later tweak) our proposal. Only time will tell, but we believe that what we currently lack is not good reasons for attempting to create such a forum—our paper purports to show that there are good reasons to try. What we currently lack is the will to create such a system, and we hope that our paper contributes to building this will.