Why a Crowd-Sourced Peer-Review System Would Be Good for Philosophy (guest post)


Would “an online, crowd-sourced peer-review system” work better than traditional peer-review as a “quality control device” in philosophy? In a paper forthcoming in The British Journal for the Philosophy of Science, three philosophers, Marcus Arvan (Tampa), Liam Kofi Bright (LSE), and Remco Heesen (Western Australia), argue for a positive answer to this question.

In the following guest post,* they lay out some of the main considerations in favor of the idea that they discuss more fully in the paper itself, as well as address some objections to it.


Why a Crowd-Sourced Peer-Review System Would Be Good for Philosophy
by Marcus Arvan, Liam Kofi Bright, and Remco Heesen

Peer review is often thought to be an important form of quality control on academic research. But, assuming it is, what is the best form of peer review for this purpose? It appears to be widely assumed that peer review at academic journals is the best method. For example, hiring and tenure committees evaluate candidates on the basis of their publication record. But is peer review at journals really the best method for evaluating quality? We argue not. Using the Condorcet Jury Theorem, we contend that an online, crowd-sourced peer-review system similar to what currently prevails in math and physics is likely to perform better as a quality control device than traditional peer review.

We first argue that, if any form of peer review is to have any success at quality control, two conditions need to be satisfied. First, researchers in a given field must be competent at evaluating the quality of research. Second, for a given paper there must be some intersubjective agreement (however broad or vague) on what constitutes quality appropriate for that paper. If either of these assumptions were false, then no system of peer review could perform the form of quality control commonly attributed to it.

Next, we assume that a crowd-sourced peer-review system could be expected to have a higher average number of reviewers per paper than traditional peer review. This is plausible because the number of reviewers who evaluate a given paper in journal review is minuscule: papers submitted to journals are standardly evaluated by an editor or two at the ‘desk-reject’ stage, and if they pass this stage, they are normally sent to only one to three reviewers. We expect that an online, crowd-sourced system would involve many more people reviewing papers, particularly if a crowd-sourced peer-review website (built on top of preprint servers like arXiv or PhilPapers) incentivized reviewing.

Based on these assumptions, we construct a series of arguments that a crowd-sourced approach is likely to evaluate the quality of academic research more reliably than traditional peer review. Our arguments are based on the Condorcet Jury Theorem, the famous mathematical finding that, provided individual judgments are independent and better than chance, a large group of evaluators is far more likely to evaluate a proposition correctly than a small one. To see how, consider a jury of 100 people tasked with voting on whether p is true. Suppose that the likelihood that any individual member will judge p rightly is slightly better than chance, say .51. The most likely outcome is that 51 members of the jury vote correctly and 49 do not. It therefore takes only a couple of additional errant votes for the majority judgment to err, and this happens with a probability of about .38. Now consider a jury of 100,000. If each jury member’s accuracy remains .51, then the most likely result is 51,000 jury members voting correctly and 49,000 voting incorrectly. For the majority judgment to err, roughly 1,000 additional voters must err, which happens with a probability of about one in ten billion. In short, the Condorcet Jury Theorem shows that a large group of evaluators is far more likely to judge correctly as a group than a small one.
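These figures can be checked directly from the underlying binomial model. Here is a minimal sketch in Python; the function name and the convention of counting only strict-majority errors are ours, chosen to match the .38 figure above:

```python
import math

def majority_error_prob(n, p):
    """Probability that a strict majority of n independent voters errs,
    where each voter is correct with probability p.
    With X ~ Binomial(n, p), this is P(X < n/2), computed in log space
    to avoid underflow for large n."""
    log_p, log_q = math.log(p), math.log(1 - p)
    total = 0.0
    for k in range((n + 1) // 2):  # counts of correct votes short of a majority
        log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1)
                   - math.lgamma(n - k + 1) + k * log_p + (n - k) * log_q)
        total += math.exp(log_pmf)
    return total

print(majority_error_prob(100, 0.51))      # ~0.38
print(majority_error_prob(100_000, 0.51))  # ~1e-10, about one in ten billion
```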

We then provide three arguments using this theorem that a crowd-sourced peer-review system is likely to result in more reliable group judgments of paper quality than journal review. We argue that this follows irrespective of whether the crowd-sourced system involves (1) binary judgments (i.e., paper X is good/not good), (2) reviewer scores (i.e., evaluating papers on some scale, such as 1-100), or (3) qualitative reasons given by reviewers. Since peer review at journals standardly utilizes one or more of these measures of quality, as reviewers may be asked to render an overall judgment on a paper (accept/reject), rate a paper numerically, or write qualitative reviewer reports, it follows that a crowd-sourced peer-review system is likely to better evaluate paper quality than journal review.
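To illustrate how the first two kinds of crowd judgment might be pooled, here is a minimal sketch; the function and the aggregation rules (simple majority, simple mean) are our illustrative assumptions, not a specification from the paper:

```python
from statistics import mean

def aggregate_reviews(binary_votes, scores):
    """Pool two of the three measures of quality discussed above:
    binary_votes: list of bools, True meaning 'paper X is good';
    scores: list of numbers on, e.g., a 1-100 scale.
    Returns the majority verdict and the mean score."""
    verdict = sum(binary_votes) > len(binary_votes) / 2
    return verdict, mean(scores)

votes = [True, True, False, True, True, False, True]
scores = [78, 85, 40, 71, 90, 55, 66]
print(aggregate_reviews(votes, scores))  # (True, 69.28...)
```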

Finally, we address a variety of objections, including logistical concerns about how an online, crowd-sourced system would work. First, we argue that ‘review bombing’ and trolling could be addressed in several ways, ranging from technological solutions (such as statistical software to detect and flag correlated votes) to human-based ones: anonymizing papers for some initial period of time, allowing reviewers or moderators to flag suspicious reviews, and distinguishing two types of reviewers with separate reviewer scores, expert reviewers and general reviewers. Second, consider the common objection that journals are likely to select more reliable reviewers than a crowd-based system would have, since journals (particularly selective ones) may select the most highly established experts in a field as reviewers. We argue that a variety of findings cast doubt on this. Empirical studies of peer review indicate that interrater reliability among journal reviewers is barely better than chance, and moreover that journal review is disproportionately conservative, preferring ‘safe’ papers over more ambitious ones. We suggest a variety of reasons for this: journals have incentives to avoid false positives (publishing bad papers); reviewers and editors have incentives to reject papers, given that a journal can accept only a few; well-established researchers have reasons to be biased in favor of the status quo; and small groups of reviewers who publish in the same area and attend conferences together may be liable to groupthink. These speculations are backed up by numerous examples in a variety of fields, including philosophy, psychology, and economics, of influential or otherwise prestigious papers (including Nobel Prize-winning economics papers) being systematically rejected by journals. We argue that, whatever biases exist in a crowd-sourced model, they are likely to be distributed more randomly. Hence, the combined judgment of crowd-sourced reviewers will be more reliable on average, not less.
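To give a flavor of the technological side, here is a minimal sketch of the kind of statistical screening for correlated votes gestured at above; the agreement measure, thresholds, and data layout are all illustrative assumptions of ours:

```python
from itertools import combinations

def flag_correlated_reviewers(reviews, min_overlap=10, threshold=0.95):
    """reviews: dict mapping reviewer id -> {paper id: verdict}.
    Flags pairs of reviewers whose verdicts agree suspiciously often
    on the papers they both reviewed, a crude proxy for coordinated
    voting or review bombing."""
    flagged = []
    for a, b in combinations(reviews, 2):
        shared = reviews[a].keys() & reviews[b].keys()
        if len(shared) < min_overlap:
            continue  # too little overlap to judge this pair
        agreement = sum(reviews[a][p] == reviews[b][p] for p in shared) / len(shared)
        if agreement >= threshold:
            flagged.append((a, b, round(agreement, 2)))
    return flagged
```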

If we are correct, should peer review at journals disappear? We are agnostic about this (at least as a group), since the disciplines of math and physics combine crowd-sourced peer review with journal review. Given that some are likely to remain skeptical of online reviews, we suspect that a Rotten Tomatoes-like crowd-sourced peer-review site, perhaps housed at PhilPapers or here, might complement rather than supplant peer-reviewed journals, in broadly the way that math and physics currently combine the two: a ‘best of both worlds’ approach. Indeed, it would be interesting to compare how the two systems work concurrently.

Would a crowd-based peer-review system like the one we propose actually work in practice? Would enough people partake in it? Would reviews be thoughtful and evidence-based (reflecting reviewer competence) or incompetent? Could logistical problems (such as those that have plagued Rottentomatoes.com) be overcome? We argue that answers to these questions cannot be settled a priori, but that there are a number of reasons to be optimistic. Finally, we offer suggestions for how to ‘beta’ (and later tweak) our proposal. Only time will tell, but we believe that what we currently lack are not good reasons for attempting to create such a forum—as our paper purports to show that there are good reasons to try. What we currently lack is the will to create such a system, and we hope that our paper contributes to building this will.

21 Comments
David Wallace
2 years ago

I think the comparison to physics might be misleading. Physics doesn’t have any formal crowd-sourced peer review: arxiv.org doesn’t have a “rate my paper” function, for instance. What it does have is a pretty systematic culture of posting to arxiv at, before, or (occasionally) instead of submitting to a journal.

If the proposal is that we should develop that culture too, I’m all for it, and it would be logistically very easy to do given our existing archives (PhilPapers and philsci-archive). Indeed, to some extent we already have that culture in philosophy of physics: a respectable fraction of people, including me, post their papers physics-style. (And I’ve done so since I was a grad student.) If anything, I think preprint submission has decreased among junior people in my field, and I’d love to see it reversed.

If the proposal is to develop a formal, aggregative system to rank preprints, I’m ambivalent about it, but at any rate it would be going well beyond what physics does.

junior philosopher of physics
Reply to  David Wallace
2 years ago

As a junior philosopher of physics with a mixed record of uploading preprints: I am generally wary of posting preprints for anything that I intend to possibly submit to a triple-blind journal, for fear of (increasing the odds of) losing the best relevant editor at that journal. Since BJPS, in particular, is one of these journals, there is often at least one such journal on the list for any given manuscript. So, I feel this chilling effect most of the time — but it only takes the one journal to induce it.

David Wallace
Reply to  junior philosopher of physics
2 years ago

Does BJPS’s triple-blind policy extend to getting an editor to recuse if they in fact think they know who the author is? If so, that’s pretty extreme.

I have long thought that even double-blind refereeing (at least in small fields like ours) ought to be thought of as formal rather than substantive: the journal shouldn’t reveal the author’s name to the referee, but it shouldn’t police whether the referee in fact recognizes the author.

BJPS’s own policy for referees is that if you only think you recognize the author, don’t worry about it (you’re likely to be wrong) but you should tell them if you know who the author is. My own approach is to interpret ‘know’ extremely narrowly (‘I’m pretty sure I heard a talk on this at a workshop last year, and I’m pretty sure the speaker was X, but would I bet my children’s lives on it?’) – otherwise, I’d do about half as much refereeing as I do.

I think excessive focus on double- and triple-blind refereeing is harmful to the field precisely because it does discourage people from using archives. But this particular reason for being discouraged is worrying to me. I might ask someone at BJPS about it when I have the chance.

junior philosopher of physics
Reply to  David Wallace
2 years ago

I think your question was primarily rhetorical/outward directed, but in case it was indeed a question specifically back to me:

I have no idea to what extent my worry is an overreaction. As I see it, there are two bits to fiddle with: how helpful/accurate/whatever the referee process is when overseen by a non-ideal-fit editor (normalized against the ideal-fit editor), and how likely the paper is to end up with a non-ideal-fit editor. I happen (whether for good reason or not!) to think the first of these is pretty bad — enough that I read the risk of recusal, on one reading of the triple-masked policy, as itself already risk enough.

So, all said: it would be encouraging (and very relevant to my preprint archive habits) to learn that the latter isn’t actually much of a risk at all.

Filippo Contesi
Reply to  junior philosopher of physics
2 years ago

As we say in our announcement for Freelosophy, a site I recently co-launched where crowd-reviewing of PhilArchive papers is already possible (see https://dailynous.com/2022/01/10/new-site-for-publicly-commenting-on-philosophy-papers/ ), journal publishing houses are in large part already legally/publicly committed to considering pre-archived papers. So are, as far as I can tell, most if not all individual philosophy journals published by those publishing houses. Moreover, if philosophers as a group wanted crowd-reviewing as well as anonymity, there would be technical ways to combine crowd-reviewing with journal-review anonymity (provided sufficient investment were put into the technical infrastructure).

David Bourget
2 years ago

Nice to see the virtues of crowd-reviewing rigorously expounded. This is something that I’ve been thinking about as a possible PhilPapers project. I continue to think about it, though these projects have taken a back seat with COVID-induced disruptions.

H. N. Torrance
2 years ago

I am dubious that we should ‘crowd-source’ assessments that involve expertise that only a minority in the crowd will have. For example: even if the Condorcet jury theorem does well in cases where people have broadly the same background knowledge (e.g., how many marbles are in the jar), it would be a mistake to crowd-source questions about whether some theory in quantum physics should be published (or: imagine deciding whether to publish Andrew Wiles’ FLT proof by crowdsourcing it… even though only a small handful of people even understood it). But what we say for quantum physics and math should presumably go also for philosophy (e.g., crowdsourcing the latest argument for what grounds facts about grounding, etc.).

Marcus Arvan
Reply to  H. N. Torrance
2 years ago

Hi H.N.: it’s worth noting here that we build components into our proposal to address this, including mechanisms for identifying pools of expert reviewers in different subfields and reporting their scores and reviews separately from non-experts. We argue that there are advantages in providing both types of scores simultaneously. Yes, experts are experts. But experts can also be subject to groupthink and have dubious assumptions pointed out by relative outsiders. So, having both types of reviewers in a crowd-based system corrects for/balances the respective epistemic merits of both types of reviewer (and better, we argue, than journal peer review alone).

Kenny Easwaran
Reply to  H. N. Torrance
2 years ago

I actually think the math case cuts in the opposite direction from the one you suggest. Wiles’s FLT paper in particular is one where, after a reviewer caught a mistake in the first version, Wiles circulated drafts to ensure that his fix worked before finally submitting it for publication.

https://en.wikipedia.org/wiki/Wiles%27s_proof_of_Fermat%27s_Last_Theorem#Announcement_and_subsequent_developments

I think the crowdsourcing idea is that as soon as you submit the paper, it’s already published (i.e., made public) – this is essential for crowdsourcing to even begin. Then, after it’s been public for a while and reviews have come in, the journal decides whether or not to put its stamp on it.

John Huckel
2 years ago

‘Would “an online, crowd-sourced peer-review system” work better than traditional peer-review as a “quality control device” in philosophy?’

This is the first sentence of this article. I think it is the wrong question. This would be the question I would ask: Would “an online, crowd-sourced peer-review system” enhance the traditional peer-review system?

And even if the answer were a tepid ‘maybe’, I think it is worth a shot. What have we got to lose? Not much! And the potential gains could fundamentally change the dynamics of innovation in theoretical thinking.

Now that we all agree that such a system would potentially be beneficial, let me refer to the last statement of the article to drill to the crux of the issue as it stands: ‘…[W]hat we currently lack are not good reasons for attempting to create such a forum—as our paper purports to show that there are good reasons to try. What we currently lack is the will to create such a system, and we hope that our paper contributes to building this will.’

Luckily, my system—The Matrix-8 Solution—will handle the logistical problems traditionally besetting democratic processes in large groups. It should put Condorcet at ease. Not only does it allow for large numbers of evaluators – my system solves the Democratic Trilemma to boot! And in Trusted Reputation, it solves the longstanding problem of differentiating between bots, trolls, and honorable participants. It is currently under development as the governance system for an up-and-coming cryptocurrency. It could easily be tweaked to fit your proposed forum’s needs.

You can see the dynamic of Trusted Reputation here, and the full White Paper for the system here

John Devereaux
2 years ago

I admire the authors’ attempt to shake things up here – but there’s an underlying worry I’ve got about this. Or maybe it’s more like several related worries.

  1. The proposal doesn’t leave any space that we would presumably want to leave between overall popularity and good philosophy/science. In the old days, it would not be uncommon for views to clearly pass muster by the standards for good science (for example, Galileo’s thinking) while remaining unpopular on the whole. Such space is entirely coherent. The proposal here makes such space incoherent, by subsuming the former kind of standards under the latter.
  2. The view holds our philosophical theories hostage to mob rule, and the mechanisms of the mob (e.g., contagion) are turbulent. Example: an argument that should clearly be published whips up a frenzy on blogs or Facebook, and the mob, whipped into such a frenzy, downvotes the paper, preventing its acceptance.
  3. By identifying the ‘good’ (by way of philosophy) with the ‘popular’, we risk suppressing good philosophy that has features that don’t tend to make it popular. I am thinking here of Williamson’s remark that boring philosophy is often the best. It might not be popular.
Kenny Easwaran
2 years ago

I should probably read the more formal version, but I do worry about the use being made of the Condorcet jury theorem. That theorem essentially requires that the individual jury members are better than 50/50 at judging individually, and that their errors are uncorrelated. In the traditional peer review process, if there are two reviewers, they make their judgments independently, without seeing each other’s reviews, which helps ensure this. But depending on how a crowd-sourced peer review works, I worry that reviewers would see each other’s ratings before making their own. This is exactly the sort of situation that often leads to information cascades, where people who have good individual evidence suppress it in favor of the evidence that has already been publicly presented by others. This would undercut the Condorcet jury theorem: if every reviewer after the first one or two were swayed by those early ratings (often all it takes to get a cascade going), the system would reduce to something like standard two-referee anonymous review.
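A minimal simulation of the cascade dynamic described above; the parameters are illustrative, and the reviewers here follow a simple count-the-votes heuristic rather than full Bayesian updating:

```python
import random

def sequential_reviews(n_reviewers=50, signal_accuracy=0.6):
    """One run: the paper is in fact good; each reviewer privately gets
    an independent signal that is correct with probability signal_accuracy,
    sees all earlier public votes, and votes with the majority of
    (earlier votes + own signal), breaking ties with their own signal.
    Returns True if the final majority verdict is wrong."""
    votes = []
    for _ in range(n_reviewers):
        signal = random.random() < signal_accuracy  # True = 'good' (correct)
        yes = sum(votes) + signal
        no = len(votes) + 1 - yes
        votes.append(yes > no if yes != no else signal)
    return sum(votes) <= n_reviewers / 2  # majority verdict wrong?

def independent_reviews(n_reviewers=50, signal_accuracy=0.6):
    """Same setup, but each reviewer just votes their own private signal."""
    votes = [random.random() < signal_accuracy for _ in range(n_reviewers)]
    return sum(votes) <= n_reviewers / 2

trials = 10_000
print(sum(sequential_reviews() for _ in range(trials)) / trials)   # ~0.3: cascades lock in early errors
print(sum(independent_reviews() for _ in range(trials)) / trials)  # ~0.1: Condorcet at work
```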

Marcus Arvan
Reply to  Kenny Easwaran
2 years ago

We address this on p. 18 and section 7.3.

Maggie
2 years ago

I haven’t read your paper but, based on the insufficient information I have, I would vote against its publication, which raises the question, “How do you get enough people to read all those papers that need to be evaluated?” Wouldn’t the vast majority of up or down votes or scores or even qualitative remarks be entered by evaluators who haven’t even read the paper in its entirety? And wouldn’t some papers just be largely ignored by the crowd?

I think your proposal would be great for people who network well or who have a large social media presence or who are members of prestigious departments or who are well-known or well-liked within the profession. It looks like a prescription for disaster for the less popular and less prestigious. Or is there some way to ensure genuinely blind review by the crowd?

ajkreider
2 years ago

Given the problems with current peer-review, it’s worth a shot, but I too have concerns about the logistics. I’m particularly worried about the grad student at Directional State U, or really anyone from a small department.

I would expect that such a system is going to get swamped with papers, including from a whole mess of cranks. A bunch of good papers will get lost in the swamp. But this is going to benefit students (and profs) at large institutions, because they can walk the halls and say, “Hey, I’ve got a new paper up”, and anonymity won’t help that at all. There will be all sorts of social pressure for faculty and students at those institutions to give positive reviews (some might even find it obligatory). And the student of the star professor can enlist faculty from a broader pool as well. The small-department paper will barely get a look, regardless of quality.

One could make the reviewers’ names public, so that people won’t want to attach their names to marginal papers, but this will just ensure that good reviewers won’t bother to review. Relatedly, anonymous submission will encourage the submission of incomplete papers, which people won’t want to review. As reviewers know, it takes some work. When one agrees to review, one takes on an obligation to give reasonable and thorough comments. Where will such an obligation come from with such a database?

There will of course be lots of people willing to weigh in on someone’s paper, grad students ready to tear it to shreds in order to prove their philosophical chops.

There will also be issues around social media. Phil Twitter has its stars, and they will be besieged with requests to talk about papers, of which they could take up only a few. But those papers will then be raised to the stratosphere, because they will get so many other eyeballs. Competition will then ensue for who gets to be in the stars’ good graces.

The ONLY place the no-name philosopher/student will have anything like an equal shot at getting their work looked at and engaged with is with anonymous peer review at a journal.

(The above is obviously speculative, but that doesn’t make it a priori.)

Neil Sinhababu
2 years ago

Is this a system that a single journal could set up if it had a suitable jury of qualified reviewers? If so, perhaps some innovative journal editor could be persuaded to try it out.

Craig
2 years ago

I am mostly curious about this:

We expect that an online, crowd-sourced system would involve many more people reviewing papers, particularly if a crowd-sourced peer-review website (built on top of preprint servers like arXiv or PhilPapers) incentivized reviewing.

1) Without incentivizing, wouldn’t the number of reviews maybe go down (because there are no longer editors needing to cajole students)?

2) What is the incentivizing scheme?

3) Why couldn’t we apply the incentivizing scheme to the current regime?

I should read the article; but also these very comments serve as a test case, in a sense, so I’ll see whether things are resolved in the comments here in a way that tempts me to actually read the full article. Ha!

Me again
Reply to  Craig
2 years ago

Needing to cajole *reviewers*

Craig
Reply to  Craig
2 years ago

I’ve now read through the essay itself. I am deeply skeptical of the confidence that the number of quality reviews would substantially increase. I suspect that the number of substantive reviews written will decrease. And even if there is a minor increase, then supposing there is a Condorcet effect, that too will be minor.

The real challenge, under both the existing system and the public system, is figuring out how to get eyeballs onto the ever-increasing volume of submissions, when giving a charitable, careful read is very burdensome in terms of time. I’m not going to dramatically increase my refereeing to be “Top Reviewer”—whereas directed requests from peers (“Hi, Craig, I know you work on X, would you mind looking at this paper?”) are psychologically efficacious.

Filippo Contesi
2 years ago

Crowd-reviewing PhilArchive papers is already possible now at: https://freelosophy.github.io/ . See an earlier Daily Nous post for more info: https://dailynous.com/2022/01/10/new-site-for-publicly-commenting-on-philosophy-papers/ .

Dan Lowe
2 years ago

The landscape of discussing philosophy with a more general audience is arguably more fruitful for one’s personal reflection already, and certainly more valuable for understanding what undergraduates from other disciplines might benefit from. But it is increasingly (and perhaps ironically) useless for students within the major, or for those seeking out Philosophy for insight found nowhere else in popular culture. Anecdotally, from my own experience as a philosophy graduate, the framing of Philosophy by general audiences can be extremely frustrating, because it elevates shallow, uninformed but nevertheless dismissive and condescending narratives that often obscure and betray the very contributions made by philosophers historically. The oddity of a tightly tended field is one of the very things that preserves unique perspectives that can offer literally life-saving realizations, one of Philosophy’s most essential qualities.

Given the rising popularity of gender studies (which was also my own focus in the mid ’00s) and the emotional/personal-political burden that comes with navigating its contemporary peer pressures, I don’t think it’s an exaggeration to say I fear how crowdsourcing would elevate more normative conclusions, and how that might impact the mental health of students already struggling to find solid ground from which to negotiate their own queer or countercultural development. The Philosophy classroom, and moreover the Philosophy major, is a harbor for unpopular ideas, and the nature of crowdsourcing is very much a popularity contest. The field is not merely a laboratory for innovating our understanding of cognitive behavior, but a cultural failsafe to check the all too rapid tendency to get swept up in cultural revolutions, especially given the Internet’s ability to produce these ‘revolutions’ and then just as suddenly grow bored of them.

Anyone who’s ever deleted a tweet out of fear of offending their own genuinely precious but also fragile networks knows what I’m talking about. The crowd is where novel ideas go to die. Not because of what people think, but because of the psychological tendency to dwell on what we think people will think. An engineer or chemist is often an intermediary for their instruments, those that exist or can be affordably accessed, and those that have yet to be realized. While a philosopher is very much ‘just a girl, standing in front of a boy, asking him to love her.’

This is a nebulous topic and, as always, it stretches the scope of the present discussion, but the same characteristics that can lead one to conclude that formal Philosophy is ‘out of touch’ also contribute to its timelessness, its resilience in the face of merely transitional trends. The challenging nature of academic Philosophy, rooted in its formally derived foundations, guarantees a safe space for personal introspection in an ocean of shallow, cruel, knee-jerk criticism. The tough love of a rigorously structured framework, derived from decades and centuries of insular safe spaces, strengthens your resolve to navigate the fallacies of public life.

As they’d say in Portland, Edmonton or Missoula: Keep Philosophy Weird.