Does Referee Gender Make a Difference?


Once again, Jonathan Weisberg (Toronto), one of the managing editors of Ergo, looks at the journal’s data to see what, if anything, can be learned from it. This time, he focuses on what difference the gender of an article’s referee makes.

Professor Weisberg finds that at Ergo:

  1. Of those asked to referee, a greater percentage of men than women agree.
  2. Men, on average, complete referee reports in less time than women.
  3. Women referees recommend rejection or major revision more often than men do.*
  4. Men referees recommend acceptance or only minor revision more often than women do.*
  5. The gender of the referee makes no difference to whether the editor follows the referee’s recommendation.

* Regarding the differences reported in 3 and 4, Professor Weisberg says that a chi-square test of independence finds the differences are not statistically significant.

Professor Weisberg does not report whether referee recommendations differed by gender once the gender of a paper’s author was taken into account. For example, do men and women referees differ in their recommendations on papers written by women? That would be interesting to learn.

For the numbers and details, visit Professor Weisberg’s blog.

 

8 Comments
Tom Hurka
7 years ago

Re 3 and 4: have you looked at what difference the age of the referee makes? My impression from when I was editing a journal was that younger referees tend to be tougher than older ones, who more often recommend acceptance. If women referees are on average younger, because there’s been more hiring of women recently, that could help explain your (admittedly weak) correlations under 3 and 4.

Jonathan Weisberg
Reply to  Tom Hurka
7 years ago

Thanks, Tom, that’s a good point about juniority and severity. I’ve always wanted to test that bit of folk-wisdom, and I hope to look at seniority (and prestige) more generally down the road, if I can get the data.

But I want to stress that there may be nothing to explain here in the first place. Not only were the differences described in my post non-significant, the p-value was .31.

We can slice the data more coarsely, as Justin suggests: lumping together (Reject + Major Revisions) and (Minor Revisions + Accept). That drives the p-value down to .09. But that’s still not significant by the usual .05 standard.
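For readers who want to check this sort of arithmetic themselves, the lumped 2 × 2 version of the test can be sketched in a few lines of Python. The counts below are made-up placeholders for illustration, not Ergo’s actual data; with one degree of freedom, the Pearson chi-square p-value has an exact closed form via the complementary error function, so no stats library is needed:

```python
import math

def chi2_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table.

    Rows: referee gender; columns: lumped recommendation,
    (Reject + Major Revisions) vs (Minor Revisions + Accept).
    Returns (chi-square statistic, p-value). With one degree of
    freedom, p is exactly erfc(sqrt(chi2 / 2)).
    """
    row = [sum(r) for r in table]              # row totals
    col = [sum(c) for c in zip(*table)]        # column totals
    total = sum(row)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Hypothetical counts, for illustration only (NOT Ergo's numbers):
#            Reject+Major   Minor+Accept
#   men           40             35
#   women         25             12
chi2, p = chi2_2x2([[40, 35], [25, 12]])
print(f"chi2 = {chi2:.3f}, p = {p:.3f}")
```

(For tables bigger than 2 × 2, or if a continuity correction is wanted, `scipy.stats.chi2_contingency` does the same computation more generally.)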

Moreover, once we start playing the slice-and-dice game, we can slice things other ways, too. We could lump things into Reject vs. non-Reject, or we could lump Accept vs. non-Accept. Then we get p-values of .37 and .39, respectively.

So, however you slice it, this is a null result by the usual standard. (But somebody please check my arithmetic; I’m new at this.)

The particular slicing Justin describes does come close-ish, at .09. But I would think that merits at most a second look using another journal’s data, not a conclusion to hypothesize about. (Also, if I understand the math of chi-square tests correctly (a big “if”), that .09 estimate would likely be corrected upward by a more accurate method.)

Moreover, I’m no stats maven (said the formal epistemologist), but all this slicing and dicing makes me queasy (https://xkcd.com/882/). I gather it’s the sort of p-hackery I keep seeing real statisticians shake their heads at.

Maybe those who know more can speak to rationales for lumping things one way or another. Otherwise, my (admittedly rudimentary) understanding of NHST methodology leaves me wary of any lumpy analysis.

recent grad
7 years ago

Professor Weisberg,

If you ever have the time, I’d be curious whether there’s anything statistically significant happening in cases of conflicting recommendations. For example, do editors tend to side with prestigious, and therefore older, referees? Or do they tend to side with better/longer reports (which I would think tend to come from younger referees)? Or does the lowest recommendation always win out? Etc.

Jonathan Weisberg
Reply to  recent grad
7 years ago

Funny you should ask: I just drafted a post on this. It only addresses the last version of your question (it’s mainly “for entertainment purposes”). But I’ll share a link here when it’s ready—probably tomorrow.

recent grad
Reply to  Jonathan Weisberg
7 years ago

Cool. I look forward to it.

Carolyn Jennings
Reply to  Jonathan Weisberg
7 years ago

That’s a great post! Thanks!

Tim Kenyon
7 years ago

Some very interesting material; thanks very much for this, Jonathan and Justin. One demurral: as written, (3) and (4) are pretty stark generalizations about gender — hard to unsee, once seen. If the correlations are not significant (and, crikey, p = .31), why include them in the first place? Lots of empirical research suggests that the statements will have more lasting influence than the asterisks.