Gender Gap in Natural Language Processing Research: Disparities in Authorship and Citations

Saif M. Mohammad
DOI: https://doi.org/10.48550/arXiv.2005.00962
2020-09-04
Abstract: Disparities in authorship and citations across gender can have substantial adverse consequences not just on the disadvantaged genders, but also on the field of study as a whole. Measuring gender gaps is a crucial step towards addressing them. In this work, we examine female first author percentages and the citations to their papers in Natural Language Processing (1965 to 2019). We determine aggregate-level statistics using an existing manually curated author-gender list as well as first names strongly associated with a gender. We find that only about 29% of first authors are female and only about 25% of last authors are female. Notably, this percentage has not improved since the mid-2000s. We also show that, on average, female first authors are cited less than male first authors, even when controlling for experience and area of research. Finally, we discuss the ethical considerations involved in automatic demographic analysis.
Digital Libraries, Computation and Language
What problem does this paper attempt to address?
The paper addresses the gender gap in natural language processing (NLP) research. Analyzing NLP papers published from 1965 to 2019, the author examines three aspects:

1. **Proportion of female authors**: The share of women among first authors and last authors, and how those shares have changed over time.
2. **Citation differences**: Whether papers with female first authors are cited as often as papers with male first authors, and whether any difference persists after accounting for the researchers' experience (academic age) and research area.
3. **Ethical considerations**: The ethical issues raised by automatic demographic analysis, especially gender inference.

### Main findings

- **Proportion of female authors**: Overall, approximately 29% of first authors and 25% of last authors were women. These proportions have not improved significantly since 2006.
- **Citation differences**: Papers with female first authors were cited fewer times on average than papers with male first authors (37.6 vs. 50.4 citations), and the difference was statistically significant.
- **Changes over time**: In the early period (1965-1989), papers with female first authors were cited more frequently. From the 1990s onward, citations of papers with male first authors increased significantly. After 2000, the gap narrowed.
- **Impact of academic age**: Papers with female first authors were cited fewer times than papers with male first authors at every academic age, indicating that the citation gap cannot be explained solely by a higher proportion of new researchers among women.

### Discussion

- **Impact of gender gap**: The gender gap is detrimental not only to the affected gender groups but also to the research field as a whole. Improving gender balance can lead to higher productivity, better health and well-being, greater economic benefits, better decision-making, and political and economic stability.
- **Ethical issues**: Automatically inferring an individual's gender may cause harm, so the author emphasizes that the work does not aim to infer the gender of individual authors, but to determine aggregate statistics from the association between names and genders.

Through these analyses, the paper hopes to raise awareness of the gender gap and to inspire concrete measures that improve inclusiveness and fairness in the research field.
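To make the aggregate-level approach concrete, here is a minimal Python sketch of how a female first-author percentage and mean citations per academic-age bin might be computed from a name-gender association list. It is not the paper's code or data: the `NAME_GENDER` list, the `PAPERS` records, the bin edges in `age_bin`, and the `aggregate` function are all hypothetical and chosen only for illustration. In line with the ethical point above, it reports only aggregates and leaves out names that are not strongly associated with a gender.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical first-name -> gender association list (illustrative only).
NAME_GENDER = {"maria": "F", "anna": "F", "john": "M", "david": "M"}

# Hypothetical records: (first author's first name, citations, academic age in years).
PAPERS = [
    ("Maria", 12, 2),
    ("John", 30, 2),
    ("Anna", 40, 12),
    ("David", 55, 15),
    ("Sam", 8, 1),   # name not in the list, so excluded from the aggregates
    ("John", 10, 6),
]


def age_bin(age):
    """Map academic age to a coarse bin (the bin edges are an assumption)."""
    if age < 5:
        return "0-4"
    if age < 10:
        return "5-9"
    return "10+"


def aggregate(papers, name_gender):
    """Return (female first-author percentage, mean citations per (gender, age bin))."""
    counts = defaultdict(int)
    cites = defaultdict(list)
    for first_name, n_cites, age in papers:
        gender = name_gender.get(first_name.lower())
        if gender is None:
            continue  # skip names without a strong gender association
        counts[gender] += 1
        cites[(gender, age_bin(age))].append(n_cites)
    total = sum(counts.values())
    pct_female = 100.0 * counts["F"] / total if total else float("nan")
    mean_cites = {key: mean(values) for key, values in cites.items()}
    return pct_female, mean_cites


if __name__ == "__main__":
    pct, means = aggregate(PAPERS, NAME_GENDER)
    print(f"Female first authors: {pct:.1f}%")
    for (gender, bin_label), value in sorted(means.items()):
        print(f"{gender} {bin_label}: {value:.1f} mean citations")
```

Comparing means within the same academic-age bin, rather than across all papers, is one simple way to check whether a citation gap persists once experience is held roughly constant.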