Predicting the future impact of Computer Science researchers: Is there a gender bias?

Matthias Kuppler
DOI: https://doi.org/10.1007/s11192-022-04337-2
IF: 3.801
2022-04-07
Scientometrics
Abstract: The advent of large-scale bibliographic databases and powerful prediction algorithms led to calls for data-driven approaches for targeting scarce funds at researchers with high predicted future scientific impact. The potential side-effects and fairness implications of such approaches are unknown, however. Using a large-scale bibliographic data set of N = 111,156 Computer Science researchers active from 1993 to 2016, I build and evaluate a realistic scientific impact prediction model. Given the persistent under-representation of women in Computer Science, the model is audited for disparate impact based on gender. Random forests and gradient boosting machines are used to predict researchers' h-index in 2010 from their bibliographic profiles in 2005. Based on model predictions, it is determined whether the researcher will become a high-performer with an h-index in the top 25% of the discipline-specific h-index distribution. The models predict the future h-index with an accuracy of $R^2 = 0.875$ and correctly classify 91.0% of researchers as high-performers and low-performers. Overall accuracy does not vary strongly across researcher gender. Nevertheless, there is an indication of disparate impact against women. The models under-estimate the true h-index of female researchers more strongly than the h-index of male researchers. Further, women are 8.6% less likely to be predicted to become high-performers than men. In practice, hiring, tenure, and funding decisions that are based on model predictions risk perpetuating the under-representation of women in Computer Science.
information science & library science,computer science, interdisciplinary applications
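
The quantities named in the abstract (the h-index, a top-25% cutoff, and a gender gap in predicted high-performer rates) can be illustrated with a minimal sketch. This is not the paper's pipeline: the function names, the toy data, and the choice to take the cutoff from the predicted values themselves are assumptions made for illustration only, and the abstract does not say how its 8.6% figure was computed.

```python
from statistics import quantiles


def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


def high_performer_rate(predicted_h, groups, group, threshold):
    """Share of one group whose predicted h-index reaches the threshold."""
    values = [h for h, g in zip(predicted_h, groups) if g == group]
    return sum(h >= threshold for h in values) / len(values)


# Hypothetical toy data, not taken from the paper.
predicted_h = [2, 3, 5, 8, 1, 4, 9, 6]
groups = ["f", "m", "f", "m", "f", "m", "m", "f"]

# Assumption: the 75th percentile of this toy distribution stands in for
# the discipline-specific top-25% cutoff described in the abstract.
threshold = quantiles(predicted_h, n=4)[2]

rate_f = high_performer_rate(predicted_h, groups, "f", threshold)
rate_m = high_performer_rate(predicted_h, groups, "m", threshold)

# A positive gap means women are predicted to be high-performers less often.
gap = rate_m - rate_f
print(f"threshold={threshold}, rate_f={rate_f}, rate_m={rate_m}, gap={gap}")
```

Comparing predicted selection rates across groups is one common way to check for disparate impact; an audit along the lines described in the abstract would also compare prediction errors (under- or over-estimation of the true h-index) by gender.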