An Empirical Comparison of Bias Reduction Methods on Real-World Problems in High-Stakes Policy Settings

Hemank Lamba, Kit T. Rodolfa, Rayid Ghani
DOI: https://doi.org/10.48550/arXiv.2105.06442
2021-05-14
Abstract: Applications of machine learning (ML) to high-stakes policy settings -- such as education, criminal justice, healthcare, and social service delivery -- have grown rapidly in recent years, sparking important conversations about how to ensure fair outcomes from these systems. The machine learning research community has responded to this challenge with a wide array of proposed fairness-enhancing strategies for ML models, but despite the large number of methods that have been developed, little empirical work exists evaluating these methods in real-world settings. Here, we seek to fill this research gap by investigating the performance of several methods that operate at different points in the ML pipeline across four real-world public policy and social good problems. Across these problems, we find a wide degree of variability and inconsistency in the ability of many of these methods to improve model fairness, but post-processing by choosing group-specific score thresholds consistently removes disparities, with important implications for both the ML research community and practitioners deploying machine learning to inform consequential policy decisions.
Machine Learning, Computers and Society
What problem does this paper attempt to address?
This paper asks how machine learning models deployed in high-stakes policy settings can reduce bias and produce fair outcomes. Although many fairness-enhancing methods have been proposed, their effectiveness in real-world settings has rarely been evaluated. The paper aims to fill this gap empirically by evaluating several bias-reduction strategies that operate at different stages of the machine learning pipeline (pre-processing, in-processing, post-processing) across four real-world public policy and social good problems. The study finds wide variability and inconsistency in the ability of many of these methods to improve model fairness, whereas post-processing by choosing group-specific score thresholds consistently removes disparities, a result with important implications for both the machine learning research community and practitioners.
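
To make the idea of group-specific score thresholds concrete, here is a minimal sketch in Python. It is not the authors' implementation. It assumes the disparity of interest is a gap in recall (true positive rate) between groups, which the summary above does not specify. The function names `group_thresholds` and `recall_by_group` and the toy data are hypothetical.

```python
import numpy as np

def group_thresholds(scores, labels, groups, target_recall):
    """Per group, pick the highest score threshold whose recall reaches target_recall.

    Assumption: fairness is measured as equal recall across groups.
    """
    thresholds = {}
    for g in np.unique(groups):
        # Scores of this group's true positives, highest first.
        pos = np.sort(scores[(groups == g) & (labels == 1)])[::-1]
        if pos.size == 0:
            raise ValueError(f"group {g!r} has no positive labels")
        # Number of positives that must be selected to reach the target recall.
        k = max(int(np.ceil(target_recall * pos.size)), 1)
        thresholds[str(g)] = float(pos[k - 1])
    return thresholds

def recall_by_group(scores, labels, groups, thresholds):
    """Recall within each group when selecting scores >= that group's threshold."""
    recalls = {}
    for g, t in thresholds.items():
        positives = (groups == g) & (labels == 1)
        recalls[g] = float(np.mean(scores[positives] >= t))
    return recalls

# Toy data (hypothetical): group "b" receives systematically lower scores.
scores = np.array([0.95, 0.9, 0.8, 0.7, 0.6, 0.5,
                   0.55, 0.45, 0.4, 0.3, 0.2, 0.1])
labels = np.array([1, 1, 1, 0, 1, 0,
                   1, 1, 0, 1, 1, 0])
groups = np.array(["a"] * 6 + ["b"] * 6)

# One shared threshold leaves a large recall gap between the groups.
shared = recall_by_group(scores, labels, groups, {"a": 0.5, "b": 0.5})

# Group-specific thresholds chosen for the same target recall close the gap.
per_group = group_thresholds(scores, labels, groups, target_recall=0.5)
adjusted = recall_by_group(scores, labels, groups, per_group)
```

In this toy example the shared threshold of 0.5 gives group "a" a recall of 1.0 and group "b" a recall of 0.25, while the per-group thresholds give both groups a recall of 0.5. With real data, ties in scores and unequal numbers of positives per group mean the recalls will usually match only approximately.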