Different Horses for Different Courses: Comparing Bias Mitigation Algorithms in ML

Prakhar Ganesh, Usman Gohar, Lu Cheng, Golnoosh Farnadi
2024-11-19
Abstract: With fairness concerns gaining significant attention in Machine Learning (ML), several bias mitigation techniques have been proposed, often compared against each other to find the best method. These benchmarking efforts tend to use a common setup for evaluation under the assumption that providing a uniform environment ensures a fair comparison. However, bias mitigation techniques are sensitive to hyperparameter choices, random seeds, feature selection, etc., meaning that comparison on just one setting can unfairly favour certain algorithms. In this work, we show significant variance in fairness achieved by several algorithms and the influence of the learning pipeline on fairness scores. We highlight that most bias mitigation techniques can achieve comparable performance, given the freedom to perform hyperparameter optimization, suggesting that the choice of the evaluation parameters, rather than the mitigation technique itself, can sometimes create the perceived superiority of one method over another. We hope our work encourages future research on how various choices in the lifecycle of developing an algorithm impact fairness, and trends that guide the selection of appropriate algorithms.
Machine Learning, Artificial Intelligence, Computers and Society
What problem does this paper attempt to address?
The paper argues that current benchmarks for fairness evaluation and bias mitigation techniques in machine learning (ML) are too simplistic: they compare methods under a single shared experimental setting, which can produce unfair comparisons. Specifically, the paper points out:

1. **Limitations of Existing Benchmarking Methods**:
   - Most current fairness benchmarks adopt a uniform experimental environment (the same hyperparameters, random seeds, etc.) on the assumption that this ensures a fair comparison.
   - However, this ignores how sensitive bias mitigation techniques are to hyperparameter selection, random seeds, and feature selection. An algorithm may perform well in one setting and poorly in another.
2. **Performance Differences of Bias Mitigation Techniques**:
   - The paper shows experimentally that the fairness achieved by a given bias mitigation algorithm varies significantly across hyperparameter settings.
   - For example, an algorithm may achieve better fairness under one hyperparameter setting and perform poorly under others.
3. **Complexity of Selecting the Best Algorithm**:
   - The paper emphasizes that no algorithm is "the best" in all cases, because the ranking of algorithms changes from one setting to another.
   - Selecting the most appropriate bias mitigation technique therefore requires considering additional factors, such as running time, complexity, and theoretical guarantees, rather than only the trade-off between fairness and utility.
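The sensitivity described above can be illustrated with a small sketch. It is not the paper's evaluation code: the function names, the method names, and the numbers are all hypothetical, and the only assumption is that each method has one fairness gap (lower is better) recorded per evaluated setting, where a setting is one combination of seed and hyperparameters.

```python
from statistics import mean, pstdev


def summarize_runs(scores):
    """Summarize fairness gaps per method.

    scores maps a method name to a list of fairness gaps, one per
    evaluated setting (seed / hyperparameter combination). Lower is better.
    """
    return {
        method: {
            "best": min(gaps),
            "worst": max(gaps),
            "mean": mean(gaps),
            "std": pstdev(gaps),
        }
        for method, gaps in scores.items()
    }


def winner_at(scores, index):
    """Method with the smallest gap when all methods share setting `index`."""
    return min(scores, key=lambda method: scores[method][index])


def winner_after_tuning(scores):
    """Method with the smallest gap when each method uses its own best setting."""
    return min(scores, key=lambda method: min(scores[method]))


# Invented numbers for illustration only; these are NOT results from the paper.
toy_scores = {
    "method_a": [0.02, 0.10, 0.07],
    "method_b": [0.08, 0.03, 0.06],
}

print(winner_at(toy_scores, 0))
print(winner_at(toy_scores, 1))
print(winner_after_tuning(toy_scores))
print(summarize_runs(toy_scores))
```

With these toy numbers, the apparent winner changes between the first and second shared setting, which is the kind of effect that reporting only one setting would hide. Reporting the spread per method, as `summarize_runs` does, makes that variance visible.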
### Formula Representation

To make the experimental results easier to follow, the key formulas and concepts are given below:

- **Fairness Metrics**:
  - **Demographic Parity (DP)**: a classifier satisfies DP when the positive prediction rate is the same for every group, where \( \hat{Y} \) is the predicted label and \( A \) is the sensitive attribute with groups \( a \) and \( b \):
    \[ P(\hat{Y}=1 \mid A=a) = P(\hat{Y}=1 \mid A=b) \]
  - **Accuracy**:
    \[ \text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{FP} + \text{FN} + \text{TN}} \]
- **Impact of Hyperparameter Optimization**:
  - Different hyperparameter settings (such as batch size \( B \), learning rate \( \eta \), model architecture, etc.) have a significant impact on fairness and utility.

### Summary

The core objective of the paper is to highlight the limitations of current fairness benchmarking and to argue for a more detailed, context-aware evaluation, so that the most suitable bias mitigation technique can be understood and selected more reliably. Its experiments show that algorithms perform differently under different hyperparameter settings, so multiple factors need to be weighed together when selecting an algorithm.
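The two metrics from the Formula Representation section can be computed directly from predictions. The sketch below is a minimal illustration under stated assumptions: labels and predictions are binary (0 or 1), the sensitive attribute takes two group values, and the helper names are hypothetical. The DP condition is turned into a gap, the absolute difference between the two groups' positive prediction rates, so that 0 means demographic parity holds exactly.

```python
def accuracy(y_true, y_pred):
    """Fraction of correct predictions: (TP + TN) / (TP + FP + FN + TN)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def positive_rate(y_pred, groups, group):
    """Estimate P(Y_hat = 1 | A = group) from binary predictions."""
    preds = [p for p, g in zip(y_pred, groups) if g == group]
    if not preds:
        raise ValueError(f"no samples for group {group!r}")
    return sum(preds) / len(preds)


def demographic_parity_gap(y_pred, groups, a, b):
    """Absolute difference in positive prediction rate between groups a and b."""
    return abs(positive_rate(y_pred, groups, a) - positive_rate(y_pred, groups, b))


# Tiny hand-made example, not data from the paper.
y_true = [1, 0, 1, 0]
y_pred = [1, 0, 0, 0]
groups = ["a", "a", "b", "b"]

print(accuracy(y_true, y_pred))                          # 3 of 4 correct
print(demographic_parity_gap(y_pred, groups, "a", "b"))  # |0.5 - 0.0|
```

Reporting the gap alongside accuracy for each evaluated setting gives the fairness and utility pair that the trade-off discussion refers to.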