Abstract: With fairness concerns gaining significant attention in Machine Learning (ML), several bias mitigation techniques have been proposed, often compared against each other to find the best method. These benchmarking efforts tend to use a common setup for evaluation under the assumption that providing a uniform environment ensures a fair comparison. However, bias mitigation techniques are sensitive to hyperparameter choices, random seeds, feature selection, etc., meaning that comparison on just one setting can unfairly favour certain algorithms. In this work, we show significant variance in fairness achieved by several algorithms and the influence of the learning pipeline on fairness scores. We highlight that most bias mitigation techniques can achieve comparable performance, given the freedom to perform hyperparameter optimization, suggesting that the choice of the evaluation parameters, rather than the mitigation technique itself, can sometimes create the perceived superiority of one method over another. We hope our work encourages future research on how various choices in the lifecycle of developing an algorithm impact fairness, and trends that guide the selection of appropriate algorithms.

What problem does this paper attempt to address?

The main problem this paper addresses is that benchmarks of bias mitigation techniques in machine learning (ML) are too simplistic: methods are usually compared under a single shared experimental setting, which can make the comparison itself unfair. Specifically, the paper points out:

1. **Limitations of Existing Benchmarking Methods**:
   - Most current fairness benchmarks fix one experimental environment (hyperparameters, random seeds, etc.) on the assumption that a uniform setup ensures a fair comparison.
   - However, this ignores how sensitive bias mitigation techniques are to hyperparameter selection, random seeds, and feature selection. An algorithm can perform well in one setting and poorly in another.
2. **Performance Differences of Bias Mitigation Techniques**:
   - The paper shows experimentally that the fairness achieved by different bias mitigation algorithms varies significantly across hyperparameter settings.
   - For example, an algorithm can achieve better fairness under specific hyperparameter settings but perform poorly under others.
3. **Complexity of Selecting the Best Algorithm**:
   - The paper emphasizes that no algorithm is "the best" in all cases, because relative performance changes with the setting.
   - Therefore, selecting the most appropriate bias mitigation technique requires considering additional factors, such as running time, complexity, and theoretical guarantees, rather than just the trade-off between fairness and utility.
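The point about single-setting comparisons can be illustrated with a minimal sketch that summarizes each method over a grid of settings instead of one. Everything named here is an assumption for illustration: `toy_fairness_score` is a hypothetical stand-in that returns seeded pseudo-random numbers (not results from the paper), and the method names `method_a` and `method_b` are placeholders.

```python
import random
import statistics
from itertools import product


def toy_fairness_score(method, learning_rate, seed):
    """Hypothetical stand-in for 'train with this mitigation method and
    return a fairness score'. It only produces seeded pseudo-random noise,
    so the numbers carry no empirical meaning."""
    rng = random.Random(f"{method}-{learning_rate}-{seed}")
    return rng.uniform(0.0, 1.0)


def compare_over_settings(methods, learning_rates, seeds, score_fn):
    """Summarize each method's score over every (learning rate, seed) pair,
    rather than reporting a single setting."""
    summary = {}
    for method in methods:
        scores = [
            score_fn(method, lr, seed)
            for lr, seed in product(learning_rates, seeds)
        ]
        summary[method] = {
            "min": min(scores),
            "mean": statistics.mean(scores),
            "max": max(scores),
            "stdev": statistics.stdev(scores),
        }
    return summary


if __name__ == "__main__":
    results = compare_over_settings(
        methods=["method_a", "method_b"],   # placeholder names
        learning_rates=[0.001, 0.01, 0.1],  # assumed grid
        seeds=range(5),                     # assumed seeds
        score_fn=toy_fairness_score,
    )
    for name, stats in results.items():
        print(name, {k: round(v, 3) for k, v in stats.items()})
```

Replacing `toy_fairness_score` with a real training and evaluation routine would show whether the ranking of two methods holds across the grid or only at particular settings.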
### Formula Representation

To help read the experimental results in the paper, the key metrics and concepts are stated below:

- **Fairness Metrics**:
  - **Demographic Parity (DP)**: satisfied when the positive prediction rate is the same for groups \( a \) and \( b \) of the sensitive attribute \( A \):
    \[ P(\hat{Y}=1 \mid A=a) = P(\hat{Y}=1 \mid A=b) \]
  - **Accuracy**:
    \[ \text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{FP} + \text{FN} + \text{TN}} \]
- **Impact of Hyperparameter Optimization**:
  - Different hyperparameter settings (such as batch size \( B \), learning rate \( \eta \), model architecture, etc.) can have a significant impact on fairness and utility.

### Summary

The core objective of the paper is to highlight the limitations of current fairness benchmarking and to argue for a more detailed, context-aware evaluation that supports understanding and selecting the most suitable bias mitigation technique. Its experiments show that algorithms perform differently under different hyperparameter settings, so multiple factors need to be weighed together when selecting an algorithm.
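The demographic parity and accuracy definitions listed under Formula Representation can be computed directly from predictions. The sketch below is a minimal illustration in plain Python; the function names and the toy data are assumptions, and reporting the absolute gap between the two groups' positive rates is one common way to quantify how far a model is from demographic parity, not necessarily the exact score used in the paper.

```python
def selection_rate(y_pred, groups, group):
    """Estimate P(Y_hat = 1 | A = group) from binary predictions."""
    preds = [p for p, g in zip(y_pred, groups) if g == group]
    if not preds:
        raise ValueError(f"no samples for group {group!r}")
    return sum(preds) / len(preds)


def demographic_parity_difference(y_pred, groups, a, b):
    """Absolute gap between the positive rates of groups a and b.
    A value of 0 means demographic parity holds exactly."""
    return abs(selection_rate(y_pred, groups, a)
               - selection_rate(y_pred, groups, b))


def accuracy(y_true, y_pred):
    """(TP + TN) / (TP + FP + FN + TN), i.e. the fraction of correct predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


if __name__ == "__main__":
    # Toy data, made up for illustration only.
    y_true = [1, 0, 1, 1, 0, 0, 1, 0]
    y_pred = [1, 1, 1, 0, 0, 1, 0, 0]
    groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

    print(demographic_parity_difference(y_pred, groups, "a", "b"))  # 0.5
    print(accuracy(y_true, y_pred))                                 # 0.5
```

In the toy data, group `a` receives positive predictions at a rate of 0.75 and group `b` at 0.25, which gives the gap of 0.5.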

Different Horses for Different Courses: Comparing Bias Mitigation Algorithms in ML

Metrics and methods for a systematic comparison of fairness-aware machine learning algorithms

An Empirical Comparison of Bias Reduction Methods on Real-World Problems in High-Stakes Policy Settings

A Comprehensive Empirical Study of Bias Mitigation Methods for Machine Learning Classifiers

Are Bias Mitigation Techniques for Deep Learning Effective?

Fix Fairness, Don't Ruin Accuracy: Performance Aware Fairness Repair using AutoML

Bias Mitigation for Machine Learning Classifiers: A Comprehensive Survey

Do the Machine Learning Models on a Crowd Sourced Platform Exhibit Bias? An Empirical Study on Model Fairness

Towards A Holistic View of Bias in Machine Learning: Bridging Algorithmic Fairness and Imbalanced Learning

Towards Fair Machine Learning Software: Understanding and Addressing Model Bias Through Counterfactual Thinking

Bias, Fairness, and Accountability with AI and ML Algorithms

A novel approach for assessing fairness in deployed machine learning algorithms

Simultaneous Improvement of ML Model Fairness and Performance by Identifying Bias in Data

Fairness-aware Configuration of Machine Learning Libraries

Data vs. Model Machine Learning Fairness Testing: An Empirical Study

When mitigating bias is unfair: multiplicity and arbitrariness in algorithmic group fairness

Whither Bias Goes, I Will Go: An Integrative, Systematic Review of Algorithmic Bias Mitigation

Putting Fairness Principles into Practice: Challenges, Metrics, and Improvements

'Propose and Review': Interactive Bias Mitigation for Machine Classifiers

Evaluating Fairness Using Permutation Tests

MEDFAIR: Benchmarking Fairness for Medical Imaging