Abstract: Machine learning (ML) may be oblivious to human bias but it is not immune to its perpetuation. Marginalisation and iniquitous group representation are often traceable in the very data used for training, and may be reflected or even enhanced by the learning models. In the present work, we aim at clarifying the role played by data geometry in the emergence of ML bias. We introduce an exactly solvable high-dimensional model of data imbalance, where parametric control over the many bias-inducing factors allows for an extensive exploration of the bias inheritance mechanism. Through the tools of statistical physics, we analytically characterise the typical properties of learning models trained in this synthetic framework and obtain exact predictions for the observables that are commonly employed for fairness assessment. Despite the simplicity of the data model, we retrace and unpack typical unfairness behaviour observed on real-world datasets. We also obtain a detailed analytical characterisation of a class of bias mitigation strategies. We first consider a basic loss-reweighing scheme, which allows for an implicit minimisation of different unfairness metrics, and quantify the incompatibilities between some existing fairness criteria. Then, we consider a novel mitigation strategy based on a matched inference approach, consisting in the introduction of coupled learning models. Our theoretical analysis of this approach shows that the coupled strategy can strike superior fairness-accuracy trade-offs.

What problem does this paper attempt to address?

This paper addresses bias in machine learning (ML) systems that arises from data imbalance during training. Specifically, the authors focus on how the geometric structure of the data shapes the mechanism by which models acquire bias. By introducing a high-dimensional, exactly solvable model of data imbalance, they explore how different bias-inducing factors affect model performance and provide theoretical explanations.

### Main Research Questions

1. **Relationship between Data Geometric Structure and Bias**: Investigate how geometric characteristics of a dataset (such as the representation ratio and variance of different subgroups) affect bias in machine learning models.
2. **Effectiveness of Bias Mitigation Strategies**: Evaluate how well different bias mitigation strategies (such as loss reweighting schemes and coupled inference settings) reduce model bias.
3. **Trade-off between Fairness and Accuracy**: Explore the trade-off between fairness and accuracy among different subgroups under varying dataset sizes.

### Research Methods

- **Teacher-Mixture (T-M) Model**: The authors propose a new generative model that produces high-dimensional correlated data, allowing precise control and analysis of data imbalance and of the mechanisms that generate bias.
- **Statistical Physics Tools**: Methods from statistical physics are used to analytically characterize the typical performance of trained models and to obtain exact predictions for common fairness evaluation metrics.
- **Experimental Validation**: The theoretical results are checked against numerical simulations.

### Main Findings

- **Impact of Data Imbalance**: Factors such as the representation ratio and variance of different subgroups in the dataset significantly affect model bias. Even when the task is solvable, trained models may exhibit bias.
- **Bias Mitigation Strategies**: Loss reweighting schemes can reduce bias to some extent, but different fairness criteria are mutually incompatible. Coupled inference settings can achieve a better fairness-accuracy trade-off.
- **Positive Transfer Effect**: When the rules of different subgroups are sufficiently similar, joint training can improve accuracy on the smaller subgroup, a positive transfer effect.

### Conclusion

By introducing and analyzing the T-M model, this study provides a theoretical foundation for understanding how the geometric structure of data affects machine learning bias, and it analyzes bias mitigation strategies within that framework. These findings can inform the design of fairer and more accurate machine learning systems, especially when dealing with real-world data.
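As a rough illustration of the loss reweighting idea summarized above, the following Python sketch trains a linear classifier on a synthetic two-group dataset and reports accuracy per group. It is not the authors' code: the two-cluster generator, the parameter names (`rho`, `var_plus`, `var_minus`, `overlap`, `w_minority`), the logistic loss, and plain gradient descent are all assumptions chosen to loosely mirror the setting described here (subgroups with different representation ratios and variances, each labelled by its own rule).

```python
import numpy as np

def make_model(d, overlap, rng):
    """Draw a cluster direction and two labelling rules (all hypothetical).

    `overlap` in [-1, 1] sets how correlated the two groups' rules are.
    """
    v = rng.standard_normal(d)
    t_plus = rng.standard_normal(d)
    t_minus = overlap * t_plus + np.sqrt(1.0 - overlap**2) * rng.standard_normal(d)
    return v, t_plus, t_minus

def sample(n, model, rho, var_plus, var_minus, rng):
    """Sample n points: group +1 with probability rho, group -1 otherwise."""
    v, t_plus, t_minus = model
    d = v.shape[0]
    group = np.where(rng.random(n) < rho, 1, -1)
    std = np.where(group == 1, np.sqrt(var_plus), np.sqrt(var_minus))
    # Each group is an isotropic Gaussian cloud centred at +v/sqrt(d) or -v/sqrt(d).
    x = group[:, None] * v / np.sqrt(d) + std[:, None] * rng.standard_normal((n, d))
    # Each point is labelled by the rule belonging to its own group.
    teacher = np.where(group[:, None] == 1, t_plus, t_minus)
    y = np.where(np.sum(x * teacher, axis=1) >= 0, 1.0, -1.0)
    return x, y, group

def fit_reweighted_logistic(x, y, group, w_minority, lr=0.1, steps=500, l2=1e-3):
    """Gradient descent on a logistic loss where group -1 samples get weight w_minority."""
    n, d = x.shape
    weights = np.where(group == -1, w_minority, 1.0)
    weights = weights / weights.mean()  # keep the overall loss scale comparable
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        margin = y * (x @ w + b)
        # d/dmargin log(1 + exp(-margin)) = -1 / (1 + exp(margin)), written stably
        g = -weights * y * 0.5 * (1.0 - np.tanh(margin / 2.0))
        w -= lr * (x.T @ g / n + l2 * w)
        b -= lr * g.mean()
    return w, b

def group_accuracies(w, b, x, y, group):
    """Accuracy computed separately on each group."""
    pred = np.where(x @ w + b >= 0, 1.0, -1.0)
    return {g: float(np.mean(pred[group == g] == y[group == g])) for g in (1, -1)}

rng = np.random.default_rng(0)
model = make_model(d=50, overlap=0.5, rng=rng)
x_train, y_train, g_train = sample(2000, model, rho=0.9, var_plus=1.0, var_minus=1.0, rng=rng)
x_test, y_test, g_test = sample(2000, model, rho=0.9, var_plus=1.0, var_minus=1.0, rng=rng)

results = {}
for w_minority in (1.0, 5.0):  # 1.0 = no reweighting, 5.0 = upweight the minority group
    w, b = fit_reweighted_logistic(x_train, y_train, g_train, w_minority)
    results[w_minority] = group_accuracies(w, b, x_test, y_test, g_test)
    print(w_minority, results[w_minority])
```

Comparing the per-group accuracies for the two values of `w_minority` gives a simple empirical view of the fairness-accuracy trade-off. Varying `rho`, the two variances, and `overlap` changes how strongly the majority group dominates the learned rule.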

Bias-inducing geometries: an exactly solvable data model with fairness implications
