Abstract: Recent years have seen a surge of machine learning approaches aimed at reducing disparities in model outputs across different subgroups. In many settings, training data may be used in multiple downstream applications by different users, which means it may be most effective to intervene on the training data itself. In this work, we present FairWASP, a novel pre-processing approach designed to reduce disparities in classification datasets without modifying the original data. FairWASP returns sample-level weights such that the reweighted dataset minimizes the Wasserstein distance to the original dataset while satisfying (an empirical version of) demographic parity, a popular fairness criterion. We show theoretically that integer weights are optimal, which means our method can be equivalently understood as duplicating or eliminating samples. FairWASP can therefore be used to construct datasets which can be fed into any classification method, not just methods which accept sample weights. Our work is based on reformulating the pre-processing task as a large-scale mixed-integer program (MIP), for which we propose a highly efficient algorithm based on the cutting plane method. Experiments demonstrate that our proposed optimization algorithm significantly outperforms state-of-the-art commercial solvers in solving both the MIP and its linear program relaxation. Further experiments highlight the competitive performance of FairWASP in reducing disparities while preserving accuracy in downstream classification settings.
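The equivalence between integer weights and duplicating or eliminating samples can be illustrated with a minimal sketch. The helper name `materialize` and the toy rows are hypothetical and not taken from the paper; the weights are assumed to have been computed already by some solver.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def materialize(samples: Sequence[T], weights: Sequence[int]) -> List[T]:
    """Repeat sample i exactly weights[i] times; a weight of 0 drops it.

    Hypothetical helper: it only expands integer weights that are already
    given, it does not compute them.
    """
    if len(samples) != len(weights):
        raise ValueError("samples and weights must have the same length")
    expanded: List[T] = []
    for sample, weight in zip(samples, weights):
        if weight < 0:
            raise ValueError("weights must be non-negative integers")
        expanded.extend([sample] * weight)
    return expanded


# Toy rows of the form (feature, label); the values are made up.
rows = [("a", 0), ("b", 1), ("c", 1), ("d", 0)]
theta = [2, 0, 1, 1]  # duplicate "a", eliminate "b", keep "c" and "d"

resampled = materialize(rows, theta)
print(resampled)
```

The expanded list can be passed to a classifier that has no sample-weight argument, which is the practical consequence the abstract draws from integer optimality.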

### What problem does this paper attempt to solve?

This paper addresses disparities in model outputs across subgroups in machine learning, and specifically how to reduce those disparities in classification datasets. The authors propose a pre-processing method named **FairWASP**, which reweights the training samples so that the Wasserstein distance between the reweighted dataset and the original dataset is minimized, while (an empirical version of) demographic parity, a common fairness criterion, is satisfied.

#### Specific problem description

1. **Background problems**:
   - Many machine learning models may inherit biases against certain protected characteristics (such as race or gender) from the data.
   - In many application scenarios, the same training data may be used by multiple users for different downstream tasks, so intervening on the training data itself may be the most effective strategy.
2. **Limitations of existing methods**:
   - Existing pre-processing methods usually reduce bias by changing feature values or labels, over-sampling or under-sampling the data, or generating synthetic data. In high-stakes fields such as finance and healthcare, however, directly modifying the attributes or labels of customers or patients may be unethical or even illegal.
   - Some methods achieve fairness by learning sample weights, but they usually give no guarantee on how far the reweighted data distribution is perturbed from the original.
3. **Innovations of FairWASP**:
   - **No modification of the original data**: FairWASP leaves the original samples untouched and adjusts the dataset only through a learned set of sample weights, chosen to minimize the Wasserstein distance to the original data while ensuring fairness.
   - **Optimality of integer weights**: The authors prove that integer weights are optimal, which means FairWASP can be implemented by duplicating or eliminating samples, and its output can be fed into any classification algorithm.
   - **Efficient solution algorithm**: A highly efficient algorithm based on the cutting-plane method is proposed, which significantly outperforms state-of-the-art commercial solvers.
4. **Mathematical formulation of the optimization problem**:
   - Given a dataset \( Z = \{(D_i, X_i, Y_i)\}_{i=1}^n \), the goal is to find sample weights \( \theta = \{\theta_i\}_{i=1}^n \) such that the Wasserstein distance between the reweighted data distribution \( p_{Z;\theta} \) and the original data distribution \( p_{Z;e} \) is minimized:

     \[ \min_{\theta \in I_n \cap \Delta_n} W_c(p_{Z;\theta}, p_{Z;e}) \]

     where \( I_n \) is the set of integer vectors and \( \Delta_n \) is the set of valid weight vectors, subject to the demographic parity constraint

     \[ J\big(p_{Z;\theta}(Y = y \mid D = d),\ p_Y(y)\big) \leq \epsilon, \quad \forall d \in \mathcal{D},\ y \in \mathcal{Y} \]

     Here \( J(\cdot, \cdot) \) is the probability-ratio distance, and \( \mathcal{D} \) and \( \mathcal{Y} \) denote the sets of values taken by the protected attribute \( D \) and the label \( Y \).
5. **Experimental verification**:
   - Experiments show that FairWASP reduces disparities while preserving accuracy in downstream classification tasks, and that its optimization algorithm significantly outperforms commercial solvers on both the large-scale mixed-integer program and its linear program relaxation.

In summary, the paper shows how to reduce disparities between groups at the pre-processing stage without modifying the original data, by means of the FairWASP method.
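The demographic parity constraint in the formulation above can be checked for a given weight vector with a short sketch. Two assumptions are made here that the summary does not spell out: the probability-ratio distance is taken to be \( J(p, q) = |p/q - 1| \), and \( p_Y \) is taken to be the label marginal of the original, unweighted data. The function name `parity_gap` and the toy data are hypothetical, and the sketch only evaluates the constraint; it does not solve the optimization problem.

```python
from collections import Counter, defaultdict
from typing import Hashable, Sequence


def parity_gap(
    d: Sequence[Hashable], y: Sequence[Hashable], theta: Sequence[float]
) -> float:
    """Return max over (d, y) of |p_theta(Y=y | D=d) / p_Y(y) - 1|.

    Assumes J(p, q) = |p / q - 1| and that p_Y is the empirical label
    marginal of the original (unweighted) data.
    """
    n = len(y)
    p_y = {label: count / n for label, count in Counter(y).items()}

    group_mass = defaultdict(float)  # total weight per group d
    joint_mass = defaultdict(float)  # total weight per (d, y) pair
    for di, yi, ti in zip(d, y, theta):
        group_mass[di] += ti
        joint_mass[(di, yi)] += ti

    worst = 0.0
    for group, mass in group_mass.items():
        if mass == 0:
            raise ValueError(f"group {group!r} has zero total weight")
        for label, target in p_y.items():
            conditional = joint_mass[(group, label)] / mass
            worst = max(worst, abs(conditional / target - 1.0))
    return worst


# Toy data: group 0 has 3 of 4 positive labels, group 1 has 1 of 4.
d = [0, 0, 0, 0, 1, 1, 1, 1]
y = [1, 1, 1, 0, 1, 0, 0, 0]

uniform = [1] * 8                   # the original dataset, theta = e
theta = [1, 1, 0, 2, 2, 0, 1, 1]    # hand-picked integer weights, sum = 8

print(parity_gap(d, y, uniform))    # 0.5
print(parity_gap(d, y, theta))      # 0.0
```

A weight vector is feasible for a tolerance \( \epsilon \) when `parity_gap` returns a value of at most \( \epsilon \). The hand-picked `theta` is feasible for any \( \epsilon \geq 0 \) on this toy data, but nothing in the sketch says it is the Wasserstein-closest feasible choice; finding that is the job of the mixed-integer program.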

FairWASP: Fast and Optimal Fair Wasserstein Pre-processing
