Zikai Xiong, Niccolò Dalmasso, Alan Mishler, Vamsi K. Potluru, Tucker Balch, Manuela Veloso
Abstract: Recent years have seen a surge of machine learning approaches aimed at reducing disparities in model outputs across different subgroups. In many settings, training data may be used in multiple downstream applications by different users, which means it may be most effective to intervene on the training data itself. In this work, we present FairWASP, a novel pre-processing approach designed to reduce disparities in classification datasets without modifying the original data. FairWASP returns sample-level weights such that the reweighted dataset minimizes the Wasserstein distance to the original dataset while satisfying (an empirical version of) demographic parity, a popular fairness criterion. We show theoretically that integer weights are optimal, which means our method can be equivalently understood as duplicating or eliminating samples. FairWASP can therefore be used to construct datasets which can be fed into any classification method, not just methods which accept sample weights. Our work is based on reformulating the pre-processing task as a large-scale mixed-integer program (MIP), for which we propose a highly efficient algorithm based on the cutting plane method. Experiments demonstrate that our proposed optimization algorithm significantly outperforms state-of-the-art commercial solvers in solving both the MIP and its linear program relaxation. Further experiments highlight the competitive performance of FairWASP in reducing disparities while preserving accuracy in downstream classification settings.
### What problem does this paper attempt to solve?
This paper addresses disparities in machine learning model outputs across subgroups, focusing on reducing them in classification datasets. Specifically, the authors propose a pre-processing method named **FairWASP**, which learns sample-level weights for the training data such that the reweighted dataset minimizes the Wasserstein distance to the original dataset while satisfying (an empirical version of) demographic parity, a popular fairness criterion.
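To make the objective more concrete, the following is a minimal sketch of how far a reweighting moves an empirical distribution. It computes the Wasserstein-1 distance between two weightings of the same one-dimensional support. The restriction to one dimension with cost |x - y|, the function name `wasserstein_1d`, and the toy data are assumptions for illustration only; the summarized formulation uses a cost-based distance over the full samples.

```python
from typing import Sequence


def wasserstein_1d(points: Sequence[float],
                   weights_a: Sequence[float],
                   weights_b: Sequence[float]) -> float:
    """Wasserstein-1 distance between two weightings of the same 1-D points.

    Illustrative simplification: in one dimension with cost |x - y|, the
    distance equals the area between the two cumulative distribution functions.
    """
    total_a, total_b = sum(weights_a), sum(weights_b)
    order = sorted(range(len(points)), key=lambda i: points[i])
    cdf_gap = 0.0   # running difference between the two CDFs
    distance = 0.0
    for k in range(len(order) - 1):
        i = order[k]
        cdf_gap += weights_a[i] / total_a - weights_b[i] / total_b
        distance += abs(cdf_gap) * (points[order[k + 1]] - points[i])
    return distance


points = [0.0, 1.0, 2.0, 3.0]
uniform = [1, 1, 1, 1]       # original dataset: every sample counted once
reweighted = [2, 0, 1, 1]    # hypothetical weights: duplicate one sample, drop another

shift = wasserstein_1d(points, reweighted, uniform)
no_shift = wasserstein_1d(points, uniform, uniform)
```

In this toy case the reweighting moves a quarter of the probability mass over a distance of 1, so `shift` is 0.25, while leaving the weights uniform gives a distance of 0.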
#### Specific problem description
1. **Background problems**:
- Many machine learning models may inherit biases against certain protected characteristics (such as race or gender) from the data.
- In many application scenarios, training data may be used by multiple users for different downstream tasks, so intervening in the training data itself may be the most effective strategy.
2. **Limitations of existing methods**:
- Existing pre-processing methods usually reduce bias by changing feature values or labels, over-sampling or under-sampling data, or generating synthetic data. However, in high-stakes fields such as finance and healthcare, directly modifying the attributes or labels of customers or patients may be unethical or even illegal.
- Some methods achieve fairness by learning sample weights, but they usually offer no guarantee on how much the reweighting perturbs the data distribution.
3. **Innovations of FairWASP**:
- **No modification of the original data**: FairWASP leaves the original data unchanged and instead learns a set of sample weights, chosen to minimize the Wasserstein distance to the original dataset while satisfying the fairness constraint.
- **Integer weight optimality**: The authors show theoretically that integer weights are optimal, so FairWASP can be equivalently understood as duplicating or eliminating samples, and its output can be fed into any classification method, not just methods that accept sample weights.
- **Efficient solution algorithm**: The pre-processing task is reformulated as a large-scale mixed-integer program (MIP) and solved with a highly efficient algorithm based on the cutting plane method, which in the paper's experiments significantly outperforms state-of-the-art commercial solvers on both the MIP and its linear program relaxation.
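The equivalence between integer weights and duplicating or eliminating samples can be sketched as follows. The helper name `expand_by_integer_weights`, the tuple layout of a sample, and the toy weights are assumptions for illustration; they are not taken from the paper's code.

```python
from typing import List, Sequence, Tuple

# Hypothetical sample layout: (protected attribute D, features X, label Y).
Sample = Tuple[int, Tuple[float, ...], int]


def expand_by_integer_weights(samples: Sequence[Sample],
                              weights: Sequence[int]) -> List[Sample]:
    """Build a plain dataset in which sample i appears weights[i] times.

    A weight of 0 eliminates the sample; a weight above 1 duplicates it.
    """
    if len(samples) != len(weights):
        raise ValueError("samples and weights must have the same length")
    expanded: List[Sample] = []
    for sample, weight in zip(samples, weights):
        if weight < 0 or int(weight) != weight:
            raise ValueError("weights must be non-negative integers")
        expanded.extend([sample] * int(weight))
    return expanded


samples = [
    (0, (1.0,), 1),
    (0, (2.0,), 0),
    (1, (3.0,), 1),
    (1, (4.0,), 0),
]
weights = [2, 0, 1, 1]  # hypothetical integer weights
resampled = expand_by_integer_weights(samples, weights)
```

The resulting list can be passed to a classifier that has no notion of sample weights, which is the practical benefit the summary attributes to integer solutions.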
4. **Mathematical formulation of the optimization problem**:
- Given a dataset \( Z=\{(D_i, X_i, Y_i)\}_{i=1}^n \), the goal is to find a set of sample weights \( \theta=\{\theta_i\}_{i=1}^n \) such that the Wasserstein distance between the reweighted data distribution \( p_{Z;\theta} \) and the original data distribution \( p_{Z;e} \) is minimized, subject to the demographic parity constraint:
\[
\min_{\theta\in I_n\cap\Delta_n} W_c(p_{Z;\theta}, p_{Z;e})
\]
where \( I_n \) is the set of integer vectors, \( \Delta_n \) is the set of valid weight vectors, and the constraint is:
\[
J\big(p_{Z;\theta}(Y=y \mid D=d),\, p_Y(y)\big)\leq\epsilon,\quad\forall d\in\mathcal{D},\ y\in\mathcal{Y}
\]
Here \( J(\cdot,\cdot) \) is a distance between probabilities based on their ratio, and \( \mathcal{D} \) and \( \mathcal{Y} \) denote the sets of values taken by the protected attribute \( D \) and the label \( Y \).
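The constraint above can be checked for a candidate weight vector with a short sketch like the one below. Two points are assumptions rather than statements from this summary: the ratio distance is taken to be `max(p/q, q/p) - 1`, and \( p_Y \) is taken to be the empirical label marginal of the original, unweighted data. The function names and toy data are hypothetical.

```python
import math
from collections import defaultdict


def ratio_distance(p: float, q: float) -> float:
    """Assumed form of J: max(p/q, q/p) - 1, which is 0 exactly when p == q."""
    if p <= 0.0 or q <= 0.0:
        return math.inf
    return max(p / q, q / p) - 1.0


def max_parity_violation(groups, labels, weights) -> float:
    """Largest J(p_theta(Y=y | D=d), p_Y(y)) over all groups d and labels y."""
    n = len(labels)
    # Reference marginal p_Y(y), taken from the original (unweighted) data.
    label_marginal = defaultdict(float)
    for y in labels:
        label_marginal[y] += 1.0 / n
    # Weighted mass per group and per (group, label) pair.
    group_mass = defaultdict(float)
    joint_mass = defaultdict(float)
    for d, y, w in zip(groups, labels, weights):
        group_mass[d] += w
        joint_mass[(d, y)] += w
    worst = 0.0
    for d, mass in group_mass.items():
        if mass <= 0.0:
            return math.inf  # conditional distribution undefined for this group
        for y, p_y in label_marginal.items():
            conditional = joint_mass.get((d, y), 0.0) / mass
            worst = max(worst, ratio_distance(conditional, p_y))
    return worst


groups = [0, 0, 0, 0, 1, 1, 1, 1]
labels = [1, 1, 1, 0, 1, 0, 0, 0]
uniform = [1] * 8
reweighted = [1, 1, 0, 2, 2, 0, 1, 1]  # hypothetical integer weights

violation_before = max_parity_violation(groups, labels, uniform)
violation_after = max_parity_violation(groups, labels, reweighted)
```

In this toy example the uniform weights give a worst-case violation of 1.0, while the integer reweighting equalizes the label rates in both groups and gives 0, so it would satisfy the constraint for any \( \epsilon \geq 0 \).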
5. **Experimental verification**:
- Experiments show that FairWASP reduces disparities while preserving accuracy in downstream classification tasks, and that its optimization algorithm significantly outperforms state-of-the-art commercial solvers on the large-scale mixed-integer program and its linear program relaxation.
In summary, the paper addresses how to reduce disparities across subgroups at the pre-processing stage without modifying the original data, and proposes FairWASP as a solution.