Abstract: In this paper, we interrogate whether data quality issues track demographic group membership (based on sex, race and age) and whether automated data cleaning, of the kind commonly used in production ML systems, impacts the fairness of predictions made by these systems. To the best of our knowledge, the impact of data cleaning on fairness in downstream tasks has not been investigated in the literature. We first analyse the tuples flagged by common error detection strategies in five research datasets. We find that, while specific data quality issues, such as higher rates of missing values, are associated with membership in historically disadvantaged groups, poor data quality does not generally track demographic group membership. As a follow-up, we conduct a large-scale empirical study on the impact of automated data cleaning on fairness, involving more than 26,000 model evaluations. We observe that, while automated data cleaning is unlikely to worsen accuracy, it is more likely to worsen fairness than to improve it, especially when the cleaning techniques are not carefully chosen. Furthermore, we find that the positive or negative impact of a particular cleaning technique often depends on the choice of fairness metric and group definition (single-attribute or intersectional). We make our code and experimental results publicly available. The analysis we conducted in this paper is difficult, primarily because it requires that we think holistically about disparities in data quality, disparities in the effectiveness of data cleaning methods, and impacts of such disparities on ML model performance for different demographic groups. Such holistic analysis can and should be supported by data engineering tools, and requires substantial data engineering research. Towards this goal, we discuss open research questions and envision the development of fairness-aware data cleaning methods and their integration into complex pipelines for ML-based decision making.
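The two questions the abstract raises (do data quality issues track group membership, and does a prediction gap between groups change after cleaning) can be illustrated with a minimal sketch. This is not the authors' code. The function names `missing_rate_by_group` and `positive_rate_gap`, the column names, and the toy records are all hypothetical, and the gap in positive-prediction rates is only one of many possible fairness metrics.

```python
from collections import defaultdict


def missing_rate_by_group(rows, group_key):
    """Fraction of rows per group with at least one missing (None) value.

    Hypothetical helper: `rows` is a list of dicts, `group_key` names the
    demographic attribute used to split them.
    """
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for row in rows:
        group = row[group_key]
        totals[group] += 1
        # A row counts as flagged if any non-group attribute is missing.
        if any(v is None for k, v in row.items() if k != group_key):
            flagged[group] += 1
    return {g: flagged[g] / totals[g] for g in totals}


def positive_rate_gap(predictions, groups):
    """Largest difference in positive-prediction rate across groups.

    `predictions` holds 0/1 model outputs, `groups` the matching group labels.
    """
    counts = defaultdict(int)
    positives = defaultdict(int)
    for pred, group in zip(predictions, groups):
        counts[group] += 1
        positives[group] += pred
    rates = [positives[g] / counts[g] for g in counts]
    return max(rates) - min(rates)


# Invented toy records, for illustration only (not data from the paper).
rows = [
    {"sex": "F", "age": 34, "income": None},
    {"sex": "F", "age": 51, "income": 42000},
    {"sex": "M", "age": 29, "income": 38000},
    {"sex": "M", "age": 45, "income": 51000},
]
rates = missing_rate_by_group(rows, "sex")

# Invented predictions from one hypothetical model run.
gap = positive_rate_gap([1, 0, 1, 1], ["F", "F", "M", "M"])
```

Computing `positive_rate_gap` once on predictions from a model trained on the raw data and once on predictions from a model trained on the cleaned data gives a simple before/after comparison. As the abstract notes, the direction of the change can depend on the fairness metric and on whether groups are defined by a single attribute or intersectionally.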

Fair and Private Data Preprocessing through Microaggregation

The Impact of Data Preparation on the Fairness of Software Systems

FairPrep: Promoting Data to a First-Class Citizen in Studies on Fairness-Enhancing Interventions

Fairness-Driven Private Collaborative Machine Learning

Statistical Privacy Guarantees of Machine Learning Preprocessing Techniques

Privacy at a Price: Exploring its Dual Impact on AI Fairness

Provable Privacy with Non-Private Pre-Processing

Fair Decision Making using Privacy-Protected Data

Data Preparation for Fairness-Performance Trade-Offs: A Practitioner-Friendly Alternative?

A Canonical Data Transformation for Achieving Inter- and Within-group Fairness

Automated Data Cleaning Can Hurt Fairness in Machine Learning-based Decision Making

Towards Fairness and Privacy: A Novel Data Pre-processing Optimization Framework for Non-binary Protected Attributes

Decision Making with Differential Privacy under a Fairness Lens

Fairness Issues and Mitigations in (Differentially Private) Socio-demographic Data Processes

Fairness in Machine Learning with Tractable Models

Lazy Data Practices Harm Fairness Research

Fair Data Representation for Machine Learning at the Pareto Frontier

Fair Generalized Linear Mixed Models

A Systematic and Formal Study of the Impact of Local Differential Privacy on Fairness: Preliminary Results

Fair Preprocessing: Towards Understanding Compositional Fairness of Data Transformers in Machine Learning Pipeline

Differentially Private Post-Processing for Fair Regression