Abstract: Our society collects data on people for a wide range of applications, from building a census for policy evaluation to running meaningful clinical trials. To collect data, we typically sample individuals with the goal of accurately representing a population of interest. However, current sampling processes often collect data opportunistically from data sources, which can lead to datasets that are biased and not representative, i.e., the collected dataset does not accurately reflect the demographic distribution of the true population. This is a concern because subgroups within the population can be under- or over-represented in a dataset, which may harm generalizability and lead to an unequal distribution of benefits and harms from downstream tasks that use such datasets (e.g., algorithmic bias in medical decision-making algorithms). In this paper, we assess the relationship between dataset representativeness and the group fairness of classifiers trained on that dataset. We demonstrate that there is a natural tension between dataset representativeness and classifier fairness; empirically, we observe that training datasets with better representativeness can frequently result in classifiers with higher rates of unfairness. We provide some intuition as to why this occurs via a set of theoretical results in the case of univariate classifiers. We also find that over-sampling underrepresented groups can result in classifiers that exhibit greater bias against those groups. Lastly, we observe that fairness-aware sampling strategies (i.e., those specifically designed to select data with high downstream fairness) will often over-sample members of majority groups. These results demonstrate that the relationship between dataset representativeness and downstream classifier fairness is complex; balancing these two quantities requires special care from both model and dataset designers.

A Dataset Representativeness Metric and A Slicing Sampling Strategy for the Kennard-Stone Algorithm

A New Sequential Sampling Method for Surrogate Modeling Based on a Hybrid Metric

Sample Weighting: an Inherent Approach for Outlier Suppressing Discriminant Analysis

Subsampling and Jackknifing: A Practically Convenient Solution for Large Data Analysis with Limited Computational Resources

Adaptive Sampling Strategies to Construct Equitable Training Datasets

Representative Selection Based on Sparse Modeling

Sample size determination for multidimensional parameters and the A-optimal subsampling in a big data linear regression model

Relationship-aware Multivariate Sampling Strategy for Scientific Simulation Data

Projection-Uniform Subsampling Methods for Big Data

Dataset Representativeness and Downstream Task Fairness

Feature Screening for Massive Data Analysis by Subsampling

Dataset Quantization with Active Learning based Adaptive Sampling

A Recursive Subdivision Technique for Sampling Multi-class Scatterplots

A Weighted K-Center Algorithm for Data Subset Selection

Leveraging Discarded Samples for Tighter Estimation of Multiple-Set Aggregates

Sample Contribution Pattern Based Big Data Mining Optimization Algorithms

Layered Sampling for Robust Optimization Problems

CDFRS: A scalable sampling approach for efficient big data analysis

Robust and efficient subsampling algorithms for massive data logistic regression

Iterative Subsampling in Solution Path Clustering of Noisy Big Data

Efficient Approaches to K Representative G-Skyline Queries