Abstract: In large-scale recommendation systems, the vast array of items makes it infeasible to obtain accurate user preferences for each product, resulting in a common issue of missing labels. Typically, only items previously recommended to users have associated ground truth data. Although there is extensive research on fairness concerning fully observed user-item interactions, the challenge of fairness in scenarios with missing labels remains underexplored. Previous methods often treat samples with missing labels as negative, which can make the measured fairness metrics deviate significantly from their ground-truth values. Our study addresses this gap by proposing a novel method that employs a small amount of randomized traffic to estimate fairness metrics accurately. We present theoretical bounds for the estimation error of our fairness metric and support our findings with empirical evidence on real data. Our numerical experiments on synthetic and TikTok's real-world data validate our theory and show the efficiency and effectiveness of our novel methods. To the best of our knowledge, we are the first to emphasize the necessity of random traffic in dataset collection for recommendation fairness, the first to publish a fairness-related dataset from TikTok, and the first to provide reliable estimates of fairness metrics in the context of large-scale recommendation systems with missing labels.

What problem does this paper attempt to address?

The paper addresses fairness measurement in large-scale recommendation systems when labels are missing. Specifically, it asks how to measure the fairness of a recommendation system accurately when the true preferences of users for most items are unknown. Because the item catalog is so large, it is difficult to obtain accurate preference information for every user on every item, so most user-item pairs have no preference label. Traditional methods often treat these missing labels as negative samples, which can significantly bias fairness metrics. To overcome this limitation, the authors propose an approach that uses random traffic data to estimate fairness metrics, and they demonstrate both the effectiveness and the necessity of this method.

The main contributions include:

1. **Identifying issues in existing literature**: Demonstrating the necessity of including random traffic data for accurately estimating label-dependent fairness metrics when there are missing preference labels in large-scale recommendation datasets, and showcasing the shortcomings of simplified methods in previous literature.
2. **Providing efficient algorithmic tools**: Proposing effective and practical estimation algorithms for monitoring fairness metrics, with theoretical guarantees on error bounds.
3. **Empirical studies**: Validating the theoretical results through synthetic data and real-world datasets from TikTok, demonstrating the effectiveness and efficiency of the proposed method.

Additionally, the paper releases a real-world dataset from TikTok on short video recommendations, which is the first dataset from TikTok released for fairness research. Analysis of this dataset verifies the necessity of using random traffic, as well as the correctness and computational advantages of the proposed estimation methods.
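The effect described above can be illustrated with a small synthetic simulation. This is only a sketch under stated assumptions, not the paper's estimator: the fairness metric (a gap in true-positive rate between two groups), the function names `tpr_gap` and `simulate`, and all rates used are hypothetical choices made for illustration. In this toy setup a label is observed in regular traffic only when the item was recommended, while a small random slice of traffic is exposed regardless of the model's prediction.

```python
import random


def tpr_gap(rows):
    """Gap in true-positive rate between group 0 and group 1.

    rows: iterable of (group, label, predicted_positive) tuples.
    """
    hits = {0: 0, 1: 0}
    positives = {0: 0, 1: 0}
    for group, label, predicted in rows:
        if label == 1:
            positives[group] += 1
            hits[group] += predicted
    return hits[0] / positives[0] - hits[1] / positives[1]


def simulate(n_pairs=200_000, random_share=0.05, seed=0):
    """Return (true gap, missing-as-negative gap, random-traffic gap)."""
    rng = random.Random(seed)
    # Hypothetical rates: the model finds positives more often for group 0.
    tpr = {0: 0.8, 1: 0.5}
    fpr = 0.1
    full, naive, randomized = [], [], []
    for _ in range(n_pairs):
        group = rng.randrange(2)
        label = int(rng.random() < 0.3)
        predicted = int(rng.random() < (tpr[group] if label else fpr))
        # Ground truth: every label is known (unavailable in practice).
        full.append((group, label, predicted))
        # Regular traffic: the label is seen only if the item was
        # recommended; a missing label is treated as negative.
        naive.append((group, label if predicted else 0, predicted))
        # Random traffic: a small uniform slice is exposed regardless
        # of the prediction, so its labels are observed without bias.
        if rng.random() < random_share:
            randomized.append((group, label, predicted))
    return tpr_gap(full), tpr_gap(naive), tpr_gap(randomized)


if __name__ == "__main__":
    true_gap, naive_gap, random_gap = simulate()
    print(f"true gap:            {true_gap:.3f}")
    print(f"missing-as-negative: {naive_gap:.3f}")
    print(f"random traffic:      {random_gap:.3f}")
```

In this construction the missing-as-negative estimate reports a gap of exactly zero, because every label that survives as positive belongs to a recommended item, so the disparity between groups is hidden. The random-traffic estimate is noisy but centered on the true gap, and its error shrinks as the random slice grows.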

Measuring Fairness in Large-Scale Recommendation Systems with Missing Labels

Covering Diversification and Fairness for Better Recommendation (Short Paper)

FairRec: Fairness Testing for Deep Recommender Systems

Fairness in Recommendation Ranking through Pairwise Comparisons

Fairness in Recommendation: A Survey

Towards Long-term Fairness in Recommendation

Fairness in Recommendation: Foundations, Methods and Applications

New Fairness Metrics for Recommendation that Embrace Differences

User-oriented Fairness in Recommendation

Fair Recommendations with Limited Sensitive Attributes: A Distributionally Robust Optimization Approach

A Survey on the Fairness of Recommender Systems

Improving Recommendation Fairness via Data Augmentation

Make Fairness More Fair: Fair Item Utility Estimation and Exposure Re-Distribution

Is ChatGPT Fair for Recommendation? Evaluating Fairness in Large Language Model Recommendation

On the Fairness of Randomized Trials for Recommendation with Heterogeneous Demographics and Beyond

Fairness in Recommender Systems: Evaluation Approaches and Assurance Strategies

Social Recommendation with Missing Not at Random Data

Intersectional Two-sided Fairness in Recommendation

Tutorial on Fairness of Machine Learning in Recommender Systems

When Fairness Meets Bias: a Debiased Framework for Fairness Aware Top-N Recommendation

Fairness in Ranking, Part II: Learning-to-Rank and Recommender Systems