Measuring Fairness in Large-Scale Recommendation Systems with Missing Labels

Yulong Dong, Kun Jin, Xinghai Hu, Yang Liu
2024-06-08
Abstract: In large-scale recommendation systems, the vast array of items makes it infeasible to obtain accurate user preferences for each product, resulting in a common issue of missing labels. Typically, only items previously recommended to users have associated ground truth data. Although there is extensive research on fairness concerning fully observed user-item interactions, the challenge of fairness in scenarios with missing labels remains underexplored. Previous methods often treat samples with missing labels as negative, which can cause the resulting estimates to deviate significantly from the ground truth fairness metrics. Our study addresses this gap by proposing a novel method that employs a small amount of randomized traffic to estimate fairness metrics accurately. We present theoretical bounds for the estimation error of our fairness metric and support our findings with empirical evidence on real data. Our numerical experiments on synthetic data and TikTok's real-world data validate our theory and show the efficiency and effectiveness of our novel methods. To the best of our knowledge, we are the first to emphasize the necessity of random traffic in dataset collection for recommendation fairness, the first to publish a fairness-related dataset from TikTok, and the first to provide reliable estimates of fairness metrics in the context of large-scale recommendation systems with missing labels.
Information Retrieval
What problem does this paper attempt to address?
The paper addresses how to measure fairness in large-scale recommendation systems when most labels are missing, that is, when users' true preferences for most items are unknown. Because the item catalog is so large, accurate preference information cannot be collected for every user-item pair, and ground truth typically exists only for items that were previously recommended. Traditional methods often treat the missing labels as negative samples, which can significantly bias fairness metrics. To overcome this limitation, the authors propose estimating fairness metrics from random traffic data, and they demonstrate that this approach is both effective and necessary.

The main contributions include:

1. **Identifying issues in existing literature**: Demonstrating that random traffic data is necessary for accurately estimating label-dependent fairness metrics when preference labels are missing in large-scale recommendation datasets, and showing the shortcomings of the simplified methods used in previous literature.
2. **Providing efficient algorithmic tools**: Proposing effective and practical estimation algorithms for monitoring fairness metrics, with theoretical guarantees on error bounds.
3. **Empirical studies**: Validating the theoretical results on synthetic data and real-world datasets from TikTok, demonstrating the effectiveness and efficiency of the proposed method.

Additionally, the paper releases a real-world dataset of short video recommendations from TikTok, which is the first TikTok dataset released for fairness research. Analysis of this dataset verifies the necessity of random traffic as well as the correctness and computational advantages of the proposed estimation methods.
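
A toy simulation can make the bias from treating missing labels as negatives concrete. The sketch below is not the paper's estimator or data. It assumes a hypothetical two-group setup, a made-up scoring rule, and one simple label-dependent metric: the gap between groups in the probability of being recommended given a positive label. The names `simulate`, `tpr_gap`, `random_frac`, and all numeric constants are illustrative assumptions.

```python
import random


def tpr_gap(rows, label_key):
    """Gap in P(recommended | label = 1) between group 0 and group 1."""
    rates = []
    for g in (0, 1):
        positives = [r for r in rows if r["group"] == g and r[label_key] == 1]
        rates.append(sum(r["rec"] for r in positives) / len(positives))
    return rates[0] - rates[1]


def simulate(n=200_000, random_frac=0.05, seed=0):
    """Generate hypothetical user-item pairs (all constants are assumptions)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        group = rng.randrange(2)
        label = int(rng.random() < 0.3)          # true preference, mostly unobserved
        # Assumed model score: informative about the label, lower for group 1.
        score = rng.random() + 0.4 * label - 0.2 * group
        rec = int(score > 0.8)                   # model would recommend this pair
        in_random = rng.random() < random_frac   # pair served via random traffic
        rows.append({
            "group": group,
            "label": label,
            "rec": rec,
            "in_random": in_random,
            # Naive logging: a label is only seen if the pair was recommended;
            # everything else is treated as a negative.
            "naive_label": label if rec else 0,
        })
    return rows


rows = simulate()
true_gap = tpr_gap(rows, "label")                # needs labels nobody has in practice
naive_gap = tpr_gap(rows, "naive_label")         # missing labels treated as negative
random_gap = tpr_gap([r for r in rows if r["in_random"]], "label")

print(f"true gap:           {true_gap:.3f}")
print(f"naive gap:          {naive_gap:.3f}")
print(f"random-traffic gap: {random_gap:.3f}")
```

Under the naive labeling, every observed positive was recommended by construction, so the metric is 1 for both groups and the measured gap collapses to zero regardless of the true disparity. The random-traffic slice observes labels independently of the model's score, so the same metric computed on that slice tracks the true gap up to sampling noise.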