Evaluating Performance and Bias of Negative Sampling in Large-Scale Sequential Recommendation Models

Arushi Prakash, Dimitrios Bermperidis, Srivas Chennu
2024-10-29
Abstract: Large-scale industrial recommendation models predict the most relevant items from catalogs containing millions or billions of options. To train these models efficiently, a small set of irrelevant items (negative samples) is selected from the vast catalog for each relevant item (positive example), helping the model distinguish between relevant and irrelevant items. Choosing the right negative sampling method is a common challenge. We address this by implementing and comparing various negative sampling methods - random, popularity-based, in-batch, mixed, adaptive, and adaptive with mixed variants - on modern sequential recommendation models. Our experiments, including hyperparameter optimization and 20x repeats on three benchmark datasets with varying popularity biases, show how the choice of method and dataset characteristics impact key model performance metrics. We also reveal that average performance metrics often hide imbalances across popularity bands (head, mid, tail). We find that commonly used random negative sampling reinforces popularity bias and performs best for head items. Popularity-based methods (in-batch and global popularity negative sampling) can offer balanced performance at the cost of lower overall model performance results. Our study serves as a practical guide to the trade-offs in selecting a negative sampling method for large-scale sequential recommendation models. Code, datasets, experimental results and hyperparameters are available at: https://github.com/apple/ml-negative-sampling
Information Retrieval, Machine Learning
### What problem does this paper attempt to address?

This paper evaluates how the choice of negative sampling method affects performance and popularity bias in large-scale sequential recommendation models. It addresses four questions:

1. **The impact of the negative sampling method on model performance**: Large-scale industrial recommendation systems predict the most relevant items from catalogs containing millions or billions of options. To train these models efficiently, a small set of irrelevant items (negative samples) is selected for each relevant item (positive example). The sampling method affects how well the model learns to distinguish relevant from irrelevant items.
2. **Differences across methods and datasets**: The paper implements and compares several negative sampling methods (random, popularity-based, in-batch, mixed, adaptive, and adaptive with mixed variants) on modern sequential recommendation models. The experiments cover three benchmark datasets with varying popularity biases.
3. **The impact of popularity bias on model performance**: Average performance metrics often hide imbalances across popularity bands (head, mid, tail). The commonly used random negative sampling reinforces popularity bias: the model predicts head items well but mid and tail items poorly. Popularity-based methods (in-batch and global popularity negative sampling) can offer more balanced performance, at the cost of lower overall model performance.
4. **How to select an appropriate negative sampling method**: The paper serves as a practical guide to the trade-offs involved in selecting a negative sampling method.
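The sampling strategies named in the list above can be illustrated with a minimal sketch. It is not the paper's implementation (that lives in the linked repository). All function names are hypothetical, and the definition of "mixed" as in-batch plus uniform random negatives is an assumption, since the summary does not spell it out. Adaptive sampling depends on the current model's scores and is not sketched here.

```python
import random


def random_negatives(catalog, positive, k, rng):
    """Uniform sampling without replacement, excluding the positive item."""
    candidates = [item for item in catalog if item != positive]
    return rng.sample(candidates, k)


def popularity_negatives(item_counts, positive, k, rng):
    """Sample (with replacement) proportionally to each item's interaction count."""
    items = [item for item in item_counts if item != positive]
    weights = [item_counts[item] for item in items]
    return rng.choices(items, weights=weights, k=k)


def in_batch_negatives(batch_positives, index):
    """Reuse the other examples' positives in the same batch as negatives."""
    own = batch_positives[index]
    return [
        item
        for j, item in enumerate(batch_positives)
        if j != index and item != own
    ]


def mixed_negatives(catalog, batch_positives, index, k_random, rng):
    """ASSUMPTION: 'mixed' = in-batch negatives plus k_random uniform negatives."""
    own = batch_positives[index]
    return in_batch_negatives(batch_positives, index) + random_negatives(
        catalog, own, k_random, rng
    )
```

In-batch negatives are drawn from other users' positives, so popular items appear as negatives more often. That is why the summary groups in-batch sampling with the popularity-based methods.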
Through a detailed comparison, the paper describes the advantages and limitations of each method, helping practitioners select a negative sampling strategy suited to the characteristics of their dataset and model design.

### Formula summary

- **Scoring function**: \[ s(e_t^u, e_i) = (e_t^u)^\top e_i \] where \(e_t^u\) is the causal embedding of user \(u\) at time \(t\), and \(e_i\) is the embedding of item \(i\).
- **Interaction probability**: \[ P(i \mid u; t) = \sigma(s(e_t^u, e_i)) = \frac{1}{1 + e^{-s(e_t^u, e_i)}} \]
- **Loss function**: \[ L_u = \sum_{(i^+, t) \in H_{\text{shift}}^u} \left[ -\log \sigma(s(e_t^u, e_{i^+})) - \sum_{i^- \in N_t^u} \log\left(1 - \sigma(s(e_t^u, e_{i^-}))\right) \right] \] where \(N_t^u\) is the set of negative samples drawn for user \(u\) at time \(t\).
- **Hit rate (HR)**: \[ \text{Hit Rate}@k = \frac{1}{|U|} \sum_{u \in U} \mathbb{1}\{i_t^u \in R(u; k)\} \] where \(i_t^u\) is the target item for user \(u\) and \(R(u; k)\) is the set of top-\(k\) recommendations for that user.
- **Hit rate grouped by popularity (HR cohort)**: \[ \text{Hit Rate}_{\text{cohort}}@k = \frac{1}{|C|} \sum_{u \in C} \mathbb{1}\{i_t^u \in R(u; k)\} \] the same quantity computed over a popularity cohort \(C\) (head, mid, or tail).
- **Balance**: \[ \text{Balance} = 1.0 - \text{Gini}([HR_{\text{head}}, HR_{\text{mid}}, HR_{\text{tail}}]) \]

By examining these questions, the paper provides empirical evidence and practical guidance for selecting a negative sampling method in large-scale sequential recommendation models.
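The formulas above can be made concrete with a small pure-Python sketch. It is illustrative only: the function names are hypothetical, embeddings are plain lists of floats, and the Gini coefficient uses the standard mean-absolute-difference definition, which is assumed here and may differ from the exact variant used in the paper. The loss is written naively and is not numerically stable for very large scores.

```python
import math


def dot(a, b):
    """Scoring function s(e_t^u, e_i): inner product of two embeddings."""
    return sum(x * y for x, y in zip(a, b))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def step_loss(user_emb, pos_emb, neg_embs):
    """One (i+, t) term of L_u: binary cross-entropy over 1 positive and N negatives."""
    loss = -math.log(sigmoid(dot(user_emb, pos_emb)))
    for neg_emb in neg_embs:
        loss -= math.log(1.0 - sigmoid(dot(user_emb, neg_emb)))
    return loss


def hit_rate_at_k(targets, rankings, k):
    """targets: {user: target item}; rankings: {user: ranked item list}."""
    hits = sum(1 for user, item in targets.items() if item in rankings[user][:k])
    return hits / len(targets)


def gini(values):
    """ASSUMPTION: Gini as the normalized mean absolute difference."""
    total = sum(values)
    if total == 0:
        return 0.0
    diff_sum = sum(abs(a - b) for a in values for b in values)
    return diff_sum / (2 * len(values) * total)


def balance(hr_head, hr_mid, hr_tail):
    """Balance = 1 - Gini over the three cohort hit rates."""
    return 1.0 - gini([hr_head, hr_mid, hr_tail])
```

The cohort hit rate is `hit_rate_at_k` applied to the subset of `targets` belonging to one popularity cohort. Under the Gini definition assumed here, equal cohort hit rates give a balance of 1.0, and balance falls as the hit rates concentrate in a single cohort.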