A Theoretical Analysis of Recommendation Loss Functions under Negative Sampling

Giulia Di Teodoro, Federico Siciliano, Nicola Tonellotto, Fabrizio Silvestri
2024-11-12
Abstract: Recommender Systems (RSs) are pivotal in diverse domains such as e-commerce, music streaming, and social media. This paper conducts a comparative analysis of prevalent loss functions in RSs: Binary Cross-Entropy (BCE), Categorical Cross-Entropy (CCE), and Bayesian Personalized Ranking (BPR). Exploring the behaviour of these loss functions across varying negative sampling settings, we reveal that BPR and CCE are equivalent when one negative sample is used. Additionally, we demonstrate that all losses share a common global minimum. Evaluation of RSs mainly relies on ranking metrics known as Normalized Discounted Cumulative Gain (NDCG) and Mean Reciprocal Rank (MRR). We produce bounds of the different losses for negative sampling settings to establish a probabilistic lower bound for NDCG. We show that the BPR bound on NDCG is weaker than that of BCE, contradicting the common assumption that BPR is superior to BCE in RSs training. Experiments on five datasets and four models empirically support these theoretical findings. Our code is available at https://anonymous.4open.science/r/recsys_losses.
Information Retrieval
What problem does this paper attempt to address?
This paper addresses how different loss functions in recommender systems (RSs) behave and are optimized under negative sampling. Specifically, the authors conduct a theoretical analysis of three commonly used loss functions, Binary Cross-Entropy (BCE), Categorical Cross-Entropy (CCE), and Bayesian Personalized Ranking (BPR), explore their behaviors under different negative sampling settings, and establish the following key results:

1. **Equivalence of loss functions**: When only one negative sample is used, BPR and CCE are equivalent. In addition, all three loss functions share the same global minimum under certain conditions.
2. **Relationship between loss functions and ranking metrics**: By relating the loss functions to ranking metrics such as Normalized Discounted Cumulative Gain (NDCG) and Mean Reciprocal Rank (MRR), the authors show that minimizing these losses maximizes a lower bound on NDCG or MRR. Under negative sampling, however, this bound holds only probabilistically.
3. **Comparison of bounds across loss functions**: The authors derive a probabilistic lower bound for each loss function and, by comparing these bounds, find that BCE is more conducive to improving NDCG than BPR and CCE in some cases. Specifically, in extreme cases, the NDCG lower bound is weakest for CCE, followed by BPR, and strongest for BCE.
4. **Experimental verification**: Experiments on five datasets and four models show that the theoretical analysis is consistent with actual training results; in particular, in the later stages of training, the loss functions optimize a meaningful NDCG lower bound.
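To make the link between a loss and the ranking metrics concrete, here is a minimal Python sketch for a single user with one positive item. It uses an elementary bound, `exp(-CCE loss) <= MRR <= NDCG`, which follows from `1[x > 0] <= exp(x)` and is not necessarily the bound derived in the paper. The scores, the sampled subset, and all function names are hypothetical and chosen only for illustration.

```python
import math

def rank_of_positive(pos_score, neg_scores):
    # 1-based rank: one plus the number of negatives scored strictly higher.
    return 1 + sum(1 for s in neg_scores if s > pos_score)

def ndcg(rank):
    # NDCG when there is a single relevant item at position `rank`.
    return 1.0 / math.log2(1 + rank)

def mrr(rank):
    # Reciprocal rank of the single relevant item.
    return 1.0 / rank

def cce_loss(pos_score, neg_scores):
    # Negative log-softmax probability of the positive item.
    return math.log(1.0 + sum(math.exp(s - pos_score) for s in neg_scores))

pos_score = 1.0
all_negatives = [2.0, 0.5, -1.0, 0.0]  # hypothetical scores of every negative item
sampled_negatives = [0.5, -1.0]        # hypothetical sample missing the hardest negative

full_rank = rank_of_positive(pos_score, all_negatives)
full_bound = math.exp(-cce_loss(pos_score, all_negatives))
sampled_bound = math.exp(-cce_loss(pos_score, sampled_negatives))

# With all negatives, the loss yields a guaranteed lower bound on both metrics.
print(full_bound <= mrr(full_rank) <= ndcg(full_rank))

# With sampled negatives, the same quantity can exceed the true MRR,
# so any guarantee derived from the sampled loss can only be probabilistic.
print(sampled_bound > mrr(full_rank))
```

In this example the sample omits the only negative that outranks the positive item, so the loss computed on the sample is too optimistic about the true rank.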
### Formula summary

Here \( s_{u,i} \) denotes the model's score for user \( u \) and item \( i \), \( i^+ \) the positive item, \( I^-_u \) the set of sampled negative items for user \( u \), \( \sigma \) the sigmoid function, and \( r^+ \) the rank of the positive item.

- **BCE loss function**:
  \[ L_{\text{BCE}} = -\sum_{u = 1}^U \ell_u^{\text{BCE}} \]
  \[ \ell_u^{\text{BCE}} = \log \sigma(s_{u, i^+})+\sum_{i \in I^-_u} \log (1 - \sigma(s_{u, i})) \]
- **CCE loss function**:
  \[ L_{\text{CCE}} = -\sum_{u = 1}^U \ell_u^{\text{CCE}} \]
  \[ \ell_u^{\text{CCE}} = \log \left(\frac{e^{s_{u, i^+}}}{e^{s_{u, i^+}}+\sum_{i \in I^-_u} e^{s_{u, i}}}\right) \]
- **BPR loss function**:
  \[ L_{\text{BPR}} = -\sum_{u = 1}^U \ell_u^{\text{BPR}} \]
  \[ \ell_u^{\text{BPR}} = \sum_{i \in I^-_u} \log \sigma(s_{u, i^+}-s_{u, i}) \]
- **NDCG metric**:
  \[ \text{NDCG}(r^+)=\frac{1}{\log_2(1 + r^+)} \]
- **MRR metric**:
  \[ \text{MRR}(r^+)=\frac{1}{r^+} \]

### Conclusion

Through theoretical analysis and experiments, this paper characterizes how BCE, CCE, and BPR differ in recommender systems under negative sampling. In particular, it shows that in some cases BCE may be more suitable than BPR and CCE for improving NDCG, which provides a theoretical basis for choosing a loss function.
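The per-user loss terms in the formula summary can be sketched in plain Python as follows. This is an illustrative, unoptimized transcription for a single user (not the authors' released code), and the function names and example scores are assumptions. It also checks numerically that BPR and CCE coincide when exactly one negative is sampled, since \( e^{s^+} / (e^{s^+} + e^{s^-}) = \sigma(s^+ - s^-) \).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_loss(pos_score, neg_scores):
    # -[log sigma(s+) + sum over negatives of log(1 - sigma(s-))]
    neg_term = sum(math.log(1.0 - sigmoid(s)) for s in neg_scores)
    return -(math.log(sigmoid(pos_score)) + neg_term)

def cce_loss(pos_score, neg_scores):
    # -log of the softmax probability assigned to the positive item
    denom = math.exp(pos_score) + sum(math.exp(s) for s in neg_scores)
    return -math.log(math.exp(pos_score) / denom)

def bpr_loss(pos_score, neg_scores):
    # -sum over negatives of log sigma(s+ - s-)
    return -sum(math.log(sigmoid(pos_score - s)) for s in neg_scores)

# Hypothetical scores for one user.
pos_score = 1.3
one_negative = [0.4]
two_negatives = [0.4, -0.7]

# With a single negative, the BPR and CCE terms are the same number.
print(math.isclose(bpr_loss(pos_score, one_negative),
                   cce_loss(pos_score, one_negative)))

# With more negatives they generally differ.
print(math.isclose(bpr_loss(pos_score, two_negatives),
                   cce_loss(pos_score, two_negatives)))
```

A practical implementation would compute these terms on batched tensors with numerically stable primitives (for example a log-sum-exp), but the scalar form keeps the correspondence with the formulas easy to see.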