Abstract: Recently, sequential recommendation and the next-item prediction task have become increasingly popular in the field of recommender systems. Currently, two state-of-the-art baselines are the Transformer-based models SASRec and BERT4Rec. Over the past few years, quite a few publications have compared these two algorithms and proposed new state-of-the-art models. In most of these publications, BERT4Rec achieves better performance than SASRec. However, BERT4Rec uses cross-entropy over softmax for all items, while SASRec uses negative sampling and calculates binary cross-entropy loss for one positive and one negative item. In our work, we show that if both models are trained with the same loss, the one used by BERT4Rec, then SASRec significantly outperforms BERT4Rec in terms of both quality and training speed. In addition, we show that SASRec can be effectively trained with negative sampling and still outperform BERT4Rec, but the number of negative examples should be much larger than one.

What problem does this paper attempt to address?

The main problem this paper addresses is evaluating and comparing two Transformer-based sequential recommendation models, SASRec and BERT4Rec, under different loss functions. Specifically, the authors focus on the following points:

1. **Influence of Loss Functions**:
   - BERT4Rec uses cross-entropy loss over a softmax computed for all items, while SASRec usually uses negative sampling and computes binary cross-entropy loss for only one positive sample and one negative sample.
   - The authors study how the two models differ in performance when both use the same loss function (i.e., the cross-entropy loss used by BERT4Rec).
2. **Model Performance Comparison**:
   - The paper shows that if SASRec is trained with the same loss function as BERT4Rec, it is not only significantly better than BERT4Rec in quality but also faster to train.
   - The authors also explore whether SASRec can be trained effectively with negative sampling and cross-entropy loss, and point out that in this case the number of negative samples needs to be much greater than 1.
3. **Training Efficiency**:
   - The authors find that BERT4Rec requires more training time and iterations to reach an acceptable performance level, while SASRec converges faster.

### Main Conclusions

Through experiments, the authors draw the following conclusions:

- SASRec trained with full cross-entropy loss (or sampled cross-entropy loss with a large number of negative samples), denoted SASRec+, outperforms BERT4Rec on multiple datasets.
- SASRec+ is not only better than BERT4Rec in quality but also faster to train.
- For some datasets, even the classic GRU4Rec model can compete with BERT4Rec, indicating that unidirectional causal modeling is better suited to sequential recommendation tasks than the bidirectional masking approach.
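The contrast between unidirectional causal modeling and bidirectional masking can be illustrated by how each approach builds training targets from one interaction sequence. The following is a minimal sketch, not the code of either model: the function names, the `MASK` token, and the `mask_prob` values are hypothetical choices made for illustration only.

```python
import random

MASK = "[MASK]"  # hypothetical mask token, used here only for illustration


def causal_targets(seq):
    """Causal (SASRec-style) setup: every position predicts the next item."""
    inputs = seq[:-1]   # all items except the last
    targets = seq[1:]   # the same sequence shifted left by one
    return inputs, targets


def masked_targets(seq, mask_prob=0.2, rng=None):
    """Masked (BERT4Rec-style) setup: hide random positions, predict only those.

    mask_prob is an assumed value, not one taken from the paper.
    """
    rng = rng or random.Random(0)
    inputs, targets = [], {}
    for pos, item in enumerate(seq):
        if rng.random() < mask_prob:
            inputs.append(MASK)
            targets[pos] = item  # the loss is computed only at masked positions
        else:
            inputs.append(item)
    return inputs, targets


seq = [10, 11, 12, 13, 14]
print(causal_targets(seq))
print(masked_targets(seq, mask_prob=0.4))
```

In this sketch the causal setup yields a target at every input position, while the masked setup yields targets only at the positions that happened to be masked.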
### Formula Summary

- **Original Loss Function of SASRec** (Binary Cross-Entropy Loss):

  \[ L_{\text{BCE}} = -\sum_{u \in U} \sum_{t = 1}^{n_u} \left[ \log(\sigma(r_t^{(u), i_t})) + \log(1 - \sigma(r_t^{(u), -})) \right] \]

  where $\sigma(\cdot)$ is the sigmoid function, $r_t^{(u), i_t}$ is the predicted relevance score, $i_t$ is the true positive sample, and $-$ represents the negative sample.

- **Loss Function of BERT4Rec** (Cross-Entropy Loss):

  \[ L_{\text{CE}} = -\sum_{u \in U} \sum_{t \in T_u} \log \frac{\exp(r_t^{(u), i_t})}{\sum_{i \in I} \exp(r_t^{(u), i})} \]

  where $T_u$ is the set of time steps containing masked items, and $I$ is the set of all items.

- **Cross-Entropy Loss with Negative Sampling**:

  \[ L_{\text{CE-sample}} = -\sum_{u \in U} \sum_{t = 1}^{n_u} \log \frac{\exp(r_t^{(u), i_t})}{\exp(r_t^{(u), i_t}) + \sum_{i \in I_{-}(u)^N} \exp(r_t^{(u), i})} \]

  where $I_{-}(u)^N$ is a set of $N$ negative samples drawn from items that the user has not interacted with.

These results indicate that choosing an appropriate loss function is crucial for model performance, especially in sequential recommendation tasks.
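The three losses above can be sketched for a single user and a single time step, leaving out the outer sums over $u$ and $t$. This is an illustrative pure-Python sketch under that simplification, not the paper's implementation; the function names and the example scores are assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def log_sum_exp(scores):
    """Numerically stable log(sum(exp(s))) over a list of scores."""
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))


def bce_loss(pos_score, neg_score):
    """Binary cross-entropy for one positive and one negative score."""
    return -(math.log(sigmoid(pos_score)) + math.log(1.0 - sigmoid(neg_score)))


def ce_loss(scores, pos_index):
    """Cross-entropy over a softmax across the scores of all items."""
    return log_sum_exp(scores) - scores[pos_index]


def sampled_ce_loss(pos_score, neg_scores):
    """Cross-entropy over a softmax across the positive and N sampled negatives."""
    return log_sum_exp([pos_score] + list(neg_scores)) - pos_score


# Hypothetical relevance scores for a catalogue of four items; item 1 is the positive.
scores = [0.3, 2.0, -1.0, 0.5]
print(bce_loss(scores[1], scores[0]))
print(ce_loss(scores, pos_index=1))
print(sampled_ce_loss(scores[1], [scores[0], scores[3]]))
```

When the sampled negatives cover every item other than the positive, `sampled_ce_loss` reduces to `ce_loss`, which makes the role of the number of negatives $N$ easy to see.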

Turning Dross Into Gold Loss: is BERT4Rec really better than SASRec?
