Turning Dross Into Gold Loss: is BERT4Rec really better than SASRec?

Anton Klenitskiy, Alexey Vasilev
DOI: https://doi.org/10.1145/3604915.3610644
2023-09-14
Abstract: Recently sequential recommendations and next-item prediction task has become increasingly popular in the field of recommender systems. Currently, two state-of-the-art baselines are Transformer-based models SASRec and BERT4Rec. Over the past few years, there have been quite a few publications comparing these two algorithms and proposing new state-of-the-art models. In most of the publications, BERT4Rec achieves better performance than SASRec. But BERT4Rec uses cross-entropy over softmax for all items, while SASRec uses negative sampling and calculates binary cross-entropy loss for one positive and one negative item. In our work, we show that if both models are trained with the same loss, which is used by BERT4Rec, then SASRec will significantly outperform BERT4Rec both in terms of quality and training speed. In addition, we show that SASRec could be effectively trained with negative sampling and still outperform BERT4Rec, but the number of negative examples should be much larger than one.
Information Retrieval, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper evaluates and compares two Transformer-based sequential recommendation models, SASRec and BERT4Rec, under different loss functions. Specifically, the authors focus on the following points:

1. **Influence of Loss Functions**:
   - BERT4Rec uses cross-entropy loss over a softmax computed across all items, while SASRec usually uses negative sampling and computes binary cross-entropy loss for only one positive sample and one negative sample.
   - The authors study how the two models differ in performance when both use the same loss function (i.e., the cross-entropy loss used by BERT4Rec).
2. **Model Performance Comparison**:
   - The paper shows that if SASRec is trained with the same loss function as BERT4Rec, it is not only significantly better than BERT4Rec in quality but also faster to train.
   - The authors also explore whether SASRec can be trained effectively with negative sampling and cross-entropy loss, and point out that in this case the number of negative samples needs to be much greater than 1.
3. **Training Efficiency**:
   - The authors find that BERT4Rec requires more training time and more iterations to reach an acceptable performance level, while SASRec converges faster.

### Main Conclusions

Through experiments, the authors draw the following conclusions:

- SASRec trained with full cross-entropy loss (or sampled cross-entropy loss with a large number of negative samples), denoted as SASRec+, outperforms BERT4Rec on multiple datasets.
- SASRec+ is not only superior to BERT4Rec in quality but also faster to train.
- For some datasets, even the classic GRU4Rec model can compete with BERT4Rec, indicating that unidirectional causal modeling is more suitable for sequential recommendation tasks than the bidirectional masking method.
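The contrast between unidirectional causal modeling and bidirectional masking can be illustrated with a minimal sketch of how training targets might be built from one interaction sequence. This is not the authors' code: the `MASK` id, the function names, the item ids, and the masking probability are all assumptions made for illustration.

```python
import random

MASK = 0  # hypothetical id reserved for the mask token (assumption)


def causal_examples(seq):
    """SASRec-style targets: every position predicts the item that follows it."""
    inputs = seq[:-1]   # all items except the last
    targets = seq[1:]   # the same sequence shifted left by one step
    return inputs, targets


def masked_examples(seq, mask_prob=0.2, rng=None):
    """BERT4Rec-style targets: hide random positions and predict only those."""
    rng = rng or random.Random(0)
    inputs, targets = [], {}
    for t, item in enumerate(seq):
        if rng.random() < mask_prob:
            inputs.append(MASK)   # the model sees the mask token here
            targets[t] = item     # and must recover the hidden item
        else:
            inputs.append(item)
    return inputs, targets


seq = [11, 42, 7, 93, 5]  # hypothetical item ids for one user
causal_inputs, causal_targets = causal_examples(seq)
masked_inputs, masked_targets = masked_examples(
    seq, mask_prob=0.4, rng=random.Random(1)
)
```

In this sketch the causal setup yields a training signal at every position but the first, whereas the masked setup yields one only at the masked positions. That difference is one plausible reading of why BERT4Rec needs more training iterations.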
### Formula Summary

- **Original Loss Function of SASRec** (Binary Cross-Entropy Loss):
  \[ L_{\text{BCE}} = -\sum_{u \in U} \sum_{t = 1}^{n_u} \left[ \log\left(\sigma\left(r_t^{(u), i_t}\right)\right) + \log\left(1 - \sigma\left(r_t^{(u), -}\right)\right) \right] \]
  where $\sigma()$ is the sigmoid function, $r_t^{(u), i_t}$ is the predicted relevance score, $i_t$ is the true positive sample, and $-$ represents the negative sample.
- **Loss Function of BERT4Rec** (Cross-Entropy Loss):
  \[ L_{\text{CE}} = -\sum_{u \in U} \sum_{t \in T_u} \log \frac{\exp\left(r_t^{(u), i_t}\right)}{\sum_{i \in I} \exp\left(r_t^{(u), i}\right)} \]
  where $T_u$ is the set of time steps containing masked items, and $I$ is the set of all items.
- **Cross-Entropy Loss with Negative Sampling**:
  \[ L_{\text{CE-sample}} = -\sum_{u \in U} \sum_{t = 1}^{n_u} \log \frac{\exp\left(r_t^{(u), i_t}\right)}{\exp\left(r_t^{(u), i_t}\right) + \sum_{i \in I_{-}(u)^N} \exp\left(r_t^{(u), i}\right)} \]
  where $I_{-}(u)^N$ is the set of $N$ negative samples drawn from items that the user has not interacted with.

These results indicate that choosing an appropriate loss function is crucial for model performance, especially in sequential recommendation tasks.
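To make the three losses concrete, here is a minimal pure-Python sketch of each one for a single user and time step, i.e. one term of the sums over $u$ and $t$. The function names and the example scores are hypothetical, and a real implementation would use a tensor library and batch over users and positions.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def log_sum_exp(values):
    """Stable log(sum(exp(v))) via the max-shift trick."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def bce_loss(pos_score, neg_score):
    """One positive and one negative; uses 1 - sigmoid(x) = sigmoid(-x)."""
    return -(log_sigmoid(pos_score) + log_sigmoid(-neg_score))


def ce_loss(scores, pos_index):
    """Cross-entropy over a softmax across all items."""
    return log_sum_exp(scores) - scores[pos_index]


def sampled_ce_loss(scores, pos_index, neg_indices):
    """Cross-entropy over the positive item plus the sampled negatives only."""
    kept = [scores[pos_index]] + [scores[i] for i in neg_indices]
    return log_sum_exp(kept) - scores[pos_index]


scores = [2.0, 0.5, -1.0, 0.1]  # hypothetical relevance scores for 4 items
pos = 0                         # index of the true next item

full = ce_loss(scores, pos)
one_negative = sampled_ce_loss(scores, pos, [1])
all_negatives = sampled_ce_loss(scores, pos, [1, 2, 3])
```

When the sampled negatives cover every other item, the sampled loss coincides with the full cross-entropy. With fewer negatives the denominator shrinks, so the sampled loss is a lower bound on the full one for the same scores.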