Multimodal Pre-training Framework for Sequential Recommendation via Contrastive Learning

Lingzi Zhang, Xin Zhou, Zhiwei Zeng, Zhiqi Shen
2024-07-22
Abstract: Current multimodal sequential recommendation models are often unable to effectively explore and capture correlations among behavior sequences of users and items across different modalities, either neglecting correlations among sequence representations or inadequately capturing associations between multimodal data and sequence data in their representations. To address this problem, we explore multimodal pre-training in the context of sequential recommendation, with the aim of enhancing fusion and utilization of multimodal information. We propose a novel Multimodal Pre-training for Sequential Recommendation (MP4SR) framework, which utilizes contrastive losses to capture the correlation among different modality sequences of users, as well as the correlation among different modality sequences of users and items. MP4SR consists of three key components: 1) multimodal feature extraction, 2) a backbone network, Multimodal Mixup Sequence Encoder (M2SE), and 3) pre-training tasks. After utilizing pre-trained encoders to generate initial multimodal features of items, M2SE adopts a complementary sequence mixup strategy to fuse different modality sequences, and leverages contrastive learning to capture modality interactions at the sequence-to-sequence and sequence-to-item levels. Extensive experiments on four real-world datasets demonstrate that MP4SR outperforms state-of-the-art approaches in both normal and cold-start settings. We further highlight the efficacy of incorporating multimodal pre-training in sequential recommendation representation learning, serving as an effective regularizer and optimizing the parameter space for the recommendation task.
Information Retrieval, Multimedia
What problem does this paper attempt to address?
The paper addresses data sparsity in sequential recommendation systems. Existing sequential recommendation methods often perform poorly when they rely solely on user behavior data, because that data is sparse. Multimodal content (such as images and text descriptions) has been used to mitigate this issue, but integrating it within a sequential recommendation framework remains challenging: current multimodal sequential recommendation models often fail to effectively explore and capture the correlations between user and item behavior sequences across different modalities.

To address these issues, the authors propose a multimodal pre-training framework called MP4SR (Multimodal Pre-training for Sequential Recommendation), which uses contrastive losses to capture the correlations between user behavior sequences in different modalities and between user and item modality sequences. MP4SR consists of three key components:

1. **Multimodal Feature Extraction**: Extracting initial text and image features of items with pre-trained encoders.
2. **Backbone Network M2SE (Multimodal Mixup Sequence Encoder)**: Adopting a complementary sequence mixup strategy to fuse sequences from different modalities, and using contrastive learning to capture modality interactions at the sequence-to-sequence and sequence-to-item levels.
3. **Pre-training Tasks**: Optimizing model parameters through contrastive learning to improve performance on downstream tasks.

Experimental results on four real-world datasets show that MP4SR outperforms existing state-of-the-art methods in both normal and cold-start settings. The study also shows that incorporating multimodal pre-training into sequential recommendation representation learning acts as an effective regularizer and optimizes the parameter space for the recommendation task.
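To make the two central ideas more concrete, the following is a minimal Python sketch of a complementary sequence mixup followed by an in-batch contrastive (InfoNCE-style) loss at the sequence-to-sequence level. It is an illustration under stated assumptions, not the authors' implementation: the summary above does not specify how the mixup is performed, so the position-wise modality swap here is only one plausible reading, and the function names (`complementary_mixup`, `mean_pool`, `info_nce`), the mean-pooling encoder that stands in for M2SE, the mixup `ratio`, and the `temperature` are all hypothetical. The sequence-to-item contrast is not sketched.

```python
import math
import random


def complementary_mixup(text_seq, image_seq, ratio, rng):
    """Build two mixed sequences that swap modalities at the same positions.

    Assumption: each position is swapped independently with probability
    `ratio`; the paper's exact mixup rule may differ.
    """
    if len(text_seq) != len(image_seq):
        raise ValueError("modality sequences must be aligned item by item")
    mixed_a, mixed_b = [], []
    for text_vec, image_vec in zip(text_seq, image_seq):
        if rng.random() < ratio:
            # Swapped position: A takes the image feature, B the text feature.
            mixed_a.append(image_vec)
            mixed_b.append(text_vec)
        else:
            mixed_a.append(text_vec)
            mixed_b.append(image_vec)
    return mixed_a, mixed_b


def mean_pool(seq):
    """Hypothetical stand-in for a sequence encoder: average the item vectors."""
    dim = len(seq[0])
    return [sum(vec[d] for vec in seq) / len(seq) for d in range(dim)]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def info_nce(views_a, views_b, temperature=0.1):
    """In-batch InfoNCE: views_a[i] and views_b[i] are the positive pair;
    every other entry of views_b serves as a negative for views_a[i]."""
    losses = []
    for i, anchor in enumerate(views_a):
        logits = [cosine(anchor, other) / temperature for other in views_b]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


if __name__ == "__main__":
    rng = random.Random(0)
    batch = []
    for _ in range(4):  # 4 users, 5 items each, 8-dimensional toy features
        text_seq = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(5)]
        image_seq = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(5)]
        batch.append(complementary_mixup(text_seq, image_seq, ratio=0.5, rng=rng))
    views_a = [mean_pool(a) for a, _ in batch]
    views_b = [mean_pool(b) for _, b in batch]
    print("contrastive loss:", info_nce(views_a, views_b))
```

In this sketch the two mixed sequences of the same user form the positive pair, and the other users in the batch act as negatives. In a real pre-training setup the mean-pooling stand-in would be replaced by a trainable sequence encoder, and the loss would be minimized by gradient descent.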