Multimodal Pre-training Framework for Sequential Recommendation via Contrastive Learning

Lingzi Zhang, Xin Zhou, Zhiwei Zeng, Zhiqi Shen

2024-07-22

Abstract: Current multimodal sequential recommendation models are often unable to effectively explore and capture correlations among behavior sequences of users and items across different modalities, either neglecting correlations among sequence representations or inadequately capturing associations between multimodal data and sequence data in their representations. To address this problem, we explore multimodal pre-training in the context of sequential recommendation, with the aim of enhancing fusion and utilization of multimodal information. We propose a novel Multimodal Pre-training for Sequential Recommendation (MP4SR) framework, which utilizes contrastive losses to capture the correlation among different modality sequences of users, as well as the correlation among different modality sequences of users and items. MP4SR consists of three key components: 1) multimodal feature extraction, 2) a backbone network, Multimodal Mixup Sequence Encoder (M2SE), and 3) pre-training tasks. After utilizing pre-trained encoders to generate initial multimodal features of items, M2SE adopts a complementary sequence mixup strategy to fuse different modality sequences, and leverages contrastive learning to capture modality interactions at the sequence-to-sequence and sequence-to-item levels. Extensive experiments on four real-world datasets demonstrate that MP4SR outperforms state-of-the-art approaches in both normal and cold-start settings. We further highlight the efficacy of incorporating multimodal pre-training in sequential recommendation representation learning, serving as an effective regularizer and optimizing the parameter space for the recommendation task.

Information Retrieval, Multimedia

What problem does this paper attempt to address?

The paper aims to address the issue of data sparsity in sequential recommendation systems. Specifically, existing sequential recommendation methods often perform poorly when relying solely on user behavior data due to data sparsity. Although multimodal content (such as images and text descriptions) has been used to mitigate this issue, integrating this information within a sequential recommendation framework remains challenging. Current multimodal sequential recommendation models often fail to effectively explore and capture the correlations between user and item behavior sequences across different modalities.

To address these issues, the authors propose a new multimodal pre-training framework called MP4SR (Multimodal Pre-training for Sequential Recommendation), which utilizes contrastive loss to capture the correlations between user behavior sequences in different modalities and between user and item modality sequences. MP4SR consists of three key components:

1. **Multimodal Feature Extraction**: Extracting initial text and image features from items.
2. **Backbone Network M2SE (Multimodal Mixup Sequence Encoder)**: Adopting a complementary sequence mixup strategy to fuse sequences from different modalities and using contrastive learning to capture interactions between modalities.
3. **Pre-training Tasks**: Optimizing model parameters through contrastive learning to improve performance on downstream tasks.

Experimental results show that MP4SR outperforms existing state-of-the-art methods on four real-world datasets, excelling in both normal and cold-start settings. Additionally, the study demonstrates the effectiveness of incorporating multimodal pre-training into sequential recommendation representation learning, serving as an effective regularizer to optimize the parameter space of the recommendation task.
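The two mechanisms named in the summary, complementary sequence mixup and a contrastive objective between modality sequences, can be illustrated with a small sketch. This is not the authors' implementation: the function names, the per-position swap rule, the mean pooling standing in for the sequence encoder, and the InfoNCE-style loss with a temperature of 0.1 are all assumptions chosen for illustration, and the paper's exact formulation may differ.

```python
import math
import random


def complementary_mixup(text_seq, image_seq, swap_prob=0.5, rng=None):
    """Build two mixed sequences from aligned text and image item features.

    Assumption (hypothetical rule): at each position the two modalities are
    swapped with probability `swap_prob`, so the outputs are complementary:
    wherever one sequence holds the text feature, the other holds the image.
    """
    rng = rng or random.Random(0)
    mixed_a, mixed_b = [], []
    for text_feat, image_feat in zip(text_seq, image_seq):
        if rng.random() < swap_prob:
            mixed_a.append(image_feat)
            mixed_b.append(text_feat)
        else:
            mixed_a.append(text_feat)
            mixed_b.append(image_feat)
    return mixed_a, mixed_b


def mean_pool(seq):
    """Stand-in for a sequence encoder: average the item vectors."""
    n = len(seq)
    return [sum(column) / n for column in zip(*seq)]


def cosine(u, v):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v + 1e-12)


def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE-style loss: anchors[i] should match positives[i];
    every other positives[j] in the batch acts as a negative."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)
```

In a sequence-to-sequence setting under these assumptions, each user's two mixed sequences would be pooled with `mean_pool` (or a real encoder) and passed to `info_nce` as anchor and positive, with other users in the batch serving as negatives. A sequence-to-item variant would instead pair a pooled sequence with the feature of a target item.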

Multi-Modal Contrastive Pre-training for Recommendation

Contrastive Pre-training for Sequential Recommendation

Contrastive Intra- and Inter-Modality Generation for Enhancing Incomplete Multimedia Recommendation

Multimodal Difference Learning for Sequential Recommendation

MISSRec: Pre-training and Transferring Multi-modal Interest-aware Sequence Representation for Recommendation

Multi-level Contrastive Learning Framework for Sequential Recommendation

MMMLP: Multi-modal Multilayer Perceptron for Sequential Recommendations

Temporal Contrastive Pre-Training for Sequential Recommendation

Self-Supervised Multi-Modal Sequential Recommendation

MoCo4SRec: A momentum contrastive learning framework for sequential recommendation

Multimodal Pretraining, Adaptation, and Generation for Recommendation: A Survey

An Empirical Study of Training ID-Agnostic Multi-modal Sequential Recommenders

Multimodal Meta-Learning for Cold-Start Sequential Recommendation

Adaptive Multi-Modalities Fusion in Sequential Recommendation Systems

End-to-end training of Multimodal Model and ranking Model

MCL4SRec: A Sequential Recommendation Model with Multi-level Contrastive Learning

Sequential Recommendation with a Pre-trained Module Learning Multi-modal Information

Modality Matches Modality: Pretraining Modality-Disentangled Item Representations for Recommendation

Prompt-based and Weak-Modality Enhanced Multimodal Recommendation

Multi-modal Mixture of Experts Representation Learning for Sequential Recommendation