Abstract: Multi-modal recommendation systems, which integrate diverse types of information, have gained widespread attention in recent years. However, compared to traditional collaborative filtering-based multi-modal recommendation systems, research on multi-modal sequential recommendation is still in its nascent stages. Unlike traditional sequential recommendation models, which rely solely on item identifier (ID) information and focus on network structure design, multi-modal recommendation models need to emphasize item representation learning and the fusion of heterogeneous data sources. This paper investigates the impact of item representation learning on downstream recommendation tasks and examines how information fusion differs across stages. Empirical experiments demonstrate the need for a framework suited to collaborative learning and fusion of diverse information. Based on this, we propose a new model-agnostic framework for multi-modal sequential recommendation, called Online Distillation-enhanced Multi-modal Transformer (ODMT), which enhances feature interaction and mutual learning among multi-source inputs (ID, text, and image) while avoiding conflicts among different features during training, thereby improving recommendation accuracy. Specifically, we first introduce an ID-aware Multi-modal Transformer module in the item representation learning stage to facilitate information interaction among different features. Second, we employ an online distillation training strategy in the prediction optimization stage so that the multi-source data learn from each other and prediction robustness improves. Experimental results on a streaming media recommendation dataset and three e-commerce recommendation datasets demonstrate the effectiveness of the two proposed modules, which yield an improvement of approximately 10% over baseline models.
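
The online distillation idea mentioned in the abstract can be illustrated with a small sketch. This is not the ODMT implementation: it assumes a generic mutual-learning formulation in which each feature branch (ID, text, image) is supervised by the ground-truth next item and is also pulled toward the averaged, temperature-softened predictions of the other branches. The function names, the `temperature` and `alpha` parameters, and the toy scores are all hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log-likelihood of the target item index."""
    return -math.log(softmax(logits)[target])

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def online_distillation_loss(branch_logits, target, temperature=2.0, alpha=0.5):
    """Hypothetical mutual-learning loss over several prediction branches.

    branch_logits: dict mapping branch name -> list of item scores.
    Each branch gets a supervised term plus a KL term toward the mean
    softened distribution of the other branches (assumed formulation).
    """
    soft = {name: softmax(l, temperature) for name, l in branch_logits.items()}
    total = 0.0
    for name, logits in branch_logits.items():
        peers = [soft[other] for other in soft if other != name]
        # Peer "teacher": element-wise mean of the other branches' distributions.
        teacher = [sum(col) / len(peers) for col in zip(*peers)]
        total += cross_entropy(logits, target)
        total += alpha * kl_divergence(teacher, soft[name])
    return total

# Toy scores over four candidate items from three feature branches.
logits = {
    "id": [2.0, 0.5, 0.1, -1.0],
    "text": [1.5, 1.0, 0.0, -0.5],
    "image": [0.2, 1.8, 0.3, -0.2],
}
loss = online_distillation_loss(logits, target=0)
```

When all branches already agree, the KL terms vanish and the loss reduces to the sum of the per-branch cross-entropies. In a gradient-based setup the peer distribution would typically be treated as a constant (no gradient flows through it), which this plain-Python sketch does not model.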

$M^3$-IB: A Memory-Augment Multi-modal Information Bottleneck Model for Next-Item Recommendation

Sequential Modeling of Hierarchical User Intention and Preference for Next-item Recommendation

Contrastive Intra- and Inter-Modality Generation for Enhancing Incomplete Multimedia Recommendation

Interest-Related Item Similarity Model Based on Multimodal Data for Top-N Recommendation

Multi-modal Recommendation Based on Knowledge Graph

A multi-information enhanced attention network for session-based recommendation

M2: Mixed Models With Preferences, Popularities and Transitions for Next-Basket Recommendation

Memory Augmented Multi-Instance Contrastive Predictive Coding for Sequential Recommendation

Beyond Co-occurrence: Multi-modal Session-based Recommendation

Multimodal Information Bottleneck: Learning Minimal Sufficient Unimodal and Multimodal Representations

Multimodal Difference Learning for Sequential Recommendation

Online Distillation-enhanced Multi-modal Transformer for Sequential Recommendation

Multimodal Conditioned Diffusion Model for Recommendation

Next-item Recommendation with Bidirectional Encoder Representations from Transformer and Matrix Factorization

Attention-guided Multi-step Fusion: A Hierarchical Fusion Network for Multimodal Recommendation

Multi-Modal Recommendation System with Auxiliary Information

An Empirical Study of Training ID-Agnostic Multi-modal Sequential Recommenders

Learning a Hierarchical Intent Model for Next-Item Recommendation

Multimodal Interactive Network for Sequential Recommendation

Item-Based Collaborative Memory Networks for Recommendation