Abstract: Multimedia content is predominant in the modern Web era. Recent years have witnessed growing research interest in multimedia recommendation, which aims to predict whether a user will interact with an item that has multimodal content. Most previous studies focus on modeling user-item interactions with multimodal features included as side information. However, this scheme is not well designed for multimedia recommendation. First, only collaborative item-item relationships are modeled, and only implicitly, through high-order item-user-item co-occurrences. Considering that items are associated with rich content in multiple modalities, we argue that the latent semantic item-item structures underlying this multimodal content could be beneficial for learning better item representations and could help recommender models comprehensively discover candidate items. Second, although previous studies consider multiple modalities, fusing them by linear combination or concatenation is insufficient to fully capture the content information of items and the relationships among items. To address these deficiencies, we propose a latent structure MIning with ContRastive mOdality fusion model, which we term MICRO for brevity. Specifically, we devise a novel modality-aware structure learning module, which learns item-item relationships for each modality. Based on the learned modality-aware latent item relationships, we perform graph convolutions to explicitly inject item affinities into modality-aware item representations. Additionally, we design a novel multimodal contrastive framework to facilitate item-level multimodal fusion by mining both modality-shared and modality-specific information. Finally, the item representations are plugged into existing collaborative filtering methods to make accurate recommendations. Extensive experiments on three real-world datasets demonstrate the superiority of our method over state-of-the-art methods and justify the design choices of our work.
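
The idea of learning an item-item structure per modality and then injecting item affinities through graph convolution can be illustrated with a small sketch. The abstract does not say how the modality-aware graph is constructed, so the choices here are assumptions made only for illustration: cosine similarity over raw modality features, k-nearest-neighbor sparsification, row normalization, and a single propagation step. The function names `knn_item_graph` and `propagate` are hypothetical, and the learned components and the contrastive fusion step are not covered.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def knn_item_graph(features, k):
    """Build a row-normalized kNN item-item graph for one modality.

    Assumption: affinity is cosine similarity between raw modality features,
    and each item keeps only its k most similar neighbors.
    """
    n = len(features)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Similarity of item i to every other item (self-loops excluded).
        sims = [(cosine(features[i], features[j]), j) for j in range(n) if j != i]
        sims.sort(reverse=True)
        # Keep the top-k neighbors and clamp negative similarities to zero.
        kept = [(max(s, 0.0), j) for s, j in sims[:k]]
        total = sum(s for s, _ in kept)
        for s, j in kept:
            adj[i][j] = s / total if total > 0 else 0.0
    return adj


def propagate(adj, embeddings):
    """One graph-convolution step: each item aggregates its neighbors' embeddings."""
    n, dim = len(adj), len(embeddings[0])
    return [
        [sum(adj[i][j] * embeddings[j][d] for j in range(n)) for d in range(dim)]
        for i in range(n)
    ]


# Toy data (made up): four items with 2-d "visual" features and 2-d ID embeddings.
visual = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
item_emb = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

graph = knn_item_graph(visual, k=1)
smoothed = propagate(graph, item_emb)
```

In this toy setup, items 0 and 1 become each other's only neighbor, as do items 2 and 3, so each item's propagated embedding is simply its neighbor's embedding. A full model would presumably build one such graph per modality and combine the results, which this sketch does not attempt.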

Multi-modal Graph and Sequence Fusion Learning for Recommendation

Multi-modal Recommendation Based on Knowledge Graph

Attention-guided Multi-step Fusion: A Hierarchical Fusion Network for Multimodal Recommendation

Multi-View Graph Convolutional Network for Multimedia Recommendation

MM-GEF: Multi-modal representation meet collaborative filtering

Graph Neural Networks with Deep Mutual Learning for Designing Multi-modal Recommendation Systems

Multimodal Difference Learning for Sequential Recommendation

Multi-feature fused collaborative attention network for sequential recommendation with semantic-enriched contrastive learning

MMMLP: Multi-modal Multilayer Perceptron for Sequential Recommendations

MM-FRec: Multi-Modal Enhanced Fashion Item Recommendation

Dual-view multi-modal contrastive learning for graph-based recommender systems

CMBF: Cross-Modal-Based Fusion Recommendation Algorithm

Multimodal Graph Contrastive Learning for Multimedia-Based Recommendation

Multi-Graph Heterogeneous Interaction Fusion for Social Recommendation

Graph Heterogeneous Multi-Relational Recommendation

Latent Structure Mining With Contrastive Modality Fusion for Multimedia Recommendation

Adaptive Fusion of Multi-View for Graph Contrastive Recommendation

Self-Supervised Multi-Modal Sequential Recommendation

Multi-dimensional Shared Representation Learning with Graph Fusion Network for Session-based Recommendation

MMGCN: Multi-modal Graph Convolution Network for Personalized Recommendation of Micro-video

GCN recommendation model based on the fusion of dynamic multiple-view latent interest topics