Abstract: Multi-modal data offers a promising way to improve multimedia recommendation models, but it also introduces task-irrelevant noise that can reduce model robustness. In this paper, we propose a robust multi-modal recommendation approach that accounts for different levels of task-irrelevant noise across modalities. We explicitly model the uncertainty of each modality and perform stochastic sampling-based fusion according to each modality's precision, which serves as a measure of uncertainty. The influence of noisy modalities with high uncertainty is removed, which filters out task-irrelevant noise and yields noise-robust multi-modal recommendation. Moreover, the stochastic sampling strategy intrinsically simulates scenarios with absent modalities during multi-modal fusion. It therefore injects additional randomness into the training process, which enables the model to handle missing modalities. Furthermore, the proposed fusion approach combines the noise robustness of the Product-of-Experts (PoE) framework under Gaussian modeling with the flexibility of the Mixture-of-Experts (MoE) technique to represent diverse distributions of latent variables. This combination allows the proposed approach to achieve noise-robust modeling with non-Gaussian variables. Specifically, we derive a solvable evidence lower bound for the proposed variational mixture of stochastic experts (VMoSE) auto-encoder, in which both Gaussian and Student-t distributions are used to model the latent variables. Using an expectation-maximization (EM)-style algorithm, we add constraints that match the similarities between the ID embeddings and the multi-modal joint embeddings for better model optimization. Extensive experiments demonstrate the effectiveness of the proposed method in multi-modal fusion and its robustness to modality noise and missing modalities.
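As a rough illustration of the precision-based fusion idea mentioned in the abstract, the sketch below shows the standard Product-of-Experts identity for diagonal Gaussians: precisions add, and the fused mean is the precision-weighted average of the expert means. This is not the VMoSE model itself. The function names `poe_fuse` and `sample_expert`, the toy inputs, and the rule of picking a modality with probability proportional to its total precision are all assumptions made for illustration only.

```python
import random


def poe_fuse(means, precisions):
    """Fuse diagonal Gaussian experts with the textbook PoE identity.

    Per dimension, the fused precision is the sum of expert precisions and
    the fused mean is the precision-weighted average of expert means.
    """
    dims = len(means[0])
    fused_prec = [sum(p[d] for p in precisions) for d in range(dims)]
    fused_mean = [
        sum(m[d] * p[d] for m, p in zip(means, precisions)) / fused_prec[d]
        for d in range(dims)
    ]
    return fused_mean, fused_prec


def sample_expert(precisions, rng):
    """Hypothetical stochastic selection of one modality.

    Assumption (not taken from the paper): a modality is chosen with
    probability proportional to its total precision, so low-precision
    (high-uncertainty) modalities are picked less often.
    """
    totals = [sum(p) for p in precisions]
    return rng.choices(range(len(precisions)), weights=totals, k=1)[0]


# Toy inputs: two modalities, two latent dimensions (illustrative values).
means = [[1.0, 0.0], [3.0, 4.0]]
precisions = [[4.0, 1.0], [1.0, 1.0]]  # modality 0 is more certain in dim 0

fused_mean, fused_prec = poe_fuse(means, precisions)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
picks = [sample_expert(precisions, rng) for _ in range(1000)]
```

In the first dimension the fused mean lands closer to modality 0, whose precision is higher, and the sampler selects modality 0 more often than modality 1. Dropping a modality at random in this way also mimics a missing-modality scenario during training, which is the intuition the abstract describes.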

Variational Mixture of Stochastic Experts Auto-encoder for Multi-modal Recommendation
