Variational Mixture of Stochastic Experts Auto-encoder for Multi-modal Recommendation

Jing Yi, Zhenzhong Chen
DOI: https://doi.org/10.1109/tmm.2024.3384058
IF: 7.3
2024-01-01
IEEE Transactions on Multimedia
Abstract: Multi-modal data presents a promising opportunity for improving multimedia recommendation models, but it also introduces task-irrelevant noise that can reduce model robustness. In this paper, we propose a robust multi-modal recommendation approach that accounts for different levels of task-irrelevant noise across modalities. We explicitly consider the uncertainty associated with each modality and perform stochastic sampling-based fusion according to the precision of each modality, which serves as a measure of its uncertainty. The influence of noisy modalities with high uncertainty is removed, filtering out task-irrelevant noise and thereby achieving noise-robust multi-modal recommendation. Moreover, the stochastic sampling strategy intrinsically simulates scenarios with absent modalities during multi-modal fusion. It therefore incorporates additional randomness into the training process, which enables the model to handle missing modalities. Furthermore, the proposed fusion approach combines the noise robustness of the Product-of-Experts (PoE) framework when modeling with Gaussian distributions with the flexibility of the Mixture-of-Experts (MoE) technique to represent diverse distributions of latent variables. This combination allows the proposed approach to achieve noise-robust modeling with non-Gaussian variables. Specifically, we derive a solvable evidence lower bound for the proposed variational mixture of stochastic experts (VMoSE) auto-encoder, where both Gaussian and Student-t distributions are used to model the latent variables. Constraints are added to match the similarities between the ID embeddings and the multi-modal joint embeddings, using an expectation-maximization (EM)-style algorithm for better model optimization. Extensive experiments demonstrate the effectiveness of the proposed method in multi-modal fusion and its robustness to modality noise and missing modalities.
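As a rough illustration of the precision-based fusion the abstract describes, the Python sketch below fuses diagonal-Gaussian experts with the standard Product-of-Experts rule (the fused precision is the sum of expert precisions, and the fused mean is the precision-weighted average of expert means) and randomly drops experts to mimic absent modalities. It is not the authors' implementation and covers only the Gaussian PoE part, not the MoE component, the Student-t experts, or the evidence lower bound. The function names poe_fuse and sample_keep_mask, the standard-normal prior expert, and the rule for choosing which experts to keep are all assumptions made for illustration; the abstract does not specify them.

```python
import random


def poe_fuse(means, precisions, keep=None):
    """Fuse diagonal-Gaussian experts with a Product-of-Experts rule.

    means, precisions: one list of floats per modality, all of equal length.
    keep: optional list of booleans selecting which experts take part.
    Assumption: a standard-normal prior expert (mean 0, precision 1) is
    always included, so the result is defined even if every modality is dropped.
    """
    dim = len(means[0])
    if keep is None:
        keep = [True] * len(means)
    fused_prec = [1.0] * dim      # prior expert contributes precision 1
    weighted_sum = [0.0] * dim    # prior expert has mean 0, so it adds nothing here
    for mu, tau, used in zip(means, precisions, keep):
        if not used:
            continue
        for d in range(dim):
            fused_prec[d] += tau[d]
            weighted_sum[d] += tau[d] * mu[d]
    fused_mean = [s / p for s, p in zip(weighted_sum, fused_prec)]
    return fused_mean, fused_prec


def sample_keep_mask(precisions, rng):
    """Hypothetical sampling rule (not taken from the paper): keep expert i
    with probability equal to its average precision divided by the largest
    average precision, so low-precision (noisy) experts are dropped more often."""
    avg = [sum(t) / len(t) for t in precisions]
    top = max(avg)
    return [rng.random() < a / top for a in avg]


# Example: a confident modality and a noisy one, each giving a 2-d estimate.
means = [[1.0, 2.0], [5.0, -3.0]]
precisions = [[4.0, 4.0], [0.1, 0.1]]
rng = random.Random(0)
mask = sample_keep_mask(precisions, rng)
fused_mean, fused_prec = poe_fuse(means, precisions, mask)
```

Under this sampling rule the most precise expert is always kept, while a noisy expert is usually excluded; even when it is included, its small precision gives it little weight in the fused mean.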