Abstract: With the growing success of multi-modal learning, the robustness of multi-modal models, especially when modalities are missing, is receiving increased attention. However, previous studies in this area often lack theoretical insight, or their methods are tied to specific network architectures or modalities. We model the missing-modality setting from an information-theoretic perspective and show that the performance ceiling in such scenarios can be approached by efficiently utilizing the information inherent in the non-missing modalities. In practice, there are two key aspects: (1) the encoder should extract sufficiently good features from the non-missing modality; (2) the extracted features should be robust enough not to be affected by noise during cross-modal fusion. To this end, we introduce Uni-Modal Ensemble with Missing Modality Adaptation (UME-MMA). UME-MMA employs uni-modal pre-trained weights in the multi-modal model to enhance feature extraction and uses missing-modality data augmentation to better adapt to situations with missing modalities. In addition, UME-MMA is built on a late-fusion learning framework, which allows plug-and-play use of various encoders, making it suitable for a wide range of modalities and enabling seamless integration of large-scale pre-trained encoders to further enhance performance. We demonstrate UME-MMA's effectiveness on audio-visual datasets (e.g., AV-MNIST, Kinetics-Sound, AVE) and vision-language datasets (e.g., MM-IMDB, UPMC Food101).
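
The two ingredients named in the abstract, missing-modality data augmentation and late fusion, can be illustrated with a minimal sketch. The abstract does not specify either procedure, so everything here is an assumption for illustration only: the function names `drop_modalities` and `late_fusion`, the drop probability `p_drop`, the rule that at least one modality is always kept, and the choice of averaging per-modality logits as the fusion step. It is not the authors' implementation.

```python
import random
from typing import Dict, List, Optional

# A sample maps a modality name to its data, or to None when the modality is missing.
Sample = Dict[str, Optional[List[float]]]


def drop_modalities(sample: Sample, p_drop: float, rng: random.Random) -> Sample:
    """Hypothetical missing-modality augmentation.

    Each modality is dropped independently with probability p_drop.
    If every modality would be dropped, one is kept at random so the
    model always sees at least one input (an assumption of this sketch).
    """
    names = list(sample)
    kept = [name for name in names if rng.random() >= p_drop]
    if not kept:
        kept = [rng.choice(names)]
    return {name: (sample[name] if name in kept else None) for name in names}


def late_fusion(logits_per_modality: Sample) -> List[float]:
    """Hypothetical late fusion: average the logits of the modalities present.

    Each modality is assumed to have its own encoder and classifier head,
    so a missing modality is simply left out of the average.
    """
    present = [l for l in logits_per_modality.values() if l is not None]
    if not present:
        raise ValueError("at least one modality must be present")
    num_classes = len(present[0])
    return [sum(l[c] for l in present) / len(present) for c in range(num_classes)]


if __name__ == "__main__":
    rng = random.Random(0)
    sample = {"audio": [0.2, 0.8], "video": [0.6, 0.4]}
    augmented = drop_modalities(sample, p_drop=0.5, rng=rng)
    print(augmented)
    print(late_fusion(augmented))
```

Because fusion happens only at the logit level in this sketch, any encoder that produces per-class logits could be swapped in for a given modality, which is the plug-and-play property the abstract attributes to the late-fusion design.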

Multimodal Generative Models for Scalable Weakly-Supervised Learning

Multimodal Generative Models for Compositional Representation Learning

Joint Multimodal Learning with Deep Generative Models

Generalizing Multimodal Variational Methods to Sets

MHVAE: a Human-Inspired Deep Hierarchical Generative Model for Multimodal Representation Learning

Learning Multimodal VAEs through Mutual Supervision

Learning multi-modal generative models with permutation-invariant encoders and tighter variational objectives

What Makes for Robust Multi-Modal Models in the Face of Missing Modalities?

Multimodal Weibull Variational Autoencoder for Jointly Modeling Image-Text Data

Learning more expressive joint distributions in multimodal variational methods

Multimodal Variational Autoencoders for Semi-Supervised Learning: In Defense of Product-of-Experts

Multimodal Adversarially Learned Inference with Factorized Discriminators

Improving Bi-directional Generation between Different Modalities with Variational Autoencoders

Multi-Modal Latent Diffusion

Discriminative multimodal learning via conditional priors in generative models

Leveraging hierarchy in multimodal generative models for effective cross-modality inference

Multimodal deep generative adversarial models for scalable doubly semi-supervised learning

Variational methods for Conditional Multimodal Deep Learning

Score-Based Multimodal Autoencoder

Data-Dependent Conditional Priors for Unsupervised Learning of Multimodal Data

Multi-Modal Generative AI: Multi-modal LLM, Diffusion and Beyond