Abstract: Multi-modal models have shown a promising capability to integrate information from various sources effectively, yet they have been found vulnerable to pervasive perturbations, such as uni-modal attacks and missing-modality conditions. To counter these perturbations, robust multi-modal representations, positioned well away from the discriminative multi-modal decision boundary, are highly desirable. In this paper, unlike conventional empirical studies, we focus on a commonly used joint multi-modal framework and show theoretically that larger uni-modal representation margins and more reliable integration of modalities are essential components for achieving higher robustness. This finding further explains the limits of multi-modal robustness and the phenomenon that multi-modal models are often vulnerable to attacks on a specific modality. Moreover, our analysis reveals how a widespread issue, namely that a model has different preferences for different modalities, limits multi-modal robustness by influencing these essential components and can make attacks on a specific modality highly effective. Inspired by our theoretical findings, we introduce a training procedure called Certifiable Robust Multi-modal Training (CRMT), which alleviates the influence of modality preference and explicitly regulates the essential components to significantly improve robustness in a certifiable manner. Our method demonstrates substantial improvements in performance and robustness compared with existing methods. Furthermore, our training procedure can be easily extended to enhance other robust training strategies, highlighting its credibility and flexibility.

What Makes for Robust Multi-Modal Models in the Face of Missing Modalities?

Robust Multimodal Learning with Missing Modalities via Parameter-Efficient Adaptation

Modality Invariant Multimodal Learning to Handle Missing Modalities: A Single-Branch Approach

Dealing with All-stage Missing Modality: Towards A Universal Model with Robust Reconstruction and Personalization

Deep Multimodal Learning with Missing Modality: A Survey

Toward Robust Multimodal Learning using Multimodal Foundational Models

Chameleon: Images Are What You Need For Multimodal Learning Robust To Missing Modalities

On Uni-Modal Feature Learning in Supervised Multi-Modal Learning

MMAN-M2: Multiple Multi-head Attentions Network based on Encoder with Missing Modalities

Maximum Likelihood Estimation for Multimodal Learning with Missing Modality

Exploring Missing Modality in Multimodal Egocentric Datasets

Improving Discriminative Multi-Modal Learning with Large-Scale Pre-Trained Models

MMP: Towards Robust Multi-Modal Learning with Masked Modality Projection

MDA: An Interpretable and Scalable Multi-Modal Fusion under Missing Modalities and Intrinsic Noise Conditions

SMIL: Multimodal Learning with Severely Missing Modality

On Robustness in Multimodal Learning

On Uni-modal Feature Learning in Multi-modal Learning

Multimodal Federated Learning with Missing Modality via Prototype Mask and Contrast

Quantifying and Enhancing Multi-modal Robustness with Modality Preference

Model Composition for Multimodal Large Language Models

Are Multimodal Transformers Robust to Missing Modality?