Abstract: Multi-modal models have shown a promising ability to integrate information from various sources, yet they are also vulnerable to pervasive perturbations such as uni-modal attacks and missing modalities. To counter these perturbations, robust multi-modal representations, positioned well away from the discriminative multi-modal decision boundary, are highly desirable. In this paper, unlike conventional empirical studies, we focus on a commonly used joint multi-modal framework and show theoretically that larger uni-modal representation margins and more reliable integration of modalities are essential components for achieving higher robustness. This finding further explains the limitation of multi-modal robustness and the phenomenon that multi-modal models are often vulnerable to attacks on a specific modality. Moreover, our analysis reveals how a widespread issue, namely that the model has different preferences for different modalities, limits multi-modal robustness by influencing these essential components and can make attacks on a specific modality highly effective. Inspired by our theoretical findings, we introduce a training procedure called Certifiable Robust Multi-modal Training (CRMT), which alleviates the influence of modality preference and explicitly regulates the essential components to significantly improve robustness in a certifiable manner. Our method demonstrates substantial improvements in performance and robustness compared with existing methods. Furthermore, our training procedure can be easily extended to enhance other robust training strategies, highlighting its credibility and flexibility.

Quantifying and Enhancing Multi-modal Robustness with Modality Preference
