Abstract: Multimodal learning is expected to boost model performance by integrating information from different modalities. However, its potential is not fully exploited because the widely used joint training strategy, which applies a uniform objective to all modalities, leads to imbalanced and under-optimized uni-modal representations. Specifically, we point out that one modality often carries more discriminative information than the others, e.g., the vision of playing football or the sound of blowing wind. Such a modality can dominate the joint training process, leaving the other modalities significantly under-optimized. To alleviate this problem, we first analyze the under-optimization phenomenon in both the feed-forward and the back-propagation stages of optimization. We then propose On-the-fly Prediction Modulation (OPM) and On-the-fly Gradient Modulation (OGM), two strategies that modulate the optimization of each modality by monitoring the discriminative discrepancy between modalities during training. Concretely, OPM weakens the influence of the dominant modality by dropping its feature with a dynamic probability in the feed-forward stage, while OGM mitigates its gradient in the back-propagation stage. In experiments, our methods demonstrate considerable improvement across a variety of multimodal tasks. These simple yet effective strategies enhance performance not only in vanilla and task-oriented multimodal models but also in more complex multimodal tasks, showcasing their effectiveness and flexibility. The source code is available at https://github.com/GeWu-Lab/BML_TPAMI2024.
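The two modulation ideas described in the abstract can be illustrated with a minimal sketch. The abstract gives no formulas, so everything concrete here is an assumption made for illustration: the function names (`dominance`, `opm_drop`, `ogm_scale`), the hyperparameter `alpha`, the use of a ratio of per-modality scores as the "discriminative discrepancy", and the tanh-shaped mapping. It is not the authors' implementation, which is available in the linked repository.

```python
import math
import random


def dominance(score_self, score_other, alpha=0.5):
    """Hypothetical dominance measure in [0, 1).

    score_self / score_other stands in for the discriminative discrepancy
    between two modalities (assumed positive scores, e.g. mean confidence
    on the correct class). Returns 0.0 when this modality is not dominant.
    """
    ratio = score_self / score_other
    if ratio <= 1.0:
        return 0.0
    return math.tanh(alpha * (ratio - 1.0))


def opm_drop(feature, drop_prob, rng):
    """OPM-style step (feed-forward): zero the feature with probability drop_prob."""
    if rng.random() < drop_prob:
        return [0.0] * len(feature)
    return list(feature)


def ogm_scale(grad, dominance_value):
    """OGM-style step (back-propagation): shrink the dominant modality's gradient."""
    return [g * (1.0 - dominance_value) for g in grad]


if __name__ == "__main__":
    rng = random.Random(0)
    # Assumed scores: the visual modality is currently more discriminative.
    d_visual = dominance(score_self=0.8, score_other=0.4)
    d_audio = dominance(score_self=0.4, score_other=0.8)
    print("visual dominance:", d_visual, "audio dominance:", d_audio)
    print("visual feature after OPM:", opm_drop([0.3, -1.2], d_visual, rng))
    print("visual gradient after OGM:", ogm_scale([0.5, -0.5], d_visual))
```

In this sketch the weaker modality always receives a dominance of zero, so its features and gradients pass through unchanged, while the stronger modality is dropped more often and updated more slowly as the gap widens.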

MM-Align: Learning Optimal Transport-based Alignment Dynamics for Fast and Accurate Inference on Missing Modality Sequences

Hierarchical Optimal Transport for Multimodal Distribution Alignment

MissModal: Increasing Robustness to Missing Modality in Multimodal Sentiment Analysis

Rethinking Uncertainly Missing and Ambiguous Visual Modality in Multi-Modal Entity Alignment

Alt-MoE: A Scalable Framework for Bidirectional Multimodal Alignment and Efficient Knowledge Integration

Text-centric Alignment for Multi-Modality Learning

On-the-fly Modulation for Balanced Multimodal Learning

HybridVocab: Towards Multi-Modal Machine Translation Via Multi-Aspect Alignment

Data-efficient Alignment of Multimodal Sequences by Aligning Gradient Updates and Internal Feature Distributions

Rethinking Missing Modality Learning: From a Decoding View

Alternative Telescopic Displacement: An Efficient Multimodal Alignment Method

Multimodal Representation Learning by Alternating Unimodal Adaptation

Maximum Likelihood Estimation for Multimodal Learning with Missing Modality

AlignMamba: Enhancing Multimodal Mamba with Local and Global Cross-modal Alignment

Enhance the Robustness of Text-Centric Multimodal Alignments

Rethinking Missing Modality Learning from a Decoding Perspective

Missing Modality Prediction for Unpaired Multimodal Learning via Joint Embedding of Unimodal Models

Propensity Score Alignment of Unpaired Multimodal Data

What Makes for Robust Multi-Modal Models in the Face of Missing Modalities?

Leveraging Intra-modal and Inter-modal Interaction for Multi-Modal Entity Alignment

Toward Robust Multimodal Learning using Multimodal Foundational Models