Abstract: Multimodal learning helps to comprehensively understand the world, by integrating different senses. Accordingly, multiple input modalities are expected to boost model performance, but we actually find that they are not fully exploited even when the multimodal model outperforms its uni-modal counterpart. Specifically, in this paper we point out that existing multimodal discriminative models, in which uniform objective is designed for all modalities, could remain under-optimized uni-modal representations, caused by another dominated modality in some scenarios, e.g., sound in blowing wind event, vision in drawing picture event, etc. To alleviate this optimization imbalance, we propose on-the-fly gradient modulation to adaptively control the optimization of each modality, via monitoring the discrepancy of their contribution towards the learning objective. Further, an extra Gaussian noise that changes dynamically is introduced to avoid possible generalization drop caused by gradient modulation. As a result, we achieve considerable improvement over common fusion methods on different multimodal tasks, and this simple strategy can also boost existing multimodal methods, which illustrates its efficacy and versatility. The source code is available at https://github.com/GeWu-Lab/OGM-GE_CVPR2022.

What problem does this paper attempt to address?

The paper addresses imbalanced optimization in multimodal learning. Although a jointly trained multimodal model can outperform its unimodal counterparts, a modality that performs well (such as vision or audio) may dominate the optimization process and leave the other modality under-optimized. This imbalance limits the overall performance gain of multimodal models.

### Problem Background

Multimodal learning integrates information from different senses to understand the world more comprehensively, and in theory it should improve model performance. However, existing multimodal discriminative models usually optimize all modalities with a single unified objective. As a result, some modalities can be suppressed by the dominant modality during optimization, so the model does not fully exploit every modality.

### Specific Phenomenon

The paper points out that in some multimodal scenarios the better-performing modality (such as audio for the sound of wind, or vision for playing football) suppresses the optimization of the other modalities. For example, on the VGGSound dataset, the jointly trained multimodal model performs best on event classification, yet its internal visual and audio encoders perform significantly worse than the separately trained visual and audio models.

### Solution

To alleviate this imbalance, the authors propose "On-the-fly Gradient Modulation" (OGM), which adaptively controls the optimization of each modality by dynamically monitoring the difference between the modalities' contributions to the learning objective. In addition, because gradient modulation may reduce generalization ability, extra Gaussian noise is introduced to counteract this.

### Method Overview

1. **Unbalanced Optimization Analysis**:
   - Analyzes the imbalance in the optimization of multimodal models: the better-performing modality dominates the optimization process, leaving the other modalities under-optimized.
   - Explains the cause of this imbalance through formula derivation.
2. **On-the-fly Gradient Modulation (OGM)**:
   - Dynamically monitors the contribution difference between modalities and adaptively adjusts the gradients according to it.
   - Scales the gradients by a coefficient \( k_u \), whose value depends on the contribution ratio \( \rho_u \) of the modality.
3. **Generalization Ability Enhancement (GE)**:
   - Introduces additional Gaussian noise to prevent the possible decline in generalization ability caused by gradient modulation.
   - The added noise restores, or even enhances, the generalization ability of SGD optimization.

### Experimental Results

The authors ran experiments on multiple multimodal tasks and datasets. The results show that, combined with the OGM-GE strategy, both conventional fusion methods and existing multimodal methods achieve significant performance improvements, which supports the effectiveness and versatility of OGM-GE.

### Summary

The main contributions of this paper are:

- Identifying and analyzing the unbalanced optimization phenomenon in multimodal models.
- Proposing On-the-fly Gradient Modulation (OGM) and Generalization Ability Enhancement (GE) to address this problem.
- Demonstrating experimentally the effectiveness and versatility of the method on multiple multimodal tasks.
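
The OGM and GE steps described in the Method Overview can be illustrated with a small sketch. It is not the authors' implementation. The summary only states that \( k_u \) depends on \( \rho_u \), so the following are assumptions made for illustration: the contribution ratio is computed from each modality's softmax score on the ground-truth class, the coefficient takes a `1 - tanh(alpha * rho)` form for the dominant modality, and `alpha` and `noise_std` are hypothetical hyperparameters. All function names are hypothetical as well.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def contribution_ratio(logits_a, logits_v, labels):
    """Assumed rho for modality a relative to v: ratio of the summed
    softmax scores each modality assigns to the ground-truth class."""
    score_a = sum(softmax(r)[y] for r, y in zip(logits_a, labels))
    score_v = sum(softmax(r)[y] for r, y in zip(logits_v, labels))
    return score_a / score_v


def modulation_coeff(rho, alpha=0.5):
    """Assumed form of k: shrink the gradient only for the dominant
    modality (rho > 1); leave the weaker modality untouched (k = 1)."""
    return 1.0 - math.tanh(alpha * rho) if rho > 1.0 else 1.0


def modulate(grad, k, noise_std, rng):
    """Scale a gradient vector by k, then add zero-mean Gaussian noise
    (the GE step). noise_std is a hypothetical fixed scale here."""
    return [k * g + rng.gauss(0.0, noise_std) for g in grad]


# Toy batch: audio logits are confident and correct, visual logits are not.
logits_audio = [[4.0, 0.0], [0.0, 3.0]]
logits_visual = [[0.2, 0.0], [0.0, 0.1]]
labels = [0, 1]

rho_audio = contribution_ratio(logits_audio, logits_visual, labels)
k_audio = modulation_coeff(rho_audio)          # below 1: audio is slowed down
k_visual = modulation_coeff(1.0 / rho_audio)   # exactly 1: visual is unchanged

rng = random.Random(0)
new_audio_grad = modulate([0.3, -0.2, 0.1], k_audio, noise_std=0.01, rng=rng)
```

In this toy batch the audio branch dominates, so only its gradient is scaled down while the visual branch keeps its full update. In a real training loop the same scaling would be applied to each encoder's parameter gradients between the backward pass and the optimizer step.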

Balanced Multimodal Learning via On-the-fly Gradient Modulation

On-the-fly Modulation for Balanced Multimodal Learning

Classifier-guided Gradient Modulation for Enhanced Multimodal Learning

Boosting Multi-modal Model Performance with Adaptive Gradient Modulation

Improving Multimodal Learning with Multi-Loss Gradient Modulation

Multimodal Classification via Modal-Aware Interactive Enhancement

Diagnosing and Re-learning for Balanced Multimodal Learning

Towards Balanced Active Learning for Multimodal Classification

MMPareto: Boosting Multimodal Learning with Innocent Unimodal Assistance

Gradient-Guided Modality Decoupling for Missing-Modality Robustness

Learning to Balance the Learning Rates Between Various Modalities Via Adaptive Tracking Factor

Multimodal Imbalance-Aware Gradient Modulation for Weakly-supervised Audio-Visual Video Parsing

Multimodal Boosting: Addressing Noisy Modalities and Identifying Modality Contribution

The Balanced Multi-Modal Spiking Neural Networks with Online Loss Adjustment and Time Alignment

Learning to Rebalance Multi-Modal Optimization by Adaptively Masking Subnetworks

Multimodal Representation Learning by Alternating Unimodal Adaptation

ReconBoost: Boosting Can Achieve Modality Reconcilement

Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception

Multimodal Fusion Balancing Through Game-Theoretic Regularization

Learn to Combine Modalities in Multimodal Deep Learning

FULLER: Unified Multi-modality Multi-task 3D Perception Via Multi-level Gradient Calibration