Provable Dynamic Fusion for Low-Quality Multimodal Data

Qingyang Zhang, Haitao Wu, Changqing Zhang, Qinghua Hu, Huazhu Fu, Joey Tianyi Zhou, Xi Peng
2023-06-06
Abstract: The inherent challenge of multimodal fusion is to precisely capture the cross-modal correlation and flexibly conduct cross-modal interaction. To fully release the value of each modality and mitigate the influence of low-quality multimodal data, dynamic multimodal fusion emerges as a promising learning paradigm. Despite its widespread use, theoretical justifications in this field are still notably lacking. Can we design a provably robust multimodal fusion method? This paper provides theoretical understandings to answer this question under a most popular multimodal fusion framework from the generalization perspective. We proceed to reveal that several uncertainty estimation solutions are naturally available to achieve robust multimodal fusion. Then a novel multimodal fusion framework termed Quality-aware Multimodal Fusion (QMF) is proposed, which can improve the performance in terms of classification accuracy and model robustness. Extensive experimental results on multiple benchmarks can support our findings.
Machine Learning, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses how to achieve reliable dynamic multimodal fusion when the multimodal data are of low quality. Specifically, it asks how a dynamic fusion mechanism can improve multimodal learning, in particular classification accuracy and model robustness, when data quality varies. Existing multimodal fusion methods perform poorly on low-quality data, especially under heavy noise or when quality is imbalanced across modalities, which limits their practical use. The paper therefore proposes a new multimodal fusion framework, Quality-aware Multimodal Fusion (QMF), which addresses these issues by introducing uncertainty estimation.

The main contributions of the paper are:
1. A rigorous theoretical framework that explains the advantages of, and criteria for, dynamic multimodal fusion from the perspective of generalization ability.
2. A demonstration that dynamic fusion methods can outperform static fusion methods under certain conditions (e.g., when fusion weights are negatively correlated with single-modality generalization error).
3. A new dynamic multimodal fusion method, QMF, which evaluates the quality of each modality through techniques such as energy scores and dynamically adjusts the fusion weights accordingly, thereby achieving better performance on low-quality data.

In extensive experiments on multiple benchmarks, QMF outperforms the compared methods in both average accuracy and worst-case accuracy and also performs well in uncertainty estimation, which supports the theoretical analysis.
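
The idea of weighting modalities by an energy-based quality estimate can be illustrated with a minimal Python sketch. It is not the authors' implementation. The energy score below is the standard free-energy form, -T * log(sum(exp(logit / T))), while the softmax weighting over negative energies, the function names, and the example logits are assumptions made purely for illustration; the summary above does not specify how QMF maps energy scores to fusion weights.

```python
import math


def energy_score(logits, temperature=1.0):
    """Free-energy score -T * log(sum(exp(logit / T))).

    Lower energy corresponds to a more confident prediction.
    """
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    return -temperature * (m + math.log(sum(math.exp(s - m) for s in scaled)))


def fusion_weights(energies):
    """Hypothetical weighting rule: softmax over negative energies,
    so that a lower-energy (more confident) modality gets a larger weight."""
    neg = [-e for e in energies]
    m = max(neg)
    exps = [math.exp(v - m) for v in neg]
    total = sum(exps)
    return [v / total for v in exps]


def fuse_logits(per_modality_logits, temperature=1.0):
    """Late fusion: per-sample weighted sum of the unimodal logits."""
    energies = [energy_score(l, temperature) for l in per_modality_logits]
    weights = fusion_weights(energies)
    n_classes = len(per_modality_logits[0])
    fused = [
        sum(w * l[c] for w, l in zip(weights, per_modality_logits))
        for c in range(n_classes)
    ]
    return fused, weights


# Made-up logits: a confident image branch and a near-uniform
# (e.g. noise-corrupted) text branch for the same sample.
image_logits = [4.0, 0.5, 0.2]
text_logits = [0.4, 0.5, 0.3]
fused, weights = fuse_logits([image_logits, text_logits])
```

Because the weights are recomputed for every sample, a modality that is corrupted for one input is down-weighted for that input only, which is what distinguishes this kind of dynamic fusion from a static scheme with fixed weights.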