Abstract: In multimodal machine learning, proper handling of cross-modal information is essential for obtaining an ideal joint embedding. Despite the progress made by recent fusion strategies, we hold that, before the fusion stage, unimodal representations inevitably contain noise that may hinder the correct learning of cross-modal dynamics and degrade multimodal fusion. It is therefore worthwhile to investigate how the information is utilized and how to make full use of it. Rethinking the process of leveraging multiple modalities for the joint embedding, we regard multimodal learning as a chemical reaction process in which two steps may benefit learning: 1) purification to filter out impurity, and 2) a catalyst to facilitate learning. In this paper, we propose a Multimodal Information Modulation (MIM) learning framework that modulates the contribution and utilization of cross-modal information by identifying and handling the ‘impurity’ and the ‘catalyst’ in multimodal learning. Specifically, a Unimodal Purification Network (UPN) is proposed to identify and explicitly filter out the impurity within each modality before fusion, which reduces the possibility of learning incorrect cross-modal dynamics. In addition, based on the intuition that useful information can guide model updating, we let it act as a catalyst that facilitates learning; this is achieved by the Knowledge Guidance Scheme (KGS), which considers both intra- and inter-modal scenarios. Unlike the majority of works, which emphasize the role of useful information in the fusion and inference stages, KGS considers its potential role in assisting the representation learning of weaker components. It also takes the modality dominance problem and sample variations into account during optimization. In short, MIM modulates useless and useful information to minimize and emphasize their respective contributions. Experimental results verify the effectiveness of the proposed method.
The code is available at https://github.com/zengy268/MIM.
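
To make the "purify before fusion" idea concrete, the following is a minimal NumPy sketch, not the authors' UPN. It assumes that purification can be approximated by a per-feature sigmoid gate applied to each modality independently, and that fusion is plain concatenation. The function names (`purify`, `fuse`), the modality names, the feature dimensions, and the random gate parameters are all hypothetical and chosen only for illustration.

```python
import numpy as np


def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))


def purify(x, W, b):
    """Scale every feature by a gate in (0, 1).

    Hypothetical stand-in for a purification step: features the gate
    considers noisy are shrunk toward zero before fusion.
    """
    gate = sigmoid(x @ W + b)
    return gate * x, gate


def fuse(features):
    """Assumed fusion operator: concatenate along the feature axis."""
    return np.concatenate(features, axis=-1)


rng = np.random.default_rng(0)
batch = 3
dims = {"text": 8, "audio": 4, "vision": 6}  # illustrative sizes only

# Random unimodal representations and (untrained) gate parameters.
inputs = {m: rng.normal(size=(batch, d)) for m, d in dims.items()}
params = {
    m: (rng.normal(scale=0.1, size=(d, d)), np.zeros(d))
    for m, d in dims.items()
}

# Purify each modality on its own, then fuse the purified features.
purified = {}
for m, x in inputs.items():
    W, b = params[m]
    purified[m], _ = purify(x, W, b)

joint = fuse([purified[m] for m in dims])
print(joint.shape)  # (3, 18)
```

In this sketch the gate can only shrink a feature, never amplify it, which mirrors the filtering role described in the abstract. The actual UPN design and the KGS guidance objective are not specified here and would need to be taken from the paper or the linked repository.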

Countering Modal Redundancy and Heterogeneity: A Self-Correcting Multimodal Fusion

Multimodal Fusion Method Based on Self-Attention Mechanism

Dual Low-Rank Multimodal Fusion

Multimodal Fusion with Co-attention Mechanism

Cross-Attention is Not Enough: Incongruity-Aware Dynamic Hierarchical Fusion for Multimodal Affect Recognition

Mutually Beneficial Transformer for Multimodal Data Fusion

A cross modal hierarchical fusion multimodal sentiment analysis method based on multi-task learning

Attention is not Enough: Mitigating the Distribution Discrepancy in Asynchronous Multimodal Sequence Fusion

MultiFuser: Multimodal Fusion Transformer for Enhanced Driver Action Recognition

Feature Fusion Based on Transformer for Cross-modal Retrieval

Tri-Modalities Fusion for Multimodal Sentiment Analysis

MEFusion: Unsupervised Mutual Enhancement for Multimodal Image Fusion

UniCat: Crafting a Stronger Fusion Baseline for Multimodal Re-Identification

Incomplete Multimodal Learning for Remote Sensing Data Fusion

Multi-Feature Fusion Multi-Modal Sentiment Analysis Model Based on Cross-Attention Mechanism

Multimodal Token Fusion for Vision Transformers

Optimal Multimodal Fusion for Multimedia Data Analysis

MIMF: Mutual Information-Driven Multimodal Fusion

Multimodal Reaction: Information Modulation for Cross-modal Representation Learning

Learn to Combine Modalities in Multimodal Deep Learning

A Novel Approach to Incomplete Multimodal Learning for Remote Sensing Data Fusion