Abstract: In order to perform multimodal fusion of heterogeneous signals, we need to understand their interactions: how each modality individually provides information useful for a task and how this information changes in the presence of other modalities. In this paper, we perform a comparative study of how humans annotate two categorizations of multimodal interactions: (1) partial labels, where different annotators annotate the label given the first, second, and both modalities, and (2) counterfactual labels, where the same annotator annotates the label given the first modality before asking them to explicitly reason about how their answer changes when given the second. We further propose an alternative taxonomy based on (3) information decomposition, where annotators annotate the degrees of redundancy: the extent to which modalities individually and together give the same predictions, uniqueness: the extent to which one modality enables a prediction that the other does not, and synergy: the extent to which both modalities enable one to make a prediction that one would not otherwise make using individual modalities. Through experiments and annotations, we highlight several opportunities and limitations of each approach and propose a method to automatically convert annotations of partial and counterfactual labels to information decomposition, yielding an accurate and efficient method for quantifying multimodal interactions.

What problem does this paper attempt to address?

### What problems does this paper attempt to solve?

This paper aims to quantify the interactions between modalities in multimodal fusion. Specifically, the authors ask how much task-relevant information each modality provides on its own and how that information changes in the presence of other modalities. The paper examines three approaches:

1. **Partial Labels**:
   - Different annotators assign a label given only the first modality, only the second modality (such as video or language), and both modalities together.
   - The goal is to understand how informative each modality is on its own and how the prediction changes when the modalities are combined.
2. **Counterfactual Labels**:
   - The same annotator first assigns a label given one modality, then explicitly reasons about how that answer changes when the second modality is added.
   - This approach can more directly measure the influence of the second modality on the prediction.
3. **Information Decomposition**:
   - A taxonomy based on information theory is proposed that decomposes the total information provided by two modalities into redundancy, uniqueness, and synergy.
   - Redundancy is the extent to which the modalities, individually and together, give the same predictions.
   - Uniqueness is the extent to which one modality enables a prediction that the other does not.
   - Synergy is the extent to which both modalities together enable a prediction that neither would enable on its own.

### Specific problems

- **How to quantify the interactions between modalities?**
  - Interactions are quantified and compared through three approaches: partial labels, counterfactual labels, and information decomposition.
- **How to improve the reliability and efficiency of multimodal data annotation?**
  - A new annotation scheme is proposed in which human annotators directly rate the degrees of redundancy, uniqueness, and synergy.
- **How to convert existing annotation methods into information decomposition form?**
  - A method is proposed for automatically converting partial labels and counterfactual labels into information decomposition values, making the different annotation schemes more consistent and comparable in practice.

### Summary

The core problem of this paper is how to understand and quantify interactions in multimodal data, especially in complex real-world datasets. By introducing information decomposition, the authors aim to better explain the relationships between modalities and to provide new theoretical and practical tools for multimodal machine learning.
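
To make the redundancy, uniqueness, and synergy terms above more concrete, the following is a minimal sketch of one way partial labels could be turned into decomposition values. It is not the paper's conversion method, which is not described on this page. It assumes the partial labels are integer class ids, builds an empirical joint distribution over (label from modality 1, label from modality 2, label from both), and applies the simple "minimum mutual information" definition of redundancy as a stand-in. The function names `joint_from_labels` and `mmi_decomposition` are hypothetical.

```python
import numpy as np


def mutual_information(p_xy):
    """Mutual information I(X;Y) in bits from a 2-D joint probability table."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal of X, shape (n, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal of Y, shape (1, m)
    mask = p_xy > 0                          # skip zero cells to avoid log(0)
    ratio = p_xy[mask] / (p_x @ p_y)[mask]
    return float(np.sum(p_xy[mask] * np.log2(ratio)))


def joint_from_labels(y1, y2, y12, n_classes):
    """Empirical joint p(y1, y2, y12) from partial labels (hypothetical helper).

    y1, y2: labels assigned when seeing only modality 1 or only modality 2.
    y12:    labels assigned when seeing both modalities.
    """
    counts = np.zeros((n_classes, n_classes, n_classes))
    for a, b, c in zip(y1, y2, y12):
        counts[a, b, c] += 1
    return counts / counts.sum()


def mmi_decomposition(p):
    """Redundancy/uniqueness/synergy under the minimum-mutual-information rule.

    This is an assumed stand-in definition, not the estimator used in the paper.
    """
    p = np.asarray(p, dtype=float)
    n1, n2, ny = p.shape
    i1 = mutual_information(p.sum(axis=1))             # I(Y1; Y12)
    i2 = mutual_information(p.sum(axis=0))             # I(Y2; Y12)
    i12 = mutual_information(p.reshape(n1 * n2, ny))   # I(Y1, Y2; Y12)
    redundancy = min(i1, i2)
    return {
        "redundancy": redundancy,
        "unique_1": i1 - redundancy,
        "unique_2": i2 - redundancy,
        "synergy": i12 - i1 - i2 + redundancy,
    }


# Usage: an XOR-like pattern, where neither single-modality label
# predicts the joint label but the pair determines it.
xor_joint = joint_from_labels([0, 0, 1, 1], [0, 1, 0, 1], [0, 1, 1, 0], n_classes=2)
print(mmi_decomposition(xor_joint))
```

In the XOR-like usage example, all of the information ends up in the synergy term, while labels that are identical across the three annotation settings would be counted entirely as redundancy. With real annotations the empirical joint is estimated from few samples, so values computed this way should be read as rough indicators.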

Multimodal Fusion Interactions: A Study of Human and Automatic Quantification

Quantifying & Modeling Multimodal Interactions: An Information Decomposition Framework

Multimodal Learning Without Labeled Multimodal Data: Guarantees and Applications

Interpretation on Multi-modal Visual Fusion

Deep Multimodal Data Fusion

Efficient Low-rank Multimodal Fusion with Modality-Specific Factors

Optimal Multimodal Fusion for Multimedia Data Analysis

Dual Low-Rank Multimodal Fusion

Multimodal Language Analysis with Recurrent Multistage Fusion

High-Modality Multimodal Transformer: Quantifying Modality & Interaction Heterogeneity for High-Modality Representation Learning

Multimodal Metadata Fusion Using Causal Strength

Multimodal Fusion on Low-quality Data: A Comprehensive Survey

Improving Multimodal fusion via Mutual Dependency Maximisation

InterMulti: Multi-view Multimodal Interactions with Text-dominated Hierarchical High-order Fusion for Emotion Analysis

Exploiting "Quantum-like Interference" in Decision Fusion for Ranking Multimodal Documents

Tri-Modalities Fusion for Multimodal Sentiment Analysis

Deep Equilibrium Multimodal Fusion

Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis

Multimodal fusion for multimedia analysis: a survey

Cross-Attention is Not Enough: Incongruity-Aware Dynamic Hierarchical Fusion for Multimodal Affect Recognition