Abstract: Supervised multi-modal learning involves mapping multiple modalities to a target label. Previous studies in this field have concentrated on capturing in isolation either the inter-modality dependencies (the relationships between different modalities and the label) or the intra-modality dependencies (the relationships within a single modality and the label). We argue that these conventional approaches that rely solely on either inter- or intra-modality dependencies may not be optimal in general. We view the multi-modal learning problem through the lens of generative models, where we consider the target as a source of multiple modalities and the interaction between them. Towards that end, we propose the inter- & intra-modality modeling (I2M2) framework, which captures and integrates both the inter- and intra-modality dependencies, leading to more accurate predictions. We evaluate our approach using real-world healthcare and vision-and-language datasets with state-of-the-art models, demonstrating superior performance over traditional methods focusing only on one type of modality dependency.

What problem does this paper attempt to address?

The paper addresses how, in multimodal learning, to capture and integrate two kinds of dependency: inter-modality (cross-modal) dependencies, i.e., the relationships between different modalities and the target label, and intra-modality dependencies, i.e., the relationships between a single modality and the target label. Traditional methods usually focus on only one of these and ignore the other, which can hurt model performance. The authors argue that predicting the target label accurately requires modeling both at once, and they propose a new framework, Inter- & Intra-Modality Modeling (I2M2), that handles both types of dependency. Specifically, the paper makes the following points:

1. **Limitations of existing methods**: Existing multimodal learning methods capture either inter-modality dependencies or intra-modality dependencies, but not both. These methods perform well in some cases, but in others they can be less effective than a single-modality learner or a simple combination of single-modality learners.

2. **The proposed framework**: The authors recast the multimodal learning problem from the perspective of a generative model and propose the I2M2 framework. The framework captures the statistical dependencies between modalities by introducing a selection variable \(v\), so that inter-modality and intra-modality dependencies are modeled together. The selection variable \(v\) is always set to 1, indicating that the label modulates both the generation of each modality and the interaction between them.

3. **Theoretical basis**: The data-generating process assumes that the label \(y\) generates the data \(x\) and \(x'\) of two modalities, and it defines the statistical dependencies between these modalities and the label. The joint probability distribution can be written as:

   \[ p(y, x, x', v = 1) = p(y)\, p(x \mid y)\, p(x' \mid y)\, p(v = 1 \mid x, x', y) \]

4. **Experimental verification**: The authors evaluate the I2M2 framework on multiple real-world datasets, including healthcare datasets and vision-and-language tasks. The results show that I2M2 outperforms methods that model only one type of dependency, and that it provides a consistent improvement regardless of which type of dependency matters more for a given task.

In summary, the main contribution of the paper is a multimodal learning framework that handles inter-modality and intra-modality dependencies simultaneously, thereby improving prediction accuracy.
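To make the factorization concrete, here is a minimal sketch of how a label posterior could be computed from it. Applying Bayes' rule to \(p(x \mid y)\) and \(p(x' \mid y)\) in the joint distribution gives \(p(y \mid x, x', v = 1) \propto p(y \mid x)\, p(y \mid x')\, p(v = 1 \mid x, x', y) / p(y)\), so the terms can be added in log space and renormalized. The function names, the argument layout, and the idea of supplying each term as a list of per-class log-probabilities are assumptions made for illustration; the summary above does not say how the paper implements or trains these components.

```python
import math


def log_softmax(scores):
    """Normalize a list of log-scores so that their exponentials sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def combine_posterior(log_prior, log_p_y_given_x, log_p_y_given_x2, log_p_v1):
    """Return log p(y | x, x', v=1) for every class y.

    All arguments are lists indexed by class label y (hypothetical inputs):
      log_prior        : log p(y)
      log_p_y_given_x  : log p(y | x),  a predictor for the first modality
      log_p_y_given_x2 : log p(y | x'), a predictor for the second modality
      log_p_v1         : log p(v=1 | x, x', y), the interaction term
    """
    # Posterior is proportional to p(y|x) * p(y|x') * p(v=1|x,x',y) / p(y).
    scores = [
        a + b + c - d
        for a, b, c, d in zip(log_p_y_given_x, log_p_y_given_x2, log_p_v1, log_prior)
    ]
    return log_softmax(scores)


def to_log(probs):
    """Convert a list of probabilities to log-probabilities."""
    return [math.log(p) for p in probs]


# Toy two-class example with made-up numbers.
posterior = [
    math.exp(lp)
    for lp in combine_posterior(
        log_prior=to_log([0.5, 0.5]),
        log_p_y_given_x=to_log([0.7, 0.3]),
        log_p_y_given_x2=to_log([0.6, 0.4]),
        log_p_v1=to_log([0.5, 0.5]),
    )
]
print(posterior)
```

In this toy example the interaction term is the same for both classes, so the result reduces to a product of the two single-modality predictors; an interaction term that differs across classes would shift the posterior toward the class it favors.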

Jointly Modeling Inter- & Intra-Modality Dependencies for Multi-modal Learning

Intra- and Inter-Modal Curriculum for Multimodal Learning

Learn to Combine Modalities in Multimodal Deep Learning

Joint Multimodal Learning with Deep Generative Models

A Mathematical Framework for Characterizing Dependency Structures of Multimodal Learning

MDA: An Interpretable and Scalable Multi-Modal Fusion under Missing Modalities and Intrinsic Noise Conditions

A Mathematical Framework to Characterize the Dependency Structures in Multimodal Learning with Minimax Principle

One-stage Modality Distillation for Incomplete Multimodal Learning

Inter-modality Dependence Induced Data Recovery for MCI Conversion Prediction

Comprehensive Semi-Supervised Multi-Modal Learning.

Detached and Interactive Multimodal Learning

Deep Multi-Modal Sets

Multimodal Generative Models for Scalable Weakly-Supervised Learning

Semi-Supervised Multi-Modal Learning with Incomplete Modalities

Multimodal Understanding Through Correlation Maximization and Minimization

What Makes Multimodal In-Context Learning Work?

Learning Unseen Modality Interaction

Joint Dictionary Learning and Semantic Constrained Latent Subspace Projection for Cross-Modal Retrieval.

Progressive Learning of a Multimodal Classifier Accounting for Different Modality Combinations

On-the-fly Modulation for Balanced Multimodal Learning

Discriminative multimodal learning via conditional priors in generative models