Jointly Modeling Inter- & Intra-Modality Dependencies for Multi-modal Learning

Divyam Madaan, Taro Makino, Sumit Chopra, Kyunghyun Cho
2024-12-06
Abstract: Supervised multi-modal learning involves mapping multiple modalities to a target label. Previous studies in this field have concentrated on capturing in isolation either the inter-modality dependencies (the relationships between different modalities and the label) or the intra-modality dependencies (the relationships within a single modality and the label). We argue that these conventional approaches that rely solely on either inter- or intra-modality dependencies may not be optimal in general. We view the multi-modal learning problem from the lens of generative models where we consider the target as a source of multiple modalities and the interaction between them. Towards that end, we propose inter- & intra-modality modeling (I2M2) framework, which captures and integrates both the inter- and intra-modality dependencies, leading to more accurate predictions. We evaluate our approach using real-world healthcare and vision-and-language datasets with state-of-the-art models, demonstrating superior performance over traditional methods focusing only on one type of modality dependency.
Computer Vision and Pattern Recognition, Computation and Language, Machine Learning
What problem does this paper attempt to address?
The paper addresses how, in multimodal learning, to capture and integrate two kinds of dependencies: inter-modality dependencies (the relationships between different modalities and the target label) and intra-modality dependencies (the relationships between a single modality and the target label). Conventional methods usually model only one of these and ignore the other, which can hurt model performance. The authors argue that predicting the target label accurately requires modeling both at once, and they propose a new framework, Inter- & Intra-Modality Modeling (I2M2), that handles both types of dependency. Specifically, the paper makes the following points:

1. **Limitations of existing methods**: Existing multimodal learning methods focus on either inter-modality dependencies (how the modalities jointly relate to the label) or intra-modality dependencies (how each modality on its own relates to the label). These methods perform well in some cases, but in others they can be less effective than a single-modality learner or a simple combination of single-modality learners.
2. **The proposed new framework**: The authors recast the multimodal learning problem from the perspective of generative models and propose the I2M2 framework. The framework captures the statistical dependencies between modalities by introducing a selection variable \(v\), so that inter-modality and intra-modality dependencies are modeled together. The selection variable \(v\) is always set to 1, and its conditional distribution \(p(v = 1 \mid x, x', y)\) expresses how the label modulates the interaction between the modalities.
3. **Theoretical basis**: The data-generating process assumes that the label \(y\) generates the data \(x\) and \(x'\) of two modalities, and it defines the statistical dependencies between these modalities and the label. The joint probability distribution can be expressed as:

   \[ p(y, x, x', v = 1) = p(y)\, p(x \mid y)\, p(x' \mid y)\, p(v = 1 \mid x, x', y) \]

4. **Experimental verification**: The authors evaluate the I2M2 framework on multiple real-world datasets, including healthcare datasets and vision-and-language tasks. The reported results show that I2M2 outperforms methods that model only one type of dependency, and that the improvement holds regardless of which type of dependency matters more for the task.

In summary, the main contribution of this paper is a new multimodal learning framework that handles inter-modality and intra-modality dependencies simultaneously, thereby improving prediction accuracy.
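To make the factorization of \(p(y, x, x', v = 1)\) concrete, the sketch below evaluates it on a small discrete example and normalizes over \(y\) to obtain the posterior \(p(y \mid x, x', v = 1)\). It is only an illustration of the formula, not the authors' implementation: every probability table and the helper name `posterior` are hypothetical and chosen by hand. Replacing the selection term with a constant leaves only the prior and the per-modality terms, so the comparison shows how the selection term can change the prediction.

```python
import numpy as np

# Hypothetical toy tables (not from the paper): 2 labels, 2 values per modality.
p_y = np.array([0.5, 0.5])                # p(y)
p_x_given_y = np.array([[0.7, 0.3],       # p(x | y=0)
                        [0.4, 0.6]])      # p(x | y=1)
p_xp_given_y = np.array([[0.6, 0.4],      # p(x' | y=0)
                         [0.5, 0.5]])     # p(x' | y=1)
# p(v=1 | x, x', y), indexed as [y, x, x'].
p_v1 = np.array([[[0.9, 0.1], [0.1, 0.9]],
                 [[0.1, 0.9], [0.9, 0.1]]])


def posterior(x, xp, selection=p_v1):
    """Return p(y | x, x', v=1) by normalizing the joint over y."""
    joint = p_y * p_x_given_y[:, x] * p_xp_given_y[:, xp] * selection[:, x, xp]
    return joint / joint.sum()


# With a constant selection term, only the prior and per-modality terms remain.
intra_only = posterior(0, 1, selection=np.ones_like(p_v1))
# With the full factorization, the selection term also contributes.
full = posterior(0, 1)

print("intra-modality terms only:", intra_only)
print("full factorization:       ", full)
```

In this toy setting the per-modality terms alone favor label 0 for the input \((x = 0, x' = 1)\), while the full factorization favors label 1, because the hand-picked selection table assigns that pair a much higher probability under label 1.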