Abstract: Image-text multimodal representation learning aligns data across modalities and enables important medical applications, e.g., image classification, visual grounding, and cross-modal retrieval. In this work, we establish a connection between multimodal representation learning and multiple instance learning. Based on this connection, we propose a generic framework for constructing permutation-invariant score functions with many existing multimodal representation learning approaches as special cases. Furthermore, we use the framework to derive a novel contrastive learning approach and demonstrate that our method achieves state-of-the-art results in several downstream tasks.

What problem does this paper attempt to address?

This paper aims to learn a joint multimodal representation of medical images and their accompanying text, with the goal of improving the efficiency of the clinical workflow. Specifically, the authors propose a method based on Multiple Instance Learning (MIL) for constructing image-text multimodal representations. The method uses self-supervised learning to reduce the need for labeled data, and it handles the local alignment problem between image regions and text fragments. This leads to better performance on several downstream tasks, such as image classification, visual grounding, and cross-modal retrieval. The core contributions of the paper are as follows:

1. **Established the connection between multimodal representation learning and multiple instance learning**: The authors show that the two learning settings share similar assumptions and goals, and they use this connection to propose a new algorithmic framework for learning joint representations of images and texts.
2. **Proposed a general framework**: The framework constructs permutation-invariant score functions and covers many existing multimodal representation learning methods as special cases. It applies to image-text data and can also be extended to other types of multimodal data.
3. **Introduced the LSE+NL method**: This is a new contrastive learning method that combines local and global image-document score functions and exploits the correlations between image regions. Experimental results show that it achieves state-of-the-art performance on multiple downstream tasks.

In summary, the paper proposes a new multimodal representation learning method that draws on ideas from multiple instance learning to improve the processing of medical image and text data, particularly when labeled data is scarce.
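To make the idea of a permutation-invariant score function concrete, the following is a minimal sketch, not the authors' implementation. It treats an image as a bag of region embeddings, scores each region against one text embedding, and pools the local scores into a single image-text score. It assumes that "LSE" refers to log-sum-exp pooling, which the summary above does not spell out; the function names (`local_scores`, `lse_pool`, `bag_score`), the cosine-similarity choice, and the `temperature` parameter are all hypothetical and chosen for illustration only.

```python
import numpy as np


def local_scores(regions, text):
    """Cosine similarity between each region embedding and one text embedding.

    regions: array of shape (n_regions, dim); text: array of shape (dim,).
    Illustrative choice of local score, not taken from the paper.
    """
    regions = regions / np.linalg.norm(regions, axis=1, keepdims=True)
    text = text / np.linalg.norm(text)
    return regions @ text


def lse_pool(scores, temperature=1.0):
    """Log-sum-exp pooling: a smooth, order-independent stand-in for max."""
    z = scores / temperature
    m = z.max()  # subtract the max for numerical stability
    return temperature * (m + np.log(np.exp(z - m).sum()))


def bag_score(regions, text, pool="lse"):
    """Aggregate local scores into one image-text score.

    Every pooling option here ignores the order of the regions,
    so the resulting score is permutation-invariant.
    """
    s = local_scores(regions, text)
    if pool == "max":
        return s.max()
    if pool == "mean":
        return s.mean()
    return lse_pool(s)


# Small demonstration on random embeddings (synthetic data, not from the paper).
rng = np.random.default_rng(0)
regions = rng.normal(size=(6, 8))   # 6 image regions, 8-dimensional embeddings
text = rng.normal(size=8)           # one text embedding
shuffled = regions[rng.permutation(6)]

for pool in ("max", "mean", "lse"):
    same = np.isclose(bag_score(regions, text, pool),
                      bag_score(shuffled, text, pool))
    print(pool, "is permutation-invariant on this example:", bool(same))
```

Max, mean, and log-sum-exp are interchangeable pooling choices here, which illustrates how a single framework can contain several scoring schemes as special cases. In a contrastive setup, a score of this kind would be computed for matched and mismatched image-text pairs and fed into a contrastive loss; that step is omitted from the sketch.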

Using Multiple Instance Learning to Build Multimodal Representations

Deep Vision Multimodal Learning: Methodology, Benchmark, and Trend

Multimodal Contrastive Learning via Uni-Modal Coding and Cross-Modal Prediction for Multimodal Sentiment Analysis

Multimodal Contrastive Training for Visual Representation Learning

Contrastive Learning on Multimodal Analysis of Electronic Health Records

Multimodal Representation Learning by Alternating Unimodal Adaptation

What to align in multimodal contrastive learning?

On the Generalization of Multi-modal Contrastive Learning

Cross-modal contrastive learning for multimodal sentiment recognition

Improving the Modality Representation with Multi-View Contrastive Learning for Multimodal Sentiment Analysis

Understanding and Constructing Latent Modality Structures in Multi-modal Representation Learning

Harnessing Shared Relations via Multimodal Mixup Contrastive Learning for Multimodal Classification

MXM-CLR: A Unified Framework for Contrastive Learning of Multifold Cross-Modal Representations

Identifiability Results for Multimodal Contrastive Learning

Benchmarking Vision-Language Contrastive Methods for Medical Representation Learning

Contrastive Multimodal Fusion with TupleInfoNCE

Multimodal Representation Learning via Maximization of Local Mutual Information

Revealing Multimodal Contrastive Representation Learning through Latent Partial Causal Models

Multi-Modal Representation via Contrastive Learning with Attention Bottleneck Fusion and Attentive Statistics Features

Linking Representations with Multimodal Contrastive Learning

Explaining and Mitigating the Modality Gap in Contrastive Multimodal Learning