Using Multiple Instance Learning to Build Multimodal Representations

Peiqi Wang, William M. Wells, Seth Berkowitz, Steven Horng, Polina Golland
DOI: https://doi.org/10.1007/978-3-031-34048-2_35
2023-03-10
Abstract: Image-text multimodal representation learning aligns data across modalities and enables important medical applications, e.g., image classification, visual grounding, and cross-modal retrieval. In this work, we establish a connection between multimodal representation learning and multiple instance learning. Based on this connection, we propose a generic framework for constructing permutation-invariant score functions with many existing multimodal representation learning approaches as special cases. Furthermore, we use the framework to derive a novel contrastive learning approach and demonstrate that our method achieves state-of-the-art results in several downstream tasks.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the problem of learning a joint multimodal representation of medical images and text, with the goal of making clinical workflows more efficient. Specifically, the authors propose an approach based on multiple instance learning (MIL) for constructing image-text multimodal representations. The approach relies on self-supervised learning to reduce the need for labeled data, and it handles the local alignment between image regions and text fragments. This leads to better performance on several downstream tasks, including image classification, visual grounding, and cross-modal retrieval. The core contributions of the paper are as follows:
1. **Connecting multimodal representation learning and multiple instance learning**: The authors show that the two settings share similar assumptions and goals, and they use this connection to propose a new algorithmic framework for learning joint representations of images and text.
2. **A general framework**: The framework constructs permutation-invariant score functions and covers many existing multimodal representation learning methods as special cases. It applies to image-text data and can also be extended to other types of multimodal data.
3. **The LSE+NL method**: This new contrastive learning method combines local and global image-document score functions and exploits correlations between image regions. Experimental results show that it achieves state-of-the-art performance on multiple downstream tasks.
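To make the idea of a permutation-invariant image-document score more concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes that "LSE" refers to log-sum-exp pooling over region-sentence similarities, and it does not model the "NL" part or the global score at all. The function names `lse_score` and `info_nce`, the `temperature` parameter, and the random embeddings are all hypothetical and chosen only for illustration.

```python
import numpy as np


def unit_rows(x):
    """L2-normalize each row so that dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def lse_score(regions, sentences, temperature=0.1):
    """Hypothetical permutation-invariant image-document score.

    regions:   (R, d) array, one embedding per image region (the image "bag")
    sentences: (S, d) array, one embedding per sentence (the document "bag")

    All region-sentence similarities are pooled with a temperature-scaled
    log-mean-exp, a smooth stand-in for the maximum. The pooling ignores the
    order of regions and sentences, so the score is permutation-invariant.
    """
    sims = regions @ sentences.T              # (R, S) pairwise similarities
    scaled = sims.ravel() / temperature
    peak = scaled.max()                       # subtract the max for stability
    return temperature * (peak + np.log(np.mean(np.exp(scaled - peak))))


def info_nce(scores):
    """Symmetric contrastive loss for a (B, B) score matrix.

    scores[i, j] is the score of image i with document j; matched pairs are
    assumed to lie on the diagonal.
    """
    def matched_nll(s):
        s = s - s.max(axis=1, keepdims=True)
        log_prob = s - np.log(np.exp(s).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the image-to-document and document-to-image directions.
    return 0.5 * (matched_nll(scores) + matched_nll(scores.T))


# Toy batch: 3 images with 5 regions each, 3 reports with 4 sentences each.
rng = np.random.default_rng(0)
images = [unit_rows(rng.normal(size=(5, 8))) for _ in range(3)]
reports = [unit_rows(rng.normal(size=(4, 8))) for _ in range(3)]
scores = np.array([[lse_score(img, rep) for rep in reports] for img in images])
print(info_nce(scores))
```

Because the pooled value always lies between the mean and the maximum of the pairwise similarities, lowering `temperature` moves the score toward the single best-matching region-sentence pair, while raising it moves the score toward a plain average.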
In summary, the paper proposes a new multimodal representation learning method built on ideas from multiple instance learning, with the aim of improving how medical image and text data are processed, particularly when labeled data is scarce.