Abstract: Multimodal learning, which aims to understand and analyze information from multiple modalities, has achieved substantial progress in the supervised regime in recent years. However, the heavy dependence on data paired with expensive human annotations impedes scaling up models. Meanwhile, given the availability of large-scale unannotated data in the wild, self-supervised learning has become an attractive strategy to alleviate the annotation bottleneck. Building on these two directions, self-supervised multimodal learning (SSML) provides ways to learn from raw multimodal data. In this survey, we provide a comprehensive review of the state-of-the-art in SSML, in which we elucidate three major challenges intrinsic to self-supervised learning with multimodal data: (1) learning representations from multimodal data without labels, (2) fusion of different modalities, and (3) learning with unaligned data. We then detail existing solutions to these challenges. Specifically, we consider (1) objectives for learning from multimodal unlabeled data via self-supervision, (2) model architectures from the perspective of different multimodal fusion strategies, and (3) pair-free learning strategies for coarse-grained and fine-grained alignment. We also review real-world applications of SSML algorithms in diverse fields such as healthcare, remote sensing, and machine translation. Finally, we discuss challenges and future directions for SSML. A collection of related resources can be found at: https://github.com/ys-zong/awesome-self-supervised-multimodal-learning

What problem does this paper attempt to address?

The paper addresses three core challenges in Self-Supervised Multimodal Learning (SSML):

1. **Unlabeled Multimodal Representation Learning**: How to learn representations from multimodal data without labels?
2. **Fusion of Different Modalities**: How to effectively fuse information from different modalities?
3. **Learning from Unaligned Data**: How to handle partially or completely unaligned multimodal data?

### Specific Problem Analysis

#### 1. Unlabeled Multimodal Representation Learning

- **Background**: Multimodal learning usually depends on expensive, manually annotated data, which limits model scalability. Self-supervised learning alleviates this problem by exploiting large amounts of unannotated data.
- **Challenge**: How to design a self-supervised objective function so that the model learns useful representations from multimodal data without labels?
- **Solutions**:
  - **Instance Discrimination**: Distinguish positive from negative samples with contrastive learning, for example through the contrastive loss function \( L_{\text{Con}} \).
  - **Clustering**: Group data by semantic features using clustering methods.
  - **Mask Prediction**: Learn representations by predicting masked parts of the input, for example missing words in text.

#### 2. Fusion of Different Modalities

- **Background**: Multimodal data contains multiple types of information, and integrating this information effectively is a key issue.
- **Challenge**: How to design a model architecture so that information from different modalities can be effectively fused?
- **Solutions**:
  - **Joint Fusion**: Use a unified encoder and fusion module to fuse information from different modalities early.
  - **Independent Pretraining**: Pretrain each modality separately and then "stitch" the resulting models together through self-supervised methods.

#### 3. Learning from Unaligned Data

- **Background**: Multimodal data may have alignment problems at the coarse-grained level (such as image-caption pairs) and at the fine-grained level (such as bounding box-word pairs).
- **Challenge**: How to learn when the data is unaligned?
- **Solutions**:
  - **Coarse-grained Alignment**: Handle various pairing scenarios, such as noisy pairing, mixed pairing, and completely unpaired data.
  - **Fine-grained Alignment**: Derive fine-grained alignment relationships through implicit or explicit methods.

### Paper Contributions

- **Comprehensive Review**: The paper provides a comprehensive review covering the latest progress, methods, datasets, and implementations of SSML.
- **Classification System**: It proposes a new classification system and discusses in detail the challenges unique to SSML and their solutions.
- **Practical Applications**: It explores practical applications of SSML in fields such as healthcare, remote sensing, and machine translation, and discusses the associated technical challenges and social impacts.

By addressing these core challenges, SSML is expected to further advance multimodal learning and to improve the generalization ability and range of application of models.
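To make the instance-discrimination objective from the first challenge more concrete, here is a minimal sketch of a symmetric contrastive loss between two modalities. The summary names \( L_{\text{Con}} \) but does not give its form, so the InfoNCE-style formulation, the temperature value, and every function and variable name below are assumptions chosen for illustration, not the survey's definition. The sketch uses only the Python standard library and treats the i-th image and i-th text embedding as the positive pair, with all other texts in the batch as negatives.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(anchors, candidates, temperature=0.1):
    """Mean cross-entropy of selecting candidates[i] for anchors[i].

    Hypothetical helper: similarity is cosine similarity divided by an
    assumed temperature; all non-matching candidates act as negatives.
    """
    a = [l2_normalize(v) for v in anchors]
    c = [l2_normalize(v) for v in candidates]
    total = 0.0
    for i, a_i in enumerate(a):
        logits = [sum(x * y for x, y in zip(a_i, c_j)) / temperature for c_j in c]
        # Numerically stable log-sum-exp over all candidates.
        max_logit = max(logits)
        log_denom = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        # Negative log-probability of the matching (positive) pair.
        total += log_denom - logits[i]
    return total / len(a)


def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.1):
    """Average of the image-to-text and text-to-image InfoNCE terms."""
    return 0.5 * (
        info_nce(image_emb, text_emb, temperature)
        + info_nce(text_emb, image_emb, temperature)
    )


# Toy embeddings (made up for illustration): two images and two captions.
image_emb = [[1.0, 0.0], [0.0, 1.0]]
text_aligned = [[0.9, 0.1], [0.1, 0.9]]   # caption i matches image i
text_swapped = [[0.1, 0.9], [0.9, 0.1]]   # captions paired with the wrong image

loss_aligned = symmetric_contrastive_loss(image_emb, text_aligned)
loss_swapped = symmetric_contrastive_loss(image_emb, text_swapped)
print(f"aligned pairs: {loss_aligned:.4f}")
print(f"swapped pairs: {loss_swapped:.4f}")
```

In this toy setup the loss is lower when each caption is paired with its matching image than when the captions are swapped, which is the behavior a contrastive objective is meant to reward. In practice the embeddings would come from learned modality-specific encoders and the loss would be computed with a tensor library over large batches.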

Self-Supervised Multimodal Learning: A Survey
