Abstract: Multimodality representation learning, a technique for learning to embed information from different modalities and their correlations, has achieved remarkable success in a variety of applications, such as Visual Question Answering (VQA), Natural Language for Visual Reasoning (NLVR), and Vision Language Retrieval (VLR). In these applications, cross-modal interaction and complementary information from different modalities are crucial for advanced models to perform any multimodal task optimally, e.g., to understand, recognize, retrieve, or generate. Researchers have proposed diverse methods to address these tasks, and variants of transformer-based architectures have performed extraordinarily well on multiple modalities. This survey presents a comprehensive review of the literature on the evolution and enhancement of deep learning multimodal architectures that handle textual, visual, and audio features for diverse cross-modal and modern multimodal tasks. It summarizes (i) recent task-specific deep learning methodologies, (ii) pretraining types and multimodal pretraining objectives, (iii) state-of-the-art pretrained multimodal approaches through to unifying architectures, and (iv) multimodal task categories and possible future improvements for better multimodal learning. Moreover, we provide a dataset section for new researchers that covers most of the benchmarks for pretraining and finetuning. Finally, major challenges, gaps, and potential research topics are explored. A continuously updated paper list related to our survey (https://github.com/marslanm/multimodality-representation-learning) is maintained on GitHub.

What problem does this paper attempt to address?

This paper addresses several key challenges in multimodal representation learning: the heterogeneity gap between modalities (i.e., the differences between data types), effective methods for handling cross-modal tasks, and how pretraining and fine-tuning can improve the performance of multimodal systems. Specifically, the paper aims to:

1. **Summarize the development of multimodal representation learning**: Starting from the basic concepts of multimodal learning, it traces progress in fields such as natural language processing and computer vision, with emphasis on the recent evolution of deep-learning-based methods and techniques.
2. **Explore pretraining techniques**: It discusses in detail the different types of pretraining, including Self-Supervised Learning (SSL), Continual Pretraining (CPT), and Simultaneous Pretraining (SPT), and how they are applied in multimodal settings to improve the generalization ability and efficiency of models.
3. **Present specific methods for multimodal tasks**: It introduces solutions to multimodal tasks such as Visual Question Answering (VQA), Natural Language for Visual Reasoning (NLVR), and Vision Language Retrieval (VLR), and analyzes how Transformer-based models have achieved remarkable results on these tasks.
4. **Outline future research directions**: It points out the main challenges in current multimodal learning, such as high model complexity and large computational resource requirements, and proposes possible research directions, including more efficient pretraining methods and more unified multimodal architecture designs.

In summary, this review paper aims to comprehensively summarize the latest progress in the field of multimodal representation learning, provide a systematic reference framework for researchers, and promote the further development of this field.

Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications

Deep Vision Multimodal Learning: Methodology, Benchmark, and Trend

Multimodal Learning with Transformers: A Survey

Multimodal Pretraining from Monolingual to Multilingual

A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets

A Survey of Vision-Language Pre-Trained Models

A Survey of Multimodal Large Language Model from A Data-centric Perspective

Research Progress on Vision-Language Multimodal Pretraining Model Technology

Vision+X: A Survey on Multimodal Learning in the Light of Data

Multimodal Pretraining, Adaptation, and Generation for Recommendation: A Survey

Recent Advances and Trends in Multimodal Deep Learning: A Review

Deep Multimodal Learning with Missing Modality: A Survey

Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey

Self-Supervised Multimodal Learning: A Survey

VLP: A Survey on Vision-language Pre-training

Multimodality in meta-learning: A comprehensive survey

Multimodal Machine Learning: A Survey and Taxonomy

Multimodal Intelligence: Representation Learning, Information Fusion, and Applications

Vision-Language Models for Vision Tasks: A Survey