Abstract: Multimodality representation learning, the technique of learning to embed information from different modalities together with their correlations, has achieved remarkable success in a variety of applications, such as Visual Question Answering (VQA), Natural Language for Visual Reasoning (NLVR), and Vision Language Retrieval (VLR). In these applications, cross-modal interaction and complementary information from different modalities are crucial for advanced models to perform any multimodal task optimally, e.g., to understand, recognize, retrieve, or generate. Researchers have proposed diverse methods to address these tasks, and variants of transformer-based architectures have performed extraordinarily well on multiple modalities. This survey presents a comprehensive review of the literature on the evolution and enhancement of deep learning multimodal architectures that handle textual, visual, and audio features for diverse cross-modal and modern multimodal tasks. It summarizes (i) recent task-specific deep learning methodologies, (ii) pretraining types and multimodal pretraining objectives, (iii) the progression from state-of-the-art pretrained multimodal approaches to unifying architectures, and (iv) multimodal task categories and possible future improvements that can be devised for better multimodal learning. Moreover, we provide a dataset section for new researchers that covers most of the benchmarks for pretraining and finetuning. Finally, major challenges, gaps, and potential research topics are explored. A constantly updated paper list related to our survey is maintained at https://github.com/marslanm/multimodality-representation-learning.

Multimodality in meta-learning: A comprehensive survey

Deep Vision Multimodal Learning: Methodology, Benchmark, and Trend

Revisit Multimodal Meta-Learning through the Lens of Multi-Task Learning

Multimodal Methods for Analyzing Learning and Training Environments: A Systematic Literature Review

VL-Meta: Vision-Language Models for Multimodal Meta-Learning

Recent Advances of Multimodal Continual Learning: A Comprehensive Survey

Multimodal meta-learning through meta-learned task representations

Meta Learning to Bridge Vision and Language Models for Multimodal Few-Shot Learning

A survey of multimodal federated learning: background, applications, and perspectives

Multimodal Machine Learning: A Survey and Taxonomy

Surveying the MLLM Landscape: A Meta-Review of Current Surveys

A Survey of Multimodal Large Language Model from A Data-centric Perspective

Deep Multimodal Learning with Missing Modality: A Survey

Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions

Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications

A Comprehensive Overview and Survey of Recent Advances in Meta-Learning

Meta-Learning in Neural Networks: A Survey

Self-Supervised Multimodal Learning: A Survey

A Survey of Multimodal Composite Editing and Retrieval