Abstract: With the urgent demand for generalized deep models, many pre-trained big models have been proposed, such as BERT, ViT, and GPT. Inspired by the success of these models in single domains (such as computer vision and natural language processing), multi-modal pre-trained big models have also drawn increasing attention in recent years. In this work, we give a comprehensive survey of these models and hope this paper can provide new insights and help new researchers track the most cutting-edge works. Specifically, we first introduce the background of multi-modal pre-training by reviewing conventional deep learning and pre-training works in natural language processing, computer vision, and speech. Then, we introduce the task definition, key challenges, and advantages of multi-modal pre-training models (MM-PTMs), and discuss MM-PTMs with a focus on data, objectives, network architectures, and knowledge-enhanced pre-training. After that, we introduce the downstream tasks used for the validation of large-scale MM-PTMs, including generative, classification, and regression tasks. We also give visualization and analysis of the model parameters and results on representative downstream tasks. Finally, we point out possible research directions for this topic that may benefit future works. In addition, we maintain a continuously updated paper list for large-scale pre-trained multi-modal big models: https://github.com/wangxiao5791509/MultiModal_BigModels_Survey
This paper has been published in the journal Machine Intelligence Research (MIR), vol. 20, no. 4, pp. 447-482, 2023: https://link.springer.com/article/10.1007/s11633-022-1410-8 (DOI: https://doi.org/10.1007/s11633-022-1410-8)

What problem does this paper attempt to address?

The main aim of this paper is to address the application and development issues of multimodal pre-training models (MM-PTMs) on large-scale datasets. Specifically:

1. **Review of Background and Progress**: The paper first reviews the development history of traditional deep learning and unimodal pre-training (in natural language processing, computer vision, and speech processing), pointing out that the success of large pre-trained models in these fields has prompted researchers to explore multimodal fusion.
2. **Definition and Challenges of Multimodal Pre-training**: The paper provides a detailed definition of multimodal pre-training tasks and lists several key challenges: the acquisition and cleaning of large-scale multimodal data, the design of network architectures suitable for large-scale multimodal pre-training, the design of effective pre-training objectives, the large-scale computational resources required, and parameter tuning techniques.
3. **Analysis of Advantages**: Compared with large single-modality models, multimodal pre-training models are better suited to practical scenarios such as multimodal collaborative generation, modality completion, and cross-domain retrieval. Using information from several modalities to compensate for the shortcomings of any single one can improve model performance.
4. **Introduction of Datasets**: The paper also summarizes a series of large-scale datasets used for multimodal pre-training, covering combinations such as image-text pairs and video-text pairs, to help readers quickly understand the data sources available for pre-training.

In summary, this paper systematically reviews the current research status of multimodal pre-training models, aiming to give researchers in this field a comprehensive framework for understanding it and to indicate possible future research directions.

Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey

Pre-Trained Models: Past, Present and Future

Multimodal Pretraining from Monolingual to Multilingual

A Survey of Multimodal Large Language Model from A Data-centric Perspective

Multimodal Large Language Models: A Survey

Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications

MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

M6: Multi-Modality-to-Multi-Modality Multitask Mega-transformer for Unified Pretraining

Pre-trained models for natural language processing: A survey

How Vision-Language Tasks Benefit from Large Pre-trained Models: A Survey

A Survey on Multimodal Large Language Models

Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages

A Survey of Vision-Language Pre-Trained Models

Personalized Multimodal Large Language Models: A Survey

A Review of Multi-Modal Large Language and Vision Models

Multimodal Learning with Transformers: A Survey

Efficient Multimodal Large Language Models: A Survey

M6: A Chinese Multimodal Pretrainer

Research Progress on Vision-Language Multimodal Pretraining Model Technology

A Comprehensive Review of Multimodal Large Language Models: Performance and Challenges Across Different Tasks