Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey

Xiao Wang, Guangyao Chen, Guangwu Qian, Pengcheng Gao, Xiao-Yong Wei, Yaowei Wang, Yonghong Tian, Wen Gao
2024-04-10
Abstract: With the urgent demand for generalized deep models, many pre-trained big models have been proposed, such as BERT, ViT, and GPT. Inspired by the success of these models in single domains (such as computer vision and natural language processing), multi-modal pre-trained big models have also drawn increasing attention in recent years. In this work, we give a comprehensive survey of these models, and we hope this paper can provide new insights and help new researchers track the most cutting-edge works. Specifically, we first introduce the background of multi-modal pre-training by reviewing conventional deep learning and pre-training works in natural language processing, computer vision, and speech. Then, we introduce the task definition, key challenges, and advantages of multi-modal pre-training models (MM-PTMs), and discuss MM-PTMs with a focus on data, objectives, network architectures, and knowledge-enhanced pre-training. After that, we introduce the downstream tasks used to validate large-scale MM-PTMs, including generative, classification, and regression tasks. We also give visualization and analysis of the model parameters and results on representative downstream tasks. Finally, we point out possible research directions for this topic that may benefit future works. In addition, we maintain a continuously updated paper list for large-scale pre-trained multi-modal big models: https://github.com/wangxiao5791509/MultiModal_BigModels_Survey
This paper has been published in the journal Machine Intelligence Research (MIR), vol. 20, no. 4, pp. 447-482, 2023: https://link.springer.com/article/10.1007/s11633-022-1410-8, DOI: https://doi.org/10.1007/s11633-022-1410-8
Computer Vision and Pattern Recognition, Artificial Intelligence, Multimedia
What problem does this paper attempt to address?
The paper aims to give a systematic overview of large-scale multimodal pre-training models (MM-PTMs): where they come from, how the pre-training task is defined, what makes it difficult, and what data is available. Specifically:

1. **Review of Background and Progress**: The paper first reviews the development of conventional deep learning and unimodal pre-training (in natural language processing, computer vision, and speech processing), pointing out that the success of large pre-trained models in these fields has prompted researchers to explore multimodal fusion.
2. **Definition and Challenges of Multimodal Pre-training**: The paper provides a detailed definition of the multimodal pre-training task and lists several key challenges: acquiring and cleaning large-scale multimodal data, designing network architectures suitable for large-scale multimodal pre-training, designing effective pre-training objectives, securing the required large-scale computational resources, and mastering parameter tuning techniques.
3. **Analysis of Advantages**: Compared with large single-modality models, multimodal pre-training models are better suited to practical scenarios such as multimodal collaborative generation, modality completion, and cross-domain retrieval. Because information from one modality can compensate for the shortcomings of another, model performance can be improved.
4. **Datasets**: The paper also summarizes a series of large-scale datasets used for multimodal pre-training, covering combinations such as image-text pairs and video-text pairs, to help readers quickly identify the data sources available for pre-training.

In summary, the paper systematically reviews the current state of research on multimodal pre-training models, aiming to give researchers in this field a comprehensive framework for understanding it and to indicate possible future research directions.