Abstract: Multi-modal pretraining for learning high-level multi-modal representations is a further step towards deep learning and artificial intelligence. In this work, we propose a novel model, InterBERT (BERT for Interaction), which is the first model in our series of multimodal pretraining methods M6 (MultiModality-to-MultiModality Multitask Mega-transformer). The model has a strong capability for modeling the interaction between the information flows of different modalities. The single-stream interaction module effectively processes information from multiple modalities, and the two-stream module on top preserves the independence of each modality to avoid performance degradation in single-modal tasks. We pretrain the model with three pretraining tasks: masked segment modeling (MSM), masked region modeling (MRM), and image-text matching (ITM). We then finetune the model on a series of vision-and-language downstream tasks. Experimental results demonstrate that InterBERT outperforms a series of strong baselines, including the most recent multi-modal pretraining methods. The analysis shows that MSM and MRM are effective for pretraining and that our method achieves performance comparable to BERT in single-modal tasks. In addition, we propose a large-scale dataset for multi-modal pretraining in Chinese, and we develop the Chinese InterBERT, which is the first Chinese multi-modal pretrained model. We pretrain the Chinese InterBERT on our proposed dataset of 3.1M image-text pairs from mobile Taobao, the largest Chinese e-commerce platform. We finetune the model for text-based image retrieval, and we recently deployed the model online for topic-based recommendation.
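
To make the masked segment modeling (MSM) objective named in the abstract more concrete, the following is a minimal sketch of segment-level masking. The abstract does not say how segments are selected, so this sketch assumes that a segment is one contiguous span of text tokens replaced by a `[MASK]` placeholder. The function name `mask_segment`, the `span_len` parameter, and the `[MASK]` string are hypothetical choices for illustration, not the authors' implementation.

```python
import random

MASK = "[MASK]"  # assumed placeholder token, not taken from the paper


def mask_segment(tokens, span_len=3, rng=None):
    """Mask one contiguous span of tokens (illustrative sketch only).

    Returns (masked_tokens, labels). labels[i] holds the original token
    at each masked position and None elsewhere, so a model would be
    trained to predict only the masked span.
    """
    rng = rng or random.Random()
    if not tokens:
        return [], []
    # Clamp the span length so it always fits inside the sequence.
    span_len = max(1, min(span_len, len(tokens)))
    # Pick a start index such that the whole span stays in range.
    start = rng.randrange(len(tokens) - span_len + 1)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i in range(start, start + span_len):
        labels[i] = tokens[i]
        masked[i] = MASK
    return masked, labels


if __name__ == "__main__":
    caption = "a red dress with long sleeves".split()
    masked, labels = mask_segment(caption, span_len=2, rng=random.Random(0))
    print(masked)
    print(labels)
```

Under the same assumption, masked region modeling (MRM) would apply the analogous operation to image region features instead of text tokens. The span-selection policy here (uniform start position, fixed length) is only one plausible choice.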

Image As a Foreign Language: BEiT Pretraining for Vision and Vision-Language Tasks

Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks

VL-BEiT: Generative Vision-Language Pretraining

Vision-language pre-training via modal interaction

Multimodal Pretraining from Monolingual to Multilingual

Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks

Vision-Language Pre-Training: Basics, Recent Advances, and Future Trends

OmniVL: One Foundation Model for Image-Language and Video-Language Tasks

Multimodal Autoregressive Pre-training of Large Vision Encoders

UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training

MVP: Multimodality-Guided Visual Pre-training

Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone

Vision-Language Pre-Training for Boosting Scene Text Detectors

Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization

Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training

InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining

Leveraging per Image-Token Consistency for Vision-Language Pre-training

EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE

Unified Vision-Language Pre-Training for Image Captioning and VQA

Research on Image Captioning Based on Vision-language Pre-trained Models