Abstract: Multi-modal pretraining for learning high-level multi-modal representations is a further step towards deep learning and artificial intelligence. In this work, we propose a novel model, InterBERT (BERT for Interaction), the first model in our M6 (MultiModality-to-MultiModality Multitask Mega-transformer) series of multi-modal pretraining methods. The model has a strong capability for modeling the interaction between the information flows of different modalities. Its single-stream interaction module effectively processes information from multiple modalities, and the two-stream module on top preserves the independence of each modality to avoid performance degradation on single-modal tasks. We pretrain the model with three tasks: masked segment modeling (MSM), masked region modeling (MRM), and image-text matching (ITM), and finetune it on a series of vision-and-language downstream tasks. Experimental results demonstrate that InterBERT outperforms a series of strong baselines, including the most recent multi-modal pretraining methods. Our analysis shows that MSM and MRM are effective for pretraining and that our method achieves performance comparable to BERT on single-modal tasks. In addition, we propose a large-scale dataset for multi-modal pretraining in Chinese and develop the Chinese InterBERT, the first Chinese multi-modal pretrained model. We pretrain the Chinese InterBERT on our proposed dataset of 3.1M image-text pairs from mobile Taobao, the largest Chinese e-commerce platform. We finetune the model for text-based image retrieval, and we recently deployed it online for topic-based recommendation.

CMV-BERT: Contrastive Multi-Vocab Pretraining of BERT

MVP-BERT: Multi-Vocab Pre-training for Chinese BERT

Dense Contrastive Visual-Linguistic Pretraining

Contrastive Visual-Linguistic Pretraining

MVP-BERT: Redesigning Vocabularies for Chinese BERT and Multi-Vocab Pretraining

CAVL: Learning Contrastive and Adaptive Representations of Vision and Language

Multimodal Contrastive Training for Visual Representation Learning

Multimodal Pretraining from Monolingual to Multilingual

Multi-Level Contrastive Learning for Cross-Lingual Alignment

InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining

VL-BERT: Pre-training of Generic Visual-Linguistic Representations

Improving Pre-Trained Multilingual Model with Vocabulary Expansion

RC3: Regularized Contrastive Cross-lingual Cross-modal Pre-training

Enhancing Biomedical Multi-modal Representation Learning with Multi-scale Pre-training and Perturbed Report Discrimination

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

C2BERT - Cross-contrast BERT for Chinese Biomedical Sentence Representation

Improving Cross-Modal Understanding in Visual Dialog via Contrastive Learning

CAT-BERT: A Context-Aware Transferable BERT Model for Multi-turn Machine Reading Comprehension

Pretraining without wordpieces: learning over a vocabulary of millions of words

CoCo-BERT: Improving Video-Language Pre-training with Contrastive Cross-modal Matching and Denoising

Open-Vocabulary Panoptic Segmentation Using BERT Pre-Training of Vision-Language Multiway Transformer Model