Abstract: Multi-modal pretraining for learning high-level multi-modal representations is a further step towards deep learning and artificial intelligence. In this work, we propose a novel model, namely InterBERT (BERT for Interaction), which is the first model of our series of multimodal pretraining methods M6 (MultiModality-to-MultiModality Multitask Mega-transformer). The model has a strong capability for modeling the interaction between the information flows of different modalities. The single-stream interaction module effectively processes information from multiple modalities, and the two-stream module on top preserves the independence of each modality to avoid performance degradation in single-modal tasks. We pretrain the model with three pretraining tasks, namely masked segment modeling (MSM), masked region modeling (MRM), and image-text matching (ITM), and finetune the model on a series of vision-and-language downstream tasks. Experimental results demonstrate that InterBERT outperforms a series of strong baselines, including the most recent multi-modal pretraining methods, and the analysis shows that MSM and MRM are effective for pretraining and that our method achieves performance comparable to BERT in single-modal tasks. In addition, we propose a large-scale dataset for multi-modal pretraining in Chinese, and we develop the Chinese InterBERT, which is the first Chinese multi-modal pretrained model. We pretrain the Chinese InterBERT on our proposed dataset of 3.1M image-text pairs from mobile Taobao, the largest Chinese e-commerce platform. We finetune the model for text-based image retrieval, and we recently deployed the model online for topic-based recommendation.
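As a rough illustration of masked segment modeling (MSM), the sketch below masks one contiguous run of text tokens instead of isolated tokens. The abstract only names the task, so the details here are assumptions rather than the authors' procedure: the reading of a "segment" as a contiguous run, the `[MASK]` placeholder, the `max_len` cap, the uniform sampling of segment length and start, and the helper name `mask_segment` are all hypothetical.

```python
import random
from typing import List, Tuple

MASK = "[MASK]"  # assumed placeholder symbol, not taken from the paper


def mask_segment(
    tokens: List[str], max_len: int, rng: random.Random
) -> Tuple[List[str], List[int]]:
    """Replace one contiguous run of tokens with MASK.

    Returns the masked copy and the list of masked positions.
    Segment length and start are sampled uniformly (an assumption).
    """
    if not tokens:
        return [], []
    # randint is inclusive on both ends, so length is in [1, min(max_len, n)].
    length = rng.randint(1, min(max_len, len(tokens)))
    start = rng.randint(0, len(tokens) - length)
    positions = list(range(start, start + length))
    masked = list(tokens)  # copy so the input list is left untouched
    for i in positions:
        masked[i] = MASK
    return masked, positions


tokens = "a red cotton dress with long sleeves".split()
masked, positions = mask_segment(tokens, max_len=3, rng=random.Random(0))
```

A model trained on this objective would be asked to recover the original tokens at `positions` from the remaining text and the paired image. Masked region modeling (MRM) could be sketched the same way over image region features.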

CANCN-BERT: A Joint Pre-Trained Language Model for Classical and Modern Chinese

Pre-Training with Whole Word Masking for Chinese BERT

ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information

Chinese NER Enhanced Based on Different Scale Pre-Training Models

AnchiBERT: A Pre-Trained Model for Ancient Chinese Language Understanding and Generation

Lattice-BERT: Leveraging Multi-Granularity Representations in Chinese Pre-trained Language Models

Align, Mask and Select: A Simple Method for Incorporating Commonsense Knowledge into Language Representation Models

Improving Multi-model Hybrid Chinese Long-text Classification Through BERT Optimisation

A complex network approach to analyse pre-trained language models for ancient Chinese

Pretraining Multi-modal Representations for Chinese NER Task with Cross-Modality Attention

Named Entity Recognition Based on Pre-training Model and Multi-head Attention Mechanism

Revisiting and Advancing Chinese Natural Language Understanding with Accelerated Heterogeneous Knowledge Pre-training

TCBERT: A Technical Report for Chinese Topic Classification BERT

CharBERT: Character-aware Pre-trained Language Model

MPC-BERT: A Pre-Trained Language Model for Multi-Party Conversation Understanding

MVP-BERT: Multi-Vocab Pre-training for Chinese BERT

BERT Meets CTC: New Formulation of End-to-End Speech Recognition with Pre-trained Masked Language Model

NEZHA: Neural Contextualized Representation for Chinese Language Understanding

Towards Making the Most of BERT in Neural Machine Translation

InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining