Abstract: Recent Vision-Language Pre-trained (VLP) models based on dual encoders have attracted extensive attention from academia and industry due to their superior performance on various cross-modal tasks and their high computational efficiency. They learn cross-modal representations using contrastive learning on image-text pairs; however, the resulting inter-modal correlations rely on only a single view for each modality. In fact, an image or a text has many potential views, just as humans can capture a real-world scene through diverse descriptions or photos. In this paper, we propose ERNIE-ViL 2.0, a Multi-View Contrastive learning framework that builds intra-modal and inter-modal correlations between diverse views simultaneously, aiming to learn a more robust cross-modal representation. Specifically, we construct multiple views within each modality to learn the intra-modal correlation and thereby enhance the single-modal representation. Besides the inherent visual/textual views, we construct sequences of object tags as a special textual view to narrow the cross-modal semantic gap on noisy image-text pairs. Pre-trained on 29M image-text pairs from publicly available datasets, ERNIE-ViL 2.0 achieves competitive results on English cross-modal retrieval. Additionally, to generalize our method to Chinese cross-modal tasks, we train ERNIE-ViL 2.0 on pre-training data scaled up to 1.5B Chinese image-text pairs, resulting in significant improvements over previous SOTA results on Chinese cross-modal retrieval. We release our pre-trained models at https://github.com/PaddlePaddle/ERNIE.
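
The multi-view idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes each view has already been encoded into a fixed-size embedding, uses a symmetric InfoNCE loss, and averages that loss uniformly over every pair of views, so both intra-modal pairs (for example, two image views) and inter-modal pairs (for example, an image view and the object-tag view) contribute. The view names, temperature value, and uniform pair weighting are all hypothetical choices made for illustration.

```python
import itertools
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss; row i of `a` is the positive for row i of `b`."""
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature

    def cross_entropy_on_diagonal(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the a->b and b->a directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))


def multi_view_loss(views, temperature=0.07):
    """Average InfoNCE over all view pairs (assumed uniform weighting)."""
    pairs = list(itertools.combinations(views, 2))
    total = sum(info_nce(views[p], views[q], temperature) for p, q in pairs)
    return total / len(pairs)


# Hypothetical batch: 8 samples, 16-dim embeddings, four views per sample.
rng = np.random.default_rng(0)
views = {
    "image_view_1": rng.normal(size=(8, 16)),
    "image_view_2": rng.normal(size=(8, 16)),
    "text_caption": rng.normal(size=(8, 16)),
    "text_object_tags": rng.normal(size=(8, 16)),
}
loss = multi_view_loss(views)
print(f"multi-view contrastive loss: {loss:.4f}")
```

With four views, the loss covers six pairs: one image-image pair, one text-text pair, and four image-text pairs. A single-view dual encoder corresponds to keeping only one image-text pair.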

Recognizing Cross-Lingual Textual Entailment with Co-Training Using Similarity and Difference Views

ECNUCS: Recognizing Cross-lingual Textual Entailment Using Multiple Text Similarity and Text Difference Measures.

Cross-Lingual Text Image Recognition Via Multi-Task Sequence to Sequence Learning.

Cross-Lingual Entity Matching for Heterogeneous Online Wikis.

Co-training for Cross-Lingual Sentiment Classification

Knowledge-Enhanced Bilingual Textual Representations for Cross-Lingual Semantic Textual Similarity

ECNU: Leveraging on Ensemble of Heterogeneous Features and Information Enrichment for Cross Level Semantic Similarity Estimation

Chinese Textual Entailment Recognition Based on Syntactic Tree Clipping

Cross-View Language Modeling: Towards Unified Cross-Lingual Cross-Modal Pre-training

From Alignment to Entailment: A Unified Textual Entailment Framework for Entity Alignment

ECNU: One Stone Two Birds: Ensemble of Heterogenous Measures for Semantic Relatedness and Textual Entailment

Co-training Embeddings of Knowledge Graphs and Entity Descriptions for Cross-lingual Entity Alignment

Towards Multi-Sense Cross-Lingual Alignment of Contextual Embeddings

Cross-Language Similar Document Retrieval

ECNU at SemEval-2017 Task 1: Leverage Kernel-based Traditional NLP Features and Neural Networks to Build a Universal Model for Multilingual and Cross-lingual Semantic Textual Similarity

Unified Training for Cross-Lingual Abstractive Summarization by Aligning Parallel Machine Translation Pairs

ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training

ECNU: Using Traditional Similarity Measurements and Word Embedding for Semantic Textual Similarity Estimation.

HC$^2$L: Hybrid and Cooperative Contrastive Learning for Cross-lingual Spoken Language Understanding

Bilingual co-training for sentiment classification of Chinese product reviews

Hessian-regularized Co-Training for Social Activity Recognition.