Abstract: Recent Transformer-based large-scale pre-trained models have revolutionized vision-and-language (V+L) research. Models such as ViLBERT, LXMERT and UNITER have significantly advanced the state of the art across a wide range of V+L benchmarks with joint image-text pre-training. However, little is known about the inner mechanisms that underlie their impressive success. To reveal the secrets behind the scene of these powerful models, we present VALUE (Vision-And-Language Understanding Evaluation), a set of meticulously designed probing tasks (e.g., Visual Coreference Resolution, Visual Relation Detection, Linguistic Probing Tasks) generalizable to standard pre-trained V+L models, aiming to decipher the inner workings of multimodal pre-training (e.g., the implicit knowledge garnered in individual attention heads, the inherent cross-modal alignment learned through contextualized multimodal embeddings). Through extensive analysis of each archetypal model architecture via these probing tasks, our key observations are: (i) Pre-trained models exhibit a propensity for attending over text rather than images during inference. (ii) There exists a subset of attention heads that are tailored for capturing cross-modal interactions. (iii) The learned attention matrices in pre-trained models exhibit patterns coherent with the latent alignment between image regions and textual words. (iv) Plotted attention patterns reveal visually interpretable relations among image regions. (v) Pure linguistic knowledge is also effectively encoded in the attention heads. These are valuable insights serving to guide future work towards designing better model architectures and objectives for multimodal pre-training.
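Observation (i) concerns how attention mass is split between text tokens and image-region tokens. The following is a minimal sketch of one way such a split could be measured for a single attention head. It is not the probing procedure used in VALUE: the function name `modality_attention_share`, the row-normalized matrix layout, and the toy numbers are all assumptions made for illustration.

```python
from typing import Dict, Sequence


def modality_attention_share(
    attention: Sequence[Sequence[float]],
    is_text: Sequence[bool],
) -> Dict[str, float]:
    """Share of total attention mass placed on text vs. image keys.

    attention[i][j] is the weight query token i assigns to key token j
    (assumed layout: one row per query, each row summing to 1).
    is_text[j] is True for a text token, False for an image-region token.
    """
    if not attention:
        raise ValueError("attention matrix is empty")

    text_mass = 0.0
    total_mass = 0.0
    for row in attention:
        if len(row) != len(is_text):
            raise ValueError("row length does not match the modality mask")
        for weight, text_flag in zip(row, is_text):
            total_mass += weight
            if text_flag:
                text_mass += weight

    if total_mass == 0.0:
        raise ValueError("attention matrix has no mass")

    text_share = text_mass / total_mass
    return {"text": text_share, "image": 1.0 - text_share}


# Hypothetical toy head: two text tokens followed by one image region.
toy_attention = [
    [0.5, 0.3, 0.2],
    [0.4, 0.4, 0.2],
    [0.6, 0.3, 0.1],
]
toy_mask = [True, True, False]
share = modality_attention_share(toy_attention, toy_mask)
```

In this toy head most of the mass lands on the text keys, so `share["text"]` exceeds `share["image"]`. A real analysis would average such shares over heads, layers, and many image-text pairs, and would need to account for the differing numbers of text and image tokens.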

Inspecting Unification of Encoding and Matching with Transformer: A Case Study of Machine Reading Comprehension

Feeding What You Need by Understanding What You Learned

Exploiting Diverse Information in Pre-Trained Language Model for Multi-Choice Machine Reading Comprehension

A Sentence Quality Evaluation Framework for Machine Reading Comprehension Incorporating Pre-trained Language Model

Hybrid Embedding and Joint Training of Stacked Encoder for Opinion Question Machine Reading Comprehension

MGRC: an End-to-End Multigranularity Reading Comprehension Model for Question Answering

CAT-BERT: A Context-Aware Transferable BERT Model for Multi-turn Machine Reading Comprehension

A Comprehensive Verification Of Transformer In Text Classification

R-Trans: RNN Transformer Network for Chinese Machine Reading Comprehension

Understanding Attention in Machine Reading Comprehension

Enhanced Pre-Trained Transformer with Aligned Attention Map for Text Matching

Multi-level network based on transformer encoder for fine-grained image–text matching

Machine Reading Comprehension: The Role of Contextualized Language Models and Beyond

Investigation on task effect analysis and optimization strategy of multimodal large model based on Transformers architecture for various languages

TRANS-BLSTM: Transformer with Bidirectional LSTM for Language Understanding

Modeling Bilingual Sentence Processing: Evaluating RNN and Transformer Architectures for Cross-Language Structural Priming

Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models

DUMA: Reading Comprehension with Transposition Thinking

Enhancing Pre-Trained Language Representations with Rich Knowledge for Machine Reading Comprehension

Multi-Unit Transformers for Neural Machine Translation

The Bottom-up Evolution of Representations in the Transformer: A Study with Machine Translation and Language Modeling Objectives