Abstract: Recent Transformer-based large-scale pre-trained models have revolutionized vision-and-language (V+L) research. Models such as ViLBERT, LXMERT and UNITER have significantly lifted the state of the art across a wide range of V+L benchmarks with joint image-text pre-training. However, little is known about the inner mechanisms that underlie their impressive success. To reveal the secrets behind the scenes of these powerful models, we present VALUE (Vision-And-Language Understanding Evaluation), a set of meticulously designed probing tasks (e.g., Visual Coreference Resolution, Visual Relation Detection, Linguistic Probing Tasks) generalizable to standard pre-trained V+L models, aiming to decipher the inner workings of multimodal pre-training (e.g., the implicit knowledge garnered in individual attention heads, the inherent cross-modal alignment learned through contextualized multimodal embeddings). Through extensive analysis of each archetypal model architecture via these probing tasks, our key observations are: (i) Pre-trained models exhibit a propensity for attending over text rather than images during inference. (ii) There exists a subset of attention heads that are tailored for capturing cross-modal interactions. (iii) The learned attention matrices in pre-trained models demonstrate patterns coherent with the latent alignment between image regions and textual words. (iv) Plotted attention patterns reveal visually interpretable relations among image regions. (v) Pure linguistic knowledge is also effectively encoded in the attention heads. These are valuable insights serving to guide future work towards designing better model architectures and objectives for multimodal pre-training.
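To make observation (i) concrete, the following is a minimal sketch of one way to quantify how much a single attention head attends to text versus image regions. It is not the paper's probing procedure: the function name `text_attention_share`, the toy attention matrix, and the modality mask are all hypothetical, and the measure (average share of each query's attention mass that lands on text tokens) is only an assumed illustration.

```python
def text_attention_share(attn, is_text):
    """Average fraction of attention mass that each query places on text tokens.

    attn:    list of rows; attn[i][j] is the attention weight from token i
             to token j (hypothetical input, e.g. one head after softmax).
    is_text: list of bools; True where token j is a text token,
             False where it is an image region.
    """
    if not attn:
        raise ValueError("attention matrix is empty")
    shares = []
    for row in attn:
        if len(row) != len(is_text):
            raise ValueError("row length must match the modality mask")
        total = sum(row)
        # Attention mass this query puts on text tokens only.
        text_mass = sum(w for w, t in zip(row, is_text) if t)
        # Normalize by the row total so unnormalized rows are also handled.
        shares.append(text_mass / total)
    return sum(shares) / len(shares)


# Toy single-head attention over 2 text tokens followed by 2 image regions.
# These numbers are made up for illustration, not taken from any model.
toy_attn = [
    [0.50, 0.25, 0.125, 0.125],
    [0.25, 0.50, 0.125, 0.125],
    [0.25, 0.25, 0.25, 0.25],
    [0.125, 0.125, 0.25, 0.50],
]
toy_mask = [True, True, False, False]

share = text_attention_share(toy_attn, toy_mask)
print(f"share of attention on text tokens: {share:.4f}")
```

Under this assumed measure, a value above 0.5 would indicate that the head places more of its attention on text than on image regions. Comparing the value across heads and layers is one possible way to look for the text preference described in the abstract.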

Probing Pretrained Language Models for Lexical Semantics

Probing Context Localization of Polysemous Words in Pre-trained Language Model Sub-Layers

Probing Cross-Lingual Lexical Knowledge from Multilingual Sentence Encoders

Topic Aware Probing: From Sentence Length Prediction to Idiom Identification how reliant are Neural Language Models on Topic?

Probing Language Identity Encoded in Pre-Trained Multilingual Models: a Typological View

A Latent-Variable Model for Intrinsic Probing

Probing Linguistic Information For Logical Inference In Pre-trained Language Models

A Matter of Framing: The Impact of Linguistic Formalism on Probing Results

Exploring Multilingual Probing in Large Language Models: A Cross-Language Analysis

Probing Language Models on Their Knowledge Source

How Large Language Models Encode Context Knowledge? A Layer-Wise Probing Study

Discourse Probing of Pretrained Language Models

Quantifying the Contextualization of Word Representations with Semantic Class Probing

Universal and Independent: Multilingual Probing Framework for Exhaustive Model Interpretation and Evaluation

Probing What Different NLP Tasks Teach Machines about Function Word Comprehension

Subspace Chronicles: How Linguistic Information Emerges, Shifts and Interacts during Language Model Training

Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models

Interpreting Language Models Through Knowledge Graph Extraction

Probing the Category of Verbal Aspect in Transformer Language Models

Probing the Probing Paradigm: Does Probing Accuracy Entail Task Relevance?

What does it mean to be language-agnostic? Probing multilingual sentence encoders for typological properties