Abstract: Pre-trained models have brought significant improvements to many NLP tasks and have been extensively analyzed, but little is known about the effect of fine-tuning on specific tasks. Intuitively, one might expect that a pre-trained model already learns semantic representations of words (e.g., synonyms are closer to each other) and that fine-tuning further improves capabilities that require more complicated reasoning (e.g., coreference resolution and entity boundary detection). However, verifying these claims analytically and quantitatively is challenging, and few works focus on this topic. In this paper, inspired by the observation that most probing tasks involve identifying matched pairs of phrases (e.g., coreference requires matching an entity and a pronoun), we propose a pairwise probe to understand BERT fine-tuning on the machine reading comprehension (MRC) task. Specifically, we identify five phenomena in MRC. Using the corresponding pairwise probing tasks, we compare the performance of each layer's hidden representations in pre-trained and fine-tuned BERT. The proposed pairwise probe alleviates the problem of distraction from inaccurate model training and enables a robust, quantitative comparison. Our experimental analysis leads to highly confident conclusions: (1) Fine-tuning has little effect on fundamental, low-level information and general semantic tasks. (2) For specific abilities required by downstream tasks, fine-tuned BERT is better than pre-trained BERT, and such gaps are obvious after the fifth layer.
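The layer-by-layer comparison described in the abstract can be illustrated with a small sketch. The abstract does not specify how the pairwise probe scores a pair, so the scoring rule here (cosine similarity between an anchor phrase and a matched versus an unmatched phrase) is an assumption made only for illustration, and the names `pairwise_probe_score` and `layerwise_scores` are hypothetical. The inputs are synthetic arrays standing in for per-layer phrase representations, not real BERT outputs.

```python
import numpy as np


def cosine(a, b):
    """Row-wise cosine similarity between two (n, d) arrays."""
    num = np.sum(a * b, axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    return num / den


def pairwise_probe_score(anchor, matched, unmatched):
    """Fraction of examples whose anchor is closer to the matched phrase
    than to the unmatched one (assumed scoring rule, not the paper's)."""
    return float(np.mean(cosine(anchor, matched) > cosine(anchor, unmatched)))


def layerwise_scores(layers):
    """Apply the probe to each layer; `layers` is a list of
    (anchor, matched, unmatched) triples, one triple per layer."""
    return [pairwise_probe_score(a, m, u) for a, m, u in layers]


# Synthetic stand-in for two models with 12 layers each (hypothetical data).
rng = np.random.default_rng(0)


def fake_layers(noise):
    layers = []
    for _ in range(12):
        anchor = rng.normal(size=(64, 32))
        matched = anchor + noise * rng.normal(size=(64, 32))
        unmatched = rng.normal(size=(64, 32))
        layers.append((anchor, matched, unmatched))
    return layers


pretrained = layerwise_scores(fake_layers(noise=2.0))
finetuned = layerwise_scores(fake_layers(noise=0.5))
for i, (p, f) in enumerate(zip(pretrained, finetuned), start=1):
    print(f"layer {i:2d}: pre-trained={p:.2f} fine-tuned={f:.2f}")
```

Because the score needs no trained classifier, a comparison of this kind does not depend on how well a probing model happens to train, which is in the spirit of the robustness the abstract claims for its pairwise probe.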

Discourse Probing of Pretrained Language Models

Probing Pretrained Language Models for Lexical Semantics

How Does Pretraining Improve Discourse-Aware Translation?

Towards Understanding Large-Scale Discourse Structures in Pre-Trained and Fine-Tuned Language Models

A Matter of Framing: The Impact of Linguistic Formalism on Probing Results

Exploring Multilingual Probing in Large Language Models: A Cross-Language Analysis

Probing Linguistic Information For Logical Inference In Pre-trained Language Models

Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling

Latent Causal Probing: A Formal Perspective on Probing with Causal Models of Data

Probing What Different NLP Tasks Teach Machines about Function Word Comprehension

MLPs Compass: What is learned when MLPs are combined with PLMs?

Pretraining with Contrastive Sentence Objectives Improves Discourse Performance of Language Models

Rewire-then-Probe: A Contrastive Recipe for Probing Biomedical Knowledge of Pre-trained Language Models

Probing via Prompting

Topic Aware Probing: From Sentence Length Prediction to Idiom Identification how reliant are Neural Language Models on Topic?

Can We Use Probing to Better Understand Fine-tuning and Knowledge Distillation of the BERT NLU?

Probing the Probing Paradigm: Does Probing Accuracy Entail Task Relevance?

Multi-Source Probing for Open-Domain Conversational Understanding

Universal and Independent: Multilingual Probing Framework for Exhaustive Model Interpretation and Evaluation

A Pairwise Probe for Understanding BERT Fine-Tuning on Machine Reading Comprehension

A Side-by-side Comparison of Transformers for English Implicit Discourse Relation Classification