Abstract: Pre-training language models (LMs) on large-scale unlabeled text data makes it much easier for them to achieve exceptional downstream performance than for counterparts trained directly on the downstream tasks. In this work, we study what specific traits in the pre-training data, other than the semantics, make a pre-trained LM superior to its counterparts trained from scratch on downstream tasks. We propose to use artificially constructed datasets as the pre-training data to exclude the effect of semantics and to further control what characteristics the pre-training corpora have. By fine-tuning the pre-trained models on the GLUE benchmark, we can learn how beneficial it is to transfer knowledge from a model trained on a dataset possessing a specific trait. We define and discuss three characteristics of the artificial dataset: 1) whether the token uni-gram or bi-gram distribution matches between pre-training and downstream fine-tuning, 2) the presence of explicit dependencies among the tokens in a sequence, and 3) the length of the implicit dependencies among the tokens in a sequence. Our experiments show that explicit dependencies in the sequences of the pre-training data are critical to downstream performance. Our results also reveal that models achieve better downstream performance when pre-trained on a dataset with a longer range of implicit dependencies. Based on our analysis, we find that models pre-trained on artificial datasets are prone to learning spurious correlations in downstream tasks. Our work reveals that even if LMs are not pre-trained on natural language, they still gain transferability to certain human language downstream tasks once they learn to model the token dependencies in the sequences. This result helps us understand the exceptional transferability of pre-trained LMs.
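
To make the first two characteristics concrete, the following is a minimal sketch of how such artificial pre-training sequences might be generated. It is an illustration under stated assumptions, not the construction used in the paper: the function names `unigram_sequences` and `paired_sequences`, their parameters, and the "every sampled token appears twice" rule for explicit dependencies are all hypothetical choices made for this example.

```python
import random
from collections import Counter


def unigram_sequences(reference_tokens, n_seqs, seq_len, seed=0):
    """Sample tokens i.i.d. from the empirical uni-gram distribution of
    `reference_tokens` (hypothetical stand-in for a downstream corpus).
    Tokens in a sequence are independent of each other."""
    rng = random.Random(seed)
    counts = Counter(reference_tokens)
    vocab = sorted(counts)
    weights = [counts[tok] for tok in vocab]
    return [rng.choices(vocab, weights=weights, k=seq_len) for _ in range(n_seqs)]


def paired_sequences(vocab, n_seqs, n_pairs, seed=0):
    """Build sequences with an explicit dependency (an assumed rule):
    each sampled token is inserted twice at random positions, so every
    token occurrence has a matching occurrence in the same sequence."""
    rng = random.Random(seed)
    seqs = []
    for _ in range(n_seqs):
        seq = []
        for _ in range(n_pairs):
            tok = rng.choice(vocab)
            # Insert two copies of the token at independent random positions.
            seq.insert(rng.randint(0, len(seq)), tok)
            seq.insert(rng.randint(0, len(seq)), tok)
        seqs.append(seq)
    return seqs


if __name__ == "__main__":
    print(unigram_sequences(["a", "a", "b", "c"], n_seqs=2, seq_len=6))
    print(paired_sequences(list(range(50)), n_seqs=2, n_pairs=4))
```

In this sketch, the first generator controls only the token frequency distribution, while the second adds a dependency between positions that a model can learn to predict. Comparing models pre-trained on each kind of data is the sort of contrast the abstract describes.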

Is BERT a Cross-Disciplinary Knowledge Learner? A Surprising Finding of Pre-trained Models' Transferability

Identifying the Limits of Cross-Domain Knowledge Transfer for Pretrained Models

On the Transferability of Pre-trained Language Models: A Study from Artificial Datasets

Can Fine-tuning Pre-trained Models Lead to Perfect NLP? A Study of the Generalizability of Relation Extraction

A Study of Cross-Lingual Ability and Language-specific Information in Multilingual BERT

Story Ending Prediction by Transferable BERT

TransBERT: A Three-Stage Pre-training Technology for Story-Ending Prediction

Towards Transfer Learning for End-to-End Speech Synthesis from Deep Pre-Trained Language Models

Fantastic Gains and Where to Find Them: On the Existence and Prospect of General Knowledge Transfer between Any Pretrained Model

What makes multilingual BERT multilingual?

Exploring and Predicting Transferability across NLP Tasks

Investigating Pre-trained Language Models on Cross-Domain Datasets, a Step Closer to General AI

Commonsense Knowledge Transfer for Pre-trained Language Models

A Balanced Data Approach for Evaluating Cross-Lingual Transfer: Mapping the Linguistic Blood Bank

Languages You Know Influence Those You Learn: Impact of Language Characteristics on Multi-Lingual Text-to-Text Transfer

Interpreting Language Models Through Knowledge Graph Extraction

Can linguists better understand DNA?

Cross-Linguistic Syntactic Difference in Multilingual BERT: How Good is It and How Does It Affect Transfer?

Rethinking Two Consensuses of the Transferability in Deep Learning

K-BERT: Enabling Language Representation with Knowledge Graph

Measuring Cross-lingual Transfer in Bytes