Language Model Pre-training with Linguistically Motivated Curriculum Learning

Yile Wang, Yue Zhang, Peng Li, Yang Liu
2023-01-01
Abstract: Pre-training serves as the foundation of recent NLP models, in which a language modeling task is performed over large text corpora. Data has been shown to affect the quality of pre-training, and curricula based on sequence length have been investigated. We take a linguistic perspective on the curriculum, in which frequent words are learned first and rare words last. This is achieved by replacing syntactic constituents that contain rare words with their constituent labels. With such syntactic substitutions, a curriculum can be built by gradually introducing words of decreasing frequency. Without modifying the model architecture or introducing extra computational overhead, our data-centric method outperforms vanilla BERT on various downstream benchmarks.
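The substitution idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the tree format, the function names `substitute` and `curriculum_stages`, the bracketed label tokens, and the rule "collapse the lowest constituent that directly contains a rare word" are all assumptions made for illustration.

```python
from collections import Counter

# Assumed tree format: a node is either a word (str)
# or a tuple (label, [children]).


def substitute(tree, frequent):
    """Flatten a parse tree into tokens, replacing any constituent that
    directly contains a rare word (one not in `frequent`) by its label."""
    if isinstance(tree, str):
        return [tree]
    label, children = tree
    # Assumption: collapse the whole constituent if a direct child is rare.
    if any(isinstance(c, str) and c not in frequent for c in children):
        return [f"[{label}]"]
    tokens = []
    for child in children:
        tokens.extend(substitute(child, frequent))
    return tokens


def curriculum_stages(tree, counts, vocab_sizes):
    """Produce one token sequence per stage; each stage admits the
    `size` most frequent words, so rarer words appear in later stages."""
    ranked = [word for word, _ in counts.most_common()]
    return [substitute(tree, set(ranked[:size])) for size in vocab_sizes]


# Hypothetical toy sentence and corpus counts.
tree = ("S", [("NP", ["the", "cat"]),
              ("VP", ["chased", ("NP", ["the", "ocelot"])])])
counts = Counter({"the": 100, "cat": 20, "chased": 10, "ocelot": 1})

for tokens in curriculum_stages(tree, counts, vocab_sizes=[1, 3, 4]):
    print(" ".join(tokens))
```

In this toy setup, the earliest stage exposes only constituent labels around the most frequent word, and the final stage recovers the original sentence. A real pipeline would obtain trees from a constituency parser and counts from the pre-training corpus.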