Abstract: Masked language models (MLMs) conventionally mask 15% of tokens due to the belief that more masking would leave insufficient context to learn good representations; this masking rate has been widely used, regardless of model sizes or masking strategies. In this work, we revisit this important choice of MLM pre-training. We first establish that 15% is not universally optimal, and larger models should adopt a higher masking rate. Specifically, we find that masking 40% outperforms 15% for BERT-large size models on GLUE and SQuAD. Interestingly, an extremely high masking rate of 80% can still preserve 95% fine-tuning performance and most of the accuracy in linguistic probing, challenging the conventional wisdom about the role of the masking rate. We then examine the interplay between masking rates and masking strategies and find that uniform masking requires a higher masking rate compared to sophisticated masking strategies such as span or PMI masking. Finally, we argue that increasing the masking rate has two distinct effects: it leads to more corruption, which makes the prediction task more difficult; it also enables more predictions, which benefits optimization. Using this framework, we revisit BERT's 80-10-10 corruption strategy. Together, our results contribute to a better understanding of MLM pre-training.

What problem does this paper attempt to address?

This paper examines how to choose the masking rate in masked language models (MLMs). MLMs conventionally mask 15% of tokens, on the assumption that a higher rate leaves too little context to learn good representations, while a much lower rate makes training inefficient. Little research has tested whether 15% remains appropriate across model sizes, masking strategies, and optimization schemes. The paper therefore revisits this choice and addresses the following points:

1. **The influence of model size on the optimal masking rate**: The paper first establishes that a 15% masking rate is not universally optimal and that larger models should adopt a higher rate. For BERT-large size models, masking 40% outperforms masking 15% on GLUE and SQuAD.
2. **The effect of extremely high masking rates**: Even at a masking rate of 80%, the model preserves 95% of its fine-tuning performance and most of its accuracy in linguistic probing. This challenges the conventional view that a high masking rate prevents the model from learning effective representations.
3. **The interaction between masking strategies and masking rates**: The paper studies how the masking rate interacts with the masking strategy (uniform masking, span masking, and PMI masking) and finds that uniform masking requires a higher masking rate than the more sophisticated strategies.
4. **Two effects of the masking rate**: The authors argue that increasing the masking rate has two distinct effects. It corrupts more of the input, which makes the prediction task more difficult, and it increases the number of predictions, which benefits optimization. Using this framework, the authors revisit BERT's 80-10-10 corruption strategy.

Overall, through a series of experiments and analyses, the paper clarifies the role of the masking rate in MLM pre-training and argues that it should be chosen with model size and masking strategy in mind rather than fixed at 15%.
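The distinction between corruption and prediction can be made concrete with a small sketch. The code below is an illustration only, not the authors' implementation: the function names `corrupt` and `measure_rates`, the toy vocabulary, and the default arguments are all assumptions introduced here. It selects a fraction of positions uniformly at random and then applies the standard BERT 80-10-10 recipe to them (80% replaced by `[MASK]`, 10% replaced by a random token, 10% left unchanged), so that the fraction of tokens actually altered and the fraction of tokens predicted can be measured separately.

```python
import random

MASK = "[MASK]"


def corrupt(tokens, vocab, mask_rate=0.15, p_mask=0.8, p_random=0.1, seed=0):
    """Uniformly pick positions to predict, then corrupt them 80-10-10 style.

    Hypothetical helper for illustration. Returns the corrupted sequence and
    the sorted list of positions the model would be asked to predict.
    """
    rng = random.Random(seed)
    n_predict = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_predict))
    corrupted = list(tokens)
    for i in positions:
        r = rng.random()
        if r < p_mask:
            corrupted[i] = MASK                # replaced by the mask token
        elif r < p_mask + p_random:
            corrupted[i] = rng.choice(vocab)   # replaced by a random token
        # otherwise the original token is kept, but it is still predicted
    return corrupted, positions


def measure_rates(tokens, corrupted, positions):
    """Return (corruption_rate, prediction_rate) for one sequence."""
    n = len(tokens)
    corruption_rate = sum(a != b for a, b in zip(tokens, corrupted)) / n
    prediction_rate = len(positions) / n
    return corruption_rate, prediction_rate


if __name__ == "__main__":
    toy_tokens = [f"w{i % 20}" for i in range(100)]  # toy data, an assumption
    toy_vocab = sorted(set(toy_tokens))
    for rate in (0.15, 0.40, 0.80):
        out, pos = corrupt(toy_tokens, toy_vocab, mask_rate=rate)
        print(rate, measure_rates(toy_tokens, out, pos))
```

In this sketch the prediction rate always equals the chosen masking rate, while the corruption rate is at most that value because some selected tokens are kept unchanged. Setting `p_mask=1.0` and `p_random=0.0` makes the two rates coincide, which is one simple way to see why the two effects can be reasoned about separately.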

Should You Mask 15% in Masked Language Modeling?
