Should You Mask 15% in Masked Language Modeling?

Alexander Wettig, Tianyu Gao, Zexuan Zhong, Danqi Chen
DOI: https://doi.org/10.48550/arXiv.2202.08005
2023-02-10
Abstract: Masked language models (MLMs) conventionally mask 15% of tokens due to the belief that more masking would leave insufficient context to learn good representations; this masking rate has been widely used, regardless of model sizes or masking strategies. In this work, we revisit this important choice of MLM pre-training. We first establish that 15% is not universally optimal, and larger models should adopt a higher masking rate. Specifically, we find that masking 40% outperforms 15% for BERT-large size models on GLUE and SQuAD. Interestingly, an extremely high masking rate of 80% can still preserve 95% fine-tuning performance and most of the accuracy in linguistic probing, challenging the conventional wisdom about the role of the masking rate. We then examine the interplay between masking rates and masking strategies and find that uniform masking requires a higher masking rate compared to sophisticated masking strategies such as span or PMI masking. Finally, we argue that increasing the masking rate has two distinct effects: it leads to more corruption, which makes the prediction task more difficult; it also enables more predictions, which benefits optimization. Using this framework, we revisit BERT's 80-10-10 corruption strategy. Together, our results contribute to a better understanding of MLM pre-training.
Computation and Language, Machine Learning
What problem does this paper attempt to address?
This paper addresses the choice of masking rate in masked language models (MLMs). MLMs conventionally mask 15% of tokens, on the assumption that a higher rate leaves too little context to learn good representations, while a much lower rate makes training inefficient. However, there has been little research on whether 15% remains appropriate across different model sizes, masking strategies, and optimization schemes. The paper therefore re-examines this choice and discusses the following points:

1. **The influence of model size on the optimal masking rate**: The paper first establishes that a 15% masking rate is not universally optimal and that larger models should adopt a higher rate. For BERT-large size models, masking 40% outperforms masking 15% on GLUE and SQuAD.
2. **The effect of extremely high masking rates**: Even at a masking rate of 80%, the model preserves 95% of its fine-tuning performance and most of its accuracy in linguistic probing. This challenges the conventional view that a high masking rate prevents the model from learning effective representations.
3. **The interaction between masking strategies and masking rates**: The paper studies how the masking rate interacts with different masking strategies (uniform masking, span masking, and PMI masking) and finds that uniform masking requires a higher masking rate than the more sophisticated strategies.
4. **Two effects of the masking rate**: The authors argue that increasing the masking rate has two distinct effects. It corrupts more of the input, which makes the prediction task more difficult, and it increases the number of predictions, which benefits optimization. Using this framework, the authors revisit BERT's 80-10-10 corruption strategy.

Overall, through a series of experiments and analyses, the paper clarifies the role of the masking rate in MLM pre-training and argues for choosing it more flexibly rather than defaulting to 15%.
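
To make the masking rate and the 80-10-10 corruption strategy concrete, here is a minimal sketch of uniform MLM masking with a configurable rate. It is an illustration under stated assumptions, not the authors' implementation: the function name `mlm_corrupt`, the `IGNORE` label value, and the token ids in the usage lines are all hypothetical, and the sketch assumes the usual reading of 80-10-10 (a selected token becomes `[MASK]` 80% of the time, a random token 10% of the time, and is left unchanged 10% of the time).

```python
import random

# Label for positions that are not predicted (an assumed convention for this sketch).
IGNORE = -100


def mlm_corrupt(token_ids, mask_rate, mask_id, vocab_size, rng,
                p_mask=0.8, p_random=0.1):
    """Uniformly select positions to predict, then corrupt the input at them.

    With the defaults, each selected token is replaced by `mask_id` with
    probability 0.8, by a random token id with probability 0.1, and kept
    unchanged otherwise (the 80-10-10 scheme).
    """
    n = len(token_ids)
    n_pred = max(1, round(mask_rate * n))      # number of predicted positions
    positions = rng.sample(range(n), n_pred)   # uniform masking, no repeats

    corrupted = list(token_ids)
    labels = [IGNORE] * n
    for i in positions:
        labels[i] = token_ids[i]               # the model must recover this token
        u = rng.random()
        if u < p_mask:
            corrupted[i] = mask_id
        elif u < p_mask + p_random:
            corrupted[i] = rng.randrange(vocab_size)
        # otherwise the original token is kept but still predicted
    return corrupted, labels


# Usage: the same 100-token sequence at three masking rates.
rng = random.Random(0)
tokens = list(range(1000, 1100))               # hypothetical token ids
for rate in (0.15, 0.40, 0.80):
    corrupted, labels = mlm_corrupt(tokens, rate, mask_id=103,
                                    vocab_size=30522, rng=rng)
    n_predicted = sum(label != IGNORE for label in labels)
    n_changed = sum(c != t for c, t in zip(corrupted, tokens))
    print(f"rate={rate:.2f} predicted={n_predicted} changed={n_changed}")
```

In this sketch the number of predicted positions is set by `mask_rate`, while the number of input positions actually changed is usually somewhat smaller because some selected tokens are kept. That gap is one simple way to see corruption and prediction as separate quantities, in the spirit of the two effects described above.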