Inconsistencies in Masked Language Models

Tom Young, Yunan Chen, Yang You
2024-02-23
Abstract: Learning to predict masked tokens in a sequence has been shown to be a helpful pretraining objective for powerful language models such as PaLM2. After training, such masked language models (MLMs) can provide distributions of tokens in the masked positions in a sequence. However, this paper shows that distributions corresponding to different masking patterns can demonstrate considerable inconsistencies, i.e., they cannot be derived from a coherent joint distribution when considered together. This fundamental flaw in MLMs can lead to self-contradictory behaviors during inference. On various benchmark datasets including MMLU, MLMs can give different predictions to the same input question. From BERT-base to UL2-20B, we show that such inconsistencies exist ubiquitously in MLMs of diverse sizes and configurations. In light of our observations, we further propose an inference-time strategy for MLMs called Ensemble of Conditionals. It jointly considers a selected range of inconsistent conditionals directly produced by the MLM for the final prediction, which often leads to considerable accuracy improvement.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the inconsistency of the conditional distributions that Masked Language Models (MLMs) produce under different masking patterns. During pretraining, an MLM learns a distribution over tokens at the masked positions given the surrounding context. The paper points out that the distributions obtained under different masking patterns may not be derivable from any single coherent joint distribution, which leads to self-contradictory behavior during inference. For example, on multiple benchmark datasets, including MMLU, Lambada, and BigBench, an MLM may give different predictions for the same question depending on how the input is masked.

The authors first quantify the severity of this inconsistency through experiments and then propose an inference-time strategy called Ensemble of Conditionals. This strategy makes the final prediction by combining multiple inconsistent conditional distributions produced directly by the MLM, which often improves accuracy considerably.

The main contributions of the paper include:
1. Revealing a commonly overlooked flaw in MLMs, namely that they can produce inconsistent distributions depending on the masking pattern.
2. Quantifying this inconsistency on multiple benchmark datasets. For example, on the multiple-choice questions of MMLU, two different distributions provided by UL2-20B disagree on the answer 14% of the time on average.
3. Demonstrating that combining a large number of inconsistent conditional distributions can considerably improve accuracy on these benchmarks.

Additionally, the paper explores how T5-style and BERT-style MLMs differ on this issue, which helps clarify the mechanisms behind the inconsistency.
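To make the two ideas above concrete, here is a minimal Python sketch on toy numbers; no real MLM is queried. It rests on two assumptions that are not taken from the paper. First, consistency is checked with a standard 2x2 cycle test: if P(x1 | x2) and P(x2 | x1) come from one joint distribution, a product of four conditional ratios equals 1. Second, the ensemble combines conditionals by plain probability averaging, because the summary does not state the paper's actual combination rule. The function names `conditionals_from_joint`, `cycle_product`, and `ensemble_of_conditionals` are hypothetical.

```python
from math import prod


def conditionals_from_joint(joint):
    """Derive P(x1 | x2) and P(x2 | x1) from a joint table {(x1, x2): prob}."""
    p1_given_2, p2_given_1 = {}, {}
    for (x1, x2), p in joint.items():
        m2 = sum(q for (_, y2), q in joint.items() if y2 == x2)  # marginal P(x2)
        m1 = sum(q for (y1, _), q in joint.items() if y1 == x1)  # marginal P(x1)
        p1_given_2.setdefault(x2, {})[x1] = p / m2
        p2_given_1.setdefault(x1, {})[x2] = p / m1
    return p1_given_2, p2_given_1


def cycle_product(p1_given_2, p2_given_1, a, a_alt, b, b_alt):
    """Product of four conditional ratios around a 2x2 cycle.

    If both conditionals come from one joint distribution, each ratio is a
    ratio of joint probabilities and the product telescopes to 1.
    Layout: p1_given_2[x2][x1] = P(x1 | x2), p2_given_1[x1][x2] = P(x2 | x1).
    """
    ratios = [
        p1_given_2[b][a] / p1_given_2[b][a_alt],          # P(a,b)   / P(a',b)
        p2_given_1[a_alt][b] / p2_given_1[a_alt][b_alt],  # P(a',b)  / P(a',b')
        p1_given_2[b_alt][a_alt] / p1_given_2[b_alt][a],  # P(a',b') / P(a,b')
        p2_given_1[a][b_alt] / p2_given_1[a][b],          # P(a,b')  / P(a,b)
    ]
    return prod(ratios)


def ensemble_of_conditionals(distributions):
    """Average several distributions over the same answer choices.

    Assumption: simple probability averaging, chosen for illustration only.
    Returns the top-scoring choice and the averaged distribution.
    """
    choices = distributions[0].keys()
    n = len(distributions)
    avg = {c: sum(d[c] for d in distributions) / n for c in choices}
    return max(avg, key=avg.get), avg


if __name__ == "__main__":
    # Conditionals derived from a real joint pass the cycle test.
    joint = {("a", "b"): 0.4, ("a", "b2"): 0.1, ("a2", "b"): 0.2, ("a2", "b2"): 0.3}
    p12, p21 = conditionals_from_joint(joint)
    print("consistent:", cycle_product(p12, p21, "a", "a2", "b", "b2"))

    # Overwriting one conditional, as an imperfect model might, breaks it.
    p21["a"] = {"b": 0.5, "b2": 0.5}
    print("inconsistent:", cycle_product(p12, p21, "a", "a2", "b", "b2"))

    # Toy answer distributions from three masking patterns for one question.
    dists = [
        {"A": 0.50, "B": 0.30, "C": 0.20},
        {"A": 0.20, "B": 0.45, "C": 0.35},
        {"A": 0.25, "B": 0.45, "C": 0.30},
    ]
    print("ensemble prediction:", ensemble_of_conditionals(dists)[0])
```

In the toy data, the first distribution alone would pick "A" and the second would pick "B", which mirrors the disagreement described above; the averaged distribution picks "B". With a real MLM, each dictionary would instead hold the model's probabilities for the answer choices under one masking pattern.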