Inconsistencies in Masked Language Models

Tom Young, Yunan Chen, Yang You

2024-02-23

Abstract: Learning to predict masked tokens in a sequence has been shown to be a helpful pretraining objective for powerful language models such as PaLM2. After training, such masked language models (MLMs) can provide distributions of tokens in the masked positions in a sequence. However, this paper shows that distributions corresponding to different masking patterns can demonstrate considerable inconsistencies, i.e., they cannot be derived from a coherent joint distribution when considered together. This fundamental flaw in MLMs can lead to self-contradictory behaviors during inference. On various benchmark datasets including MMLU, MLMs can give different predictions to the same input question. From BERT-base to UL2-20B, we show that such inconsistencies exist ubiquitously in MLMs of diverse sizes and configurations. In light of our observations, we further propose an inference-time strategy for MLMs called Ensemble of Conditionals. It jointly considers a selected range of inconsistent conditionals directly produced by the MLM for the final prediction, which often leads to considerable accuracy improvement.

Computation and Language, Artificial Intelligence

What problem does this paper attempt to address?

The paper addresses inconsistency among the conditional distributions that Masked Language Models (MLMs) produce under different masking patterns. During pre-training, an MLM learns the distribution of tokens at masked positions given the surrounding context. The paper points out that these distributions may not be derivable from a single coherent joint distribution, which leads to self-contradictory behavior during inference. For example, on benchmark datasets including MMLU, Lambada, and BigBench, an MLM may give different predictions for the same question depending on how the input is masked.

The authors first quantify the severity of this inconsistency experimentally and then propose an inference-time strategy called Ensemble of Conditionals. This strategy makes the final prediction by combining multiple inconsistent conditional distributions produced directly by the MLM, which often improves accuracy considerably.

The main contributions of the paper are:

1. Revealing a commonly overlooked flaw in MLMs: the distributions they produce can be mutually inconsistent depending on the masking pattern.

2. Quantifying this inconsistency on multiple benchmark datasets. On the multiple-choice questions of MMLU, for instance, two different distributions provided by UL2-20B disagree on the answer 14% of the time on average.

3. Demonstrating that combining a large number of inconsistent conditional distributions can considerably improve accuracy on these benchmarks.

Additionally, the paper examines how T5-style and BERT-style MLMs differ on this issue, which sheds further light on the intrinsic mechanisms of MLMs.
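The two ideas summarized above can be illustrated with a small, self-contained Python sketch. It is not the authors' implementation: the function names, the toy probabilities, and the choice to combine conditionals by averaging log-probabilities are all assumptions made for illustration, since the summary does not specify the combination rule. The first function checks whether two sets of conditionals agree with the chain rule, p(a) p(b | a) = p(b) p(a | b), which any coherent joint distribution must satisfy. The second combines several conditionals over the same answer candidates into one prediction.

```python
import math


def factorization_gap(p_a, p_b_given_a, p_b, p_a_given_b):
    """Largest gap between the two chain-rule factorizations of p(a, b).

    p_a[i] = p(a=i), p_b_given_a[i][j] = p(b=j | a=i),
    p_b[j] = p(b=j), p_a_given_b[j][i] = p(a=i | b=j).
    Conditionals derived from one coherent joint give a gap of ~0.
    """
    gap = 0.0
    for i in range(len(p_a)):
        for j in range(len(p_b)):
            left = p_a[i] * p_b_given_a[i][j]
            right = p_b[j] * p_a_given_b[j][i]
            gap = max(gap, abs(left - right))
    return gap


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda k: values[k])


def ensemble_of_conditionals(conditionals, eps=1e-12):
    """Combine distributions over the same candidates (hypothetical rule).

    Assumption: average the log-probabilities (a geometric mean) and
    renormalize. `eps` guards against log(0).
    """
    n = len(conditionals[0])
    avg_log = [
        sum(math.log(max(d[k], eps)) for d in conditionals) / len(conditionals)
        for k in range(n)
    ]
    peak = max(avg_log)
    weights = [math.exp(v - peak) for v in avg_log]
    total = sum(weights)
    return [w / total for w in weights]


# Toy two-token example. These conditionals all come from the joint
# [[0.4, 0.1], [0.2, 0.3]], so the two factorizations agree.
p_a, p_b = [0.5, 0.5], [0.6, 0.4]
p_b_given_a = [[0.8, 0.2], [0.4, 0.6]]
p_a_given_b = [[2 / 3, 1 / 3], [0.25, 0.75]]
consistent_gap = factorization_gap(p_a, p_b_given_a, p_b, p_a_given_b)

# Perturbing one conditional means no single joint can produce all four.
p_a_given_b_bad = [[0.5, 0.5], [0.25, 0.75]]
inconsistent_gap = factorization_gap(p_a, p_b_given_a, p_b, p_a_given_b_bad)

# Three made-up distributions over answers A-D for the same question,
# imagined as coming from three different masking patterns.
d1 = [0.40, 0.35, 0.15, 0.10]  # prefers A
d2 = [0.30, 0.45, 0.15, 0.10]  # prefers B
d3 = [0.25, 0.50, 0.15, 0.10]  # prefers B
combined = ensemble_of_conditionals([d1, d2, d3])
prediction = "ABCD"[argmax(combined)]
```

In this toy setup, `d1` and `d2` pick different answers for the same question, which is the kind of disagreement the paper measures. The ensemble resolves it by pooling all three distributions instead of trusting any single masking pattern.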

Representation Deficiency in Masked Language Modeling

PMI-Masking: Principled masking of correlated spans

Learning Better Masking for Better Language Model Pre-training

Should You Mask 15% in Masked Language Modeling?

Improving Self-supervised Pre-training via a Fully-Explored Masked Language Model

Uniform Masking Prevails in Vision-Language Pretraining

Mask More and Mask Later: Efficient Pre-training of Masked Language Models by Disentangling the [MASK] Token

Isotropy-Enhanced Conditional Masked Language Models

How does the task complexity of masked pretraining objectives affect downstream performance?

A Better Way to Do Masked Language Model Scoring

On the Influence of Masking Policies in Intermediate Pre-training

A Predictive Factor Analysis of Social Biases and Task-Performance in Pretrained Masked Language Models

Masked Language Modeling Becomes Conditional Density Estimation for Tabular Data Synthesis

Analysing the Effect of Masking Length Distribution of MLM: An Evaluation Framework and Case Study on Chinese MRC Datasets

Investigating Masking-based Data Generation in Language Models

Masked Language Models Know Which Are Popular: A Simple Ranking Strategy for Commonsense Question Answering

Weighted Sampling for Masked Language Modeling

Measuring Social Biases in Masked Language Models by Proxy of Prediction Quality

Probabilistically Masked Language Model Capable of Autoregressive Generation in Arbitrary Word Order

Towards Probabilistically-Sound Beam Search with Masked Language Models