Abstract: Beam search with masked language models (MLMs) is challenging in part because joint probability distributions over sequences are not readily available, unlike for autoregressive models. However, estimating such distributions has important domain-specific applications such as ancient text restoration and protein engineering. Here we present probabilistically-sound methods for beam search with MLMs. First, we clarify the conditions under which it is theoretically sound to perform text infilling with MLMs using standard beam search. When these conditions fail, we provide a probabilistically-sound inference time modification with no additional computational complexity and demonstrate that it is superior to the aforementioned beam search in the expected conditions. We then present empirical results comparing several infilling approaches with MLMs across several domains. Notably, our method probes the inductive biases of MLMs and explores the surprising contextual sensitivity of mask tokens for text infilling.

### What problem does this paper attempt to solve?

This paper addresses the difficulty of computing joint probability distributions when using masked language models (MLMs) for text infilling. MLMs learn the conditional distribution of individual tokens, whereas many practical applications (such as ancient text restoration and protein engineering) require the joint probability distribution over multiple tokens.

#### Main problems:

1. **Calculation of the joint probability distribution**: Unlike autoregressive models, MLMs do not directly provide the joint probability distribution \( p(x) \) of a sequence. Instead, they learn the conditional distribution \( p_i(x_i | x_{-i}) \) at each position, where \( x_{-i} \) denotes the sequence with the \( i \)-th position removed.
2. **Conditional independence assumption**: Existing methods usually rely on a conditional independence assumption, namely \( p(x_i | x_{:i}, [M]_{i:}) \approx p(x_i | x_{:i}) \). This may not hold in practice, because the mask tokens passed to the model can themselves affect the output distribution.
3. **Theoretical soundness**: When the conditional independence assumption does not hold, how can beam search be performed in a probabilistically sound way without additional computational complexity?

#### Solutions:

The paper proposes a correction based on the Hammersley-Clifford-Besag (HCB) theorem, called HCB beam search. The method accounts for these dependencies by introducing an adjustment term, which keeps the beam score probabilistically sound. The formula is:

\[ \log p(x_{j:k} | x_{:j}, x_{k:}) \sim \sum_{i = j}^{k} \left[ \log p(x_i | x_{:i}, [M]_{i:k}, x_{k:}) - \log p([M]_i | x_{:i}, [M]_{i:k}, x_{k:}) \right] \]

where \(\sim\) denotes equality up to the additive term \( \log p([M]_{j:k} | x_{:j}, x_{k:}) \), which is constant with respect to the tokens \( x_{j:k} \) to be filled.
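One way to see where the adjustment term comes from is sketched below. It assumes that the MLM conditionals are compatible with a single strictly positive joint distribution and that \( [M] \) is treated as an ordinary vocabulary symbol; neither assumption is guaranteed for a trained model. Under those assumptions, the HCB factorization (also known as Brook's lemma) relates any two configurations \( x \) and \( y \) of length \( n \) through single-site conditionals, where \( x_{<i} \) denotes the positions before \( i \) and \( y_{>i} \) the positions after it:

\[ \frac{p(x)}{p(y)} = \prod_{i=1}^{n} \frac{p(x_i | x_{<i}, y_{>i})}{p(y_i | x_{<i}, y_{>i})} \]

Taking \( y \) to be the all-mask configuration on the span from \( j \) to \( k \), conditioning throughout on the fixed context \( x_{:j}, x_{k:} \), and taking logarithms gives, in the notation used above (with \( [M]_{i:k} \) read here as the positions still masked in the model input when position \( i \) is predicted):

\[ \log p(x_{j:k} | x_{:j}, x_{k:}) = \log p([M]_{j:k} | x_{:j}, x_{k:}) + \sum_{i = j}^{k} \left[ \log p(x_i | x_{:i}, [M]_{i:k}, x_{k:}) - \log p([M]_i | x_{:i}, [M]_{i:k}, x_{k:}) \right] \]

The first term on the right does not depend on the infilled tokens, so it can be dropped when ranking candidates. The per-position mask term, by contrast, depends on the already-filled prefix \( x_{:i} \), so it differs between beams and cannot be dropped.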
In addition, the paper provides experimental evidence that HCB beam search outperforms standard beam search across multiple tasks and models, and it explores the context sensitivity of mask tokens in text infilling.

### Summary:

The main contribution of this paper is a probabilistically sound beam search method that addresses the difficulty of computing joint probability distributions with MLMs in text-infilling tasks, with experiments demonstrating its effectiveness.
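To make the HCB-corrected score concrete, the following is a minimal Python sketch of left-to-right beam search over a masked span. It is not the authors' implementation. The `MaskedLM` interface, the function `beam_search_infill`, and the hand-written `toy_mlm` probabilities are all hypothetical and chosen only for illustration. The sketch assumes the model assigns a probability to the mask symbol at each position, and it uses a half-open span `(start, end)` rather than the inclusive indices of the formula.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

MASK = "[M]"

# Hypothetical interface (an assumption, not a real library API): given the
# current token sequence (masks included) and a position, return a probability
# for every output symbol at that position, including the mask symbol itself.
MaskedLM = Callable[[Sequence[str], int], Dict[str, float]]


def beam_search_infill(
    mlm: MaskedLM,
    tokens: Sequence[str],
    span: Tuple[int, int],
    vocab: Sequence[str],
    beam_width: int,
    hcb: bool = True,
) -> List[Tuple[float, List[str]]]:
    """Fill positions span[0] .. span[1]-1 left to right, best beam first."""
    start, end = span
    beams: List[Tuple[float, List[str]]] = [(0.0, list(tokens))]
    for i in range(start, end):
        candidates = []
        for score, seq in beams:
            dist = mlm(seq, i)  # positions i .. end-1 are still masked in seq
            # HCB adjustment: subtract the log-probability of the mask symbol.
            # It depends on the filled prefix, so it differs between beams.
            correction = math.log(dist[MASK]) if hcb else 0.0
            for tok in vocab:
                filled = list(seq)
                filled[i] = tok
                candidates.append((score + math.log(dist[tok]) - correction, filled))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams


def toy_mlm(seq: Sequence[str], i: int) -> Dict[str, float]:
    """Hand-written toy conditionals over {"a", "b", MASK}; purely illustrative."""
    if i == 0:
        return {"a": 0.5, "b": 0.4, MASK: 0.1}
    if seq[0] == "a":
        return {"a": 0.3, "b": 0.6, MASK: 0.1}
    return {"a": 0.78, "b": 0.02, MASK: 0.2}


standard = beam_search_infill(toy_mlm, [MASK, MASK], (0, 2), ["a", "b"], beam_width=2, hcb=False)
corrected = beam_search_infill(toy_mlm, [MASK, MASK], (0, 2), ["a", "b"], beam_width=2, hcb=True)
print("standard beam search top infill:", standard[0][1])
print("HCB beam search top infill:", corrected[0][1])
```

In this toy setup the two scoring rules rank the candidates differently because the mask probability at the second position depends on the token chosen at the first. The correction reuses the output distribution already computed at each step, so it requires no extra forward passes through the model.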

Towards Probabilistically-Sound Beam Search with Masked Language Models

A Better Way to Do Masked Language Model Scoring

Inconsistencies in Masked Language Models

Representation Deficiency in Masked Language Modeling

Looking Right is Sometimes Right: Investigating the Capabilities of Decoder-only LLMs for Sequence Labeling

Analysing the Effect of Masking Length Distribution of MLM: An Evaluation Framework and Case Study on Chinese MRC Datasets

Masking-based Neural Beamformer for Multichannel Speech Enhancement

Exploration of Masked and Causal Language Modelling for Text Generation

Unsupervised Representation Learning of Player Behavioral Data with Confidence Guided Masking

Learning Better Masking for Better Language Model Pre-training

Enabling Beam Search for Language Model-Based Text-to-Speech Synthesis

Beam Prediction based on Large Language Models

Confidence-Aware Sub-Structure Beam Search (CABS): Mitigating Hallucination in Structured Data Generation with Large Language Models

Dynamic-Width Speculative Beam Decoding for Efficient LLM Inference

Investigating Label Bias in Beam Search for Open-ended Text Generation

Improving Self-supervised Pre-training via a Fully-Explored Masked Language Model

Improved Beam Search for Hallucination Mitigation in Abstractive Summarization

Focused learning by antibody language models using preferential masking of non-templated regions

Promises and Pitfalls of Generative Masked Language Modeling: Theoretical Framework and Practical Guidelines

Dynamic Benchmarking of Masked Language Models on Temporal Concept Drift with Multiple Views

Enhancing Speaker Diarization with Large Language Models: A Contextual Beam Search Approach