Towards Probabilistically-Sound Beam Search with Masked Language Models

Creston Brooks, Robert Calef, Charlie Cowen-Breen, Anna Sappington
2024-10-10
Abstract: Beam search with masked language models (MLMs) is challenging in part because joint probability distributions over sequences are not readily available, unlike for autoregressive models. However, estimating such distributions has important domain-specific applications such as ancient text restoration and protein engineering. Here we present probabilistically-sound methods for beam search with MLMs. First, we clarify the conditions under which it is theoretically sound to perform text infilling with MLMs using standard beam search. When these conditions fail, we provide a probabilistically-sound inference time modification with no additional computational complexity and demonstrate that it is superior to the aforementioned beam search in the expected conditions. We then present empirical results comparing several infilling approaches with MLMs across several domains. Notably, our method probes the inductive biases of MLMs and explores the surprising contextual sensitivity of mask tokens for text infilling.
Machine Learning, Computation and Language
What problem does this paper attempt to address?
This paper addresses the difficulty of computing joint probability distributions when using masked language models (MLMs) for text infilling. MLMs learn conditional distributions over individual tokens, whereas many practical applications (such as ancient text restoration and protein engineering) require the joint probability distribution over multiple tokens.

#### Main problems:

1. **Computing the joint probability distribution**: Unlike autoregressive models, MLMs do not directly provide the joint distribution \( p(x) \) over a sequence. Instead, they learn the conditional distribution \( p_i(x_i \mid x_{-i}) \) at each position, where \( x_{-i} \) denotes the sequence with the \( i \)-th position removed.
2. **Conditional independence assumption**: Existing methods usually rely on a conditional independence assumption, namely \( p(x_i \mid x_{:i}, [M]_{i:}) \approx p(x_i \mid x_{:i}) \). This may not hold in practice, because the mask tokens passed to the model can themselves affect the output distribution.
3. **Probabilistic soundness**: When the conditional independence assumption fails, it is unclear how to perform beam search in a probabilistically sound way without additional computational complexity.

#### Solutions:

The paper proposes a correction based on the Hammersley-Clifford-Besag (HCB) theorem, called HCB beam search. The method adds an adjustment term to each step's score to account for the dependence on mask tokens, which keeps the search probabilistically sound. The formula is:

\[ \log p(x_{j:k} \mid x_{:j}, x_{k:}) \sim \sum_{i = j}^{k} \left[ \log p(x_i \mid x_{:i}, [M]_{i:k}, x_{k:}) - \log p([M]_i \mid x_{:i}, [M]_{i:k}, x_{k:}) \right] \]

where \(\sim\) denotes equality up to an additive term, \( \log p([M]_{j:k} \mid x_{:j}, x_{k:}) \), that is constant with respect to the tokens \( x_{j:k} \) being filled.
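The corrected score can be checked numerically on a toy problem. The following is a minimal sketch, not the paper's implementation: it assumes a hypothetical, strictly positive joint distribution over length-3 sequences in which the mask symbol is an ordinary vocabulary item, and it computes each conditional exactly from that joint instead of querying a real MLM. The names `joint`, `conditional`, `standard_score`, and `hcb_score` are illustrative only, and the masked span is taken to be the whole sequence so there is no surrounding context.

```python
import itertools
import math
import random

MASK = 2                  # assumption: the mask is treated as an ordinary vocabulary symbol
VOCAB = [0, 1, MASK]
LENGTH = 3                # the whole toy sequence is the span being filled

# Hypothetical strictly positive joint distribution over all 27 sequences.
rng = random.Random(0)
weights = {seq: rng.uniform(0.1, 1.0)
           for seq in itertools.product(VOCAB, repeat=LENGTH)}
total = sum(weights.values())
joint = {seq: w / total for seq, w in weights.items()}


def conditional(i, token, context):
    """Exact p(x_i = token | all other positions of context); context[i] is ignored."""
    seq = list(context)
    seq[i] = token
    numerator = joint[tuple(seq)]
    denominator = 0.0
    for v in VOCAB:
        seq[i] = v
        denominator += joint[tuple(seq)]
    return numerator / denominator


def standard_score(fill):
    """Uncorrected left-to-right score: sum_i log p(x_i | x_{:i}, masks to the right)."""
    score = 0.0
    for i in range(LENGTH):
        context = list(fill[:i]) + [MASK] * (LENGTH - i)
        score += math.log(conditional(i, fill[i], context))
    return score


def hcb_score(fill):
    """Corrected score: additionally subtract log p([M]_i | same context) at each step."""
    score = 0.0
    for i in range(LENGTH):
        context = list(fill[:i]) + [MASK] * (LENGTH - i)
        score += math.log(conditional(i, fill[i], context))
        score -= math.log(conditional(i, MASK, context))
    return score


# The corrected score differs from the true log-joint only by the constant
# log p([M], [M], [M]), which is the same for every candidate fill.
log_all_mask = math.log(joint[(MASK,) * LENGTH])
for fill in itertools.product([0, 1], repeat=LENGTH):
    assert abs(hcb_score(fill) + log_all_mask - math.log(joint[fill])) < 1e-9
```

In this idealized setting, ranking candidates by `hcb_score` reproduces the ranking under the true joint, while `standard_score` carries no such guarantee. A trained MLM's conditionals need not be consistent with any single joint distribution, so the exact identity checked here should be read as the motivation for the correction, not as a property of real models.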
In addition, the paper provides experimental evidence that HCB beam search outperforms standard beam search across multiple tasks and models, and explores the context sensitivity of mask tokens in text infilling.

### Summary:

The main contribution of this paper is a probabilistically sound beam search method for MLMs. It addresses the difficulty of computing joint probability distributions in text-infilling tasks, and experiments demonstrate its effectiveness.