NextLevelBERT: Masked Language Modeling with Higher-Level Representations for Long Documents

Tamara Czinczoll,Christoph Hönes,Maximilian Schall,Gerard de Melo
2024-06-13
Abstract:While (large) language models have significantly improved over the last years, they still struggle to sensibly process long sequences found, e.g., in books, due to the quadratic scaling of the underlying attention mechanism. To address this, we propose NextLevelBERT, a Masked Language Model operating not on tokens, but on higher-level semantic representations in the form of text embeddings. We pretrain NextLevelBERT to predict the vector representation of entire masked text chunks and evaluate the effectiveness of the resulting document vectors on three types of tasks: 1) Semantic Textual Similarity via zero-shot document embeddings, 2) Long document classification, 3) Multiple-choice question answering. We find that next-level Masked Language Modeling is an effective technique to tackle long-document use cases and can outperfor much larger embedding models as long as the required level of detail of semantic information is not too fine. Our models and code are publicly available online.
Computation and Language
What problem does this paper attempt to address?
This paper aims to address the challenges faced by large language models when processing long documents, particularly the performance bottleneck caused by the quadratic complexity of the attention mechanism. The authors propose a new masked language model, NextLevelBERT, which operates on higher-level semantic representations of text embeddings rather than directly at the token level. In this way, NextLevelBERT achieves better performance in long document tasks, including zero-shot semantic text similarity, long document classification, and multiple-choice question answering. Experimental results show that NextLevelBERT can outperform other large embedding models without requiring overly fine semantic details, and its parameter count is only one-third of the strongest competitor. Additionally, the paper explores the impact of different chunking strategies on model performance and the importance of initialization methods during pre-training. In summary, NextLevelBERT provides an effective approach to handling long documents and demonstrates superior performance across multiple tasks.