Abstract: BERT (Bidirectional Encoder Representations from Transformers) has revolutionized natural language processing through its strong performance on numerous tasks. Most follow-up research has concentrated on improvements to the model structure, such as relative position embeddings and more efficient attention mechanisms. Other work has explored pretraining tricks for Masked Language Modeling (MLM), including whole word masking. DeBERTa introduced an enhanced decoder adapted to BERT's encoder model for pretraining, which proved highly effective. We argue that the design of, and research on, enhanced masked language modeling decoders has been underappreciated. In this paper, we propose several enhanced decoder designs and introduce BPDec (BERT Pretraining Decoder), a novel method for model training. Typically, a pretrained BERT model is fine-tuned for specific Natural Language Understanding (NLU) tasks. In our approach, we use the original BERT model as the encoder and change only the decoder, leaving the encoder unaltered. The approach therefore requires no extensive modification of the encoder architecture and can be integrated into existing fine-tuning pipelines and services, offering an efficient and effective enhancement strategy. Like other methods, ours incurs a moderate training cost for the decoder during pretraining, but unlike them it introduces no additional training cost during fine-tuning. We pretrain with multiple enhanced decoder structures and evaluate their performance on the GLUE and SQuAD tasks. Our results demonstrate that BPDec, with only subtle refinements to the model structure during pretraining, significantly enhances model performance without increasing fine-tuning cost, inference time, or serving budget.
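
The abstract names whole word masking as one of the Masked Language Modeling pretraining tricks that prior work has explored. As background only, here is a minimal sketch of that idea; it is not BPDec and not the authors' code. It assumes BERT-style WordPiece tokens in which a continuation piece starts with `##`. The function name `whole_word_mask` and its parameters are hypothetical. It also simplifies in two ways: each word is selected independently with probability `mask_prob`, and every selected piece is replaced by `[MASK]`, without the random-replacement and keep-original cases of the original BERT recipe.

```python
import random


def whole_word_mask(tokens, mask_prob=0.15, mask_token="[MASK]", seed=None):
    """Mask all WordPiece pieces of each selected word together.

    Illustrative sketch: assumes "##" marks a continuation piece and
    never masks the special tokens listed below.
    """
    rng = random.Random(seed)
    special = {"[CLS]", "[SEP]", "[PAD]"}

    # Group token indices into words: a "##" piece joins the word
    # that ends at the immediately preceding position.
    words = []
    for i, tok in enumerate(tokens):
        if tok in special:
            continue
        if tok.startswith("##") and words and words[-1][-1] == i - 1:
            words[-1].append(i)
        else:
            words.append([i])

    masked = list(tokens)
    labels = [None] * len(tokens)  # None means "no prediction at this position"
    for word in words:
        # One draw per word, so its pieces are masked together or not at all.
        if rng.random() < mask_prob:
            for i in word:
                labels[i] = tokens[i]
                masked[i] = mask_token
    return masked, labels


if __name__ == "__main__":
    toks = ["[CLS]", "un", "##believ", "##able", "results", "[SEP]"]
    print(whole_word_mask(toks, mask_prob=0.5, seed=0))
```

Because the masking decision is drawn once per word rather than once per token, the pieces `un`, `##believ`, and `##able` are always masked together or left intact together, which is the property that distinguishes whole word masking from per-token masking.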

BPDec: Unveiling the Potential of Masked Language Modeling Decoder in BERT pretraining
