Abstract: BERT (Bidirectional Encoder Representations from Transformers) has revolutionized the field of natural language processing through its exceptional performance on numerous tasks. Yet most researchers have concentrated on enhancements to the model structure, such as relative position embedding and more efficient attention mechanisms. Others have explored pretraining tricks associated with Masked Language Modeling, including whole word masking. DeBERTa introduced an enhanced decoder adapted for BERT's encoder model for pretraining, which proved highly effective. We argue that the design and research around enhanced masked language modeling decoders have been underappreciated. In this paper, we propose several designs of enhanced decoders and introduce BPDec (BERT Pretraining Decoder), a novel method for model training. Typically, a pretrained BERT model is fine-tuned for specific Natural Language Understanding (NLU) tasks. In our approach, we use the original BERT model as the encoder, changing only the decoder and leaving the encoder unaltered. This approach does not require extensive modifications to the encoder architecture and can be seamlessly integrated into existing fine-tuning pipelines and services, offering an efficient and effective enhancement strategy. Compared to other methods, while we also incur a moderate training cost for the decoder during pretraining, our approach does not introduce additional training costs during the fine-tuning phase. We test multiple enhanced decoder structures after pretraining and evaluate their performance on the GLUE and SQuAD tasks. Our results demonstrate that BPDec, having undergone only subtle refinements to the model structure during pretraining, significantly enhances model performance without increasing the fine-tuning cost, inference time, or serving budget.

What problem does this paper attempt to address?

The paper addresses a gap in structural improvements to the pre-trained BERT model in Natural Language Processing (NLP): the decoder used for the Masked Language Modeling (MLM) task has received little attention. Its goals can be summarized as follows:

1. **Propose BPDec (BERT Pretraining Decoder)**: The paper proposes a new architecture called BPDec. The core idea is to add a specially designed decoder module on top of the original BERT encoder to strengthen BERT's handling of the Masked Language Modeling task during pre-training. This decoder is active only during pre-training and adds no computational cost during fine-tuning or deployment.
2. **Improve BERT's performance**: By introducing BPDec, the paper aims to improve the BERT model's overall performance, especially on downstream natural language understanding tasks such as text classification and semantic similarity judgment, while maintaining or reducing computational costs compared to existing advanced models.
3. **Avoid additional computational burden**: In contrast to existing high-performance models such as DeBERTa, BPDec is designed to improve model performance without significantly increasing computational resource consumption. The added capacity is confined to the pre-training phase, so fine-tuning and practical use remain as efficient as with the original BERT.
4. **Explore BERT's potential**: The paper also explores how much additional performance can be drawn from BERT by changing the model structure, specifically by introducing a decoder module, without introducing complex mechanisms or adding a large number of parameters as DeBERTa does.

In summary, the paper is dedicated to improving the BERT model during the pre-training phase by introducing a novel decoder structure, BPDec, and thereby enhancing its performance on various natural language processing tasks while minimizing the demand for computational resources.
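
The claim that the decoder adds cost only during pre-training can be illustrated with a rough parameter count. The sketch below is not the paper's implementation: it assumes BERT-base dimensions (12 encoder layers, hidden size 768, feed-forward size 3072), assumes the extra decoder is built from standard Transformer layers, and uses a hypothetical decoder depth of 2, since this summary does not state the actual decoder design. Embeddings and task heads are left out of the count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerConfig:
    hidden: int = 768   # assumed BERT-base hidden size
    ffn: int = 3072     # assumed feed-forward inner size


def transformer_layer_params(cfg: LayerConfig) -> int:
    """Weights and biases in one standard Transformer encoder layer."""
    # Query, key, value, and output projections, each hidden x hidden plus bias.
    attention = 4 * (cfg.hidden * cfg.hidden + cfg.hidden)
    # Two linear maps: hidden -> ffn and ffn -> hidden, each with bias.
    feed_forward = (cfg.hidden * cfg.ffn + cfg.ffn) + (cfg.ffn * cfg.hidden + cfg.hidden)
    # Two LayerNorms, each with a scale and a bias vector.
    layer_norms = 2 * 2 * cfg.hidden
    return attention + feed_forward + layer_norms


def stack_params(num_layers: int, cfg: LayerConfig) -> int:
    """Parameters in a stack of identical Transformer layers."""
    return num_layers * transformer_layer_params(cfg)


ENCODER_LAYERS = 12  # BERT-base encoder depth
DECODER_LAYERS = 2   # hypothetical depth, chosen only for illustration

cfg = LayerConfig()

# Pre-training: encoder plus the extra MLM decoder are both trained.
pretrain_params = stack_params(ENCODER_LAYERS + DECODER_LAYERS, cfg)

# Fine-tuning and serving: the decoder is discarded, only the encoder remains.
finetune_params = stack_params(ENCODER_LAYERS, cfg)

decoder_overhead = pretrain_params - finetune_params

if __name__ == "__main__":
    print(f"pre-training layer parameters: {pretrain_params:,}")
    print(f"fine-tuning layer parameters:  {finetune_params:,}")
    print(f"decoder overhead (pre-training only): {decoder_overhead:,}")
```

Under these assumptions the fine-tuning parameter count equals that of an unmodified BERT-base encoder, which is why an existing fine-tuning pipeline could load the resulting checkpoint without changes.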

BPDec: Unveiling the Potential of Masked Language Modeling Decoder in BERT pretraining

An Improved Mask Approach Based on Pointer Network for Domain Adaptation of BERT

NarrowBERT: Accelerating Masked Language Model Pretraining and Inference

FEDBFPT: an Efficient Federated Learning Framework for BERT Further Pre-Training

MPNet: Masked and Permuted Pre-training for Language Understanding

Pre-Training with Whole Word Masking for Chinese BERT

Incorporating BERT into Parallel Sequence Decoding with Adapters

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Drop your Decoder: Pre-training with Bag-of-Word Prediction for Dense Passage Retrieval

Bootstrapped Masked Autoencoders for Vision BERT Pretraining

DACBERT: Leveraging Dependency Agreement for Cost-Efficient Bert Pretraining

Boosting Distributed Training Performance of the Unpadded BERT Model

Efficient Training of BERT by Progressively Stacking

MC-BERT: Efficient Language Pre-Training via a Meta Controller

BERT-JAM: Maximizing the Utilization of BERT for Neural Machine Translation

Mask More and Mask Later: Efficient Pre-training of Masked Language Models by Disentangling the [MASK] Token

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Using Selective Masking as a Bridge between Pre-training and Fine-tuning

Neuro-BERT: Rethinking Masked Autoencoding for Self-supervised Neurological Pretraining

Boosting Point-BERT by Multi-choice Tokens

BERTwich: Extending BERT's Capabilities to Model Dialectal and Noisy Text