Abstract: BERT (Bidirectional Encoder Representations from Transformers) has revolutionized natural language processing through its exceptional performance on numerous tasks. Most follow-up research, however, has concentrated on improvements to the model structure, such as relative position embedding and more efficient attention mechanisms. Other work has explored pretraining tricks for Masked Language Modeling (MLM), including whole word masking. DeBERTa introduced an enhanced decoder adapted to BERT's encoder model for pretraining, which proved highly effective. We argue that the design of enhanced masked language modeling decoders has been underappreciated. In this paper, we propose several enhanced decoder designs and introduce BPDec (BERT Pretraining Decoder), a novel method for model training. Typically, a pretrained BERT model is fine-tuned for specific Natural Language Understanding (NLU) tasks. In our approach, we use the original BERT model as the encoder and change only the decoder, leaving the encoder unaltered. The approach therefore requires no extensive modifications to the encoder architecture and can be integrated seamlessly into existing fine-tuning pipelines and services, offering an efficient and effective enhancement strategy. Like other methods, we incur a moderate training cost for the decoder during pretraining, but our approach introduces no additional training cost during fine-tuning. We test multiple enhanced decoder structures after pretraining and evaluate their performance on the GLUE and SQuAD tasks. Our results demonstrate that BPDec, with only subtle refinements to the model structure during pretraining, significantly enhances model performance without increasing fine-tuning cost, inference time, or serving budget.

RoBERTa: A Robustly Optimized BERT Pretraining Approach

bert2BERT: Towards Reusable Pretrained Language Models

ExtremeBERT: A Toolkit for Accelerating Pretraining of Customized BERT

MosaicBERT: A Bidirectional Encoder Optimized for Fast Pretraining

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

EarlyBERT: Efficient BERT Training Via Early-bird Lottery Tickets

Breaking MLPerf Training: A Case Study on Optimizing BERT

Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing

StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding

DACBERT: Leveraging Dependency Agreement for Cost-Efficient Bert Pretraining

Efficient Fine-Tuning of Compressed Language Models with Learners

Learning Which Features Matter: RoBERTa Acquires a Preference for Linguistic Generalizations (Eventually)

Recall and Learn: Fine-tuning Deep Pretrained Language Models with Less Forgetting

RoChBert: Towards Robust BERT Fine-tuning for Chinese

NarrowBERT: Accelerating Masked Language Model Pretraining and Inference

Empirical Analysis of Efficient Fine-Tuning Methods for Large Pre-Trained Language Models

BPDec: Unveiling the Potential of Masked Language Modeling Decoder in BERT pretraining

ROSITA: Refined BERT cOmpreSsion with InTegrAted techniques

ContraBERT: Enhancing Code Pre-trained Models via Contrastive Learning

Towards Structured Dynamic Sparse Pre-Training of BERT