Abstract: BERT (Bidirectional Encoder Representations from Transformers) has revolutionized the field of natural language processing through its exceptional performance on numerous tasks. Yet most researchers have concentrated on enhancements to the model structure, such as relative position embedding and more efficient attention mechanisms. Others have explored pretraining tricks associated with Masked Language Modeling, including whole word masking. DeBERTa introduced an enhanced decoder, adapted to BERT's encoder model, for pretraining, and it proved highly effective. We argue that the design of, and research into, enhanced masked language modeling decoders have been underappreciated. In this paper, we propose several enhanced decoder designs and introduce BPDec (BERT Pretraining Decoder), a novel pretraining method. Typically, a pretrained BERT model is fine-tuned for specific Natural Language Understanding (NLU) tasks. In our approach, we use the original BERT model as the encoder and change only the decoder, leaving the encoder unaltered. This approach requires no extensive modifications to the encoder architecture and can be seamlessly integrated into existing fine-tuning pipelines and services, offering an efficient and effective enhancement strategy. Like other methods, our approach incurs a moderate training cost for the decoder during pretraining, but it introduces no additional training cost during the fine-tuning phase. We pretrain with multiple enhanced decoder structures and evaluate the resulting models on the GLUE and SQuAD tasks. Our results demonstrate that BPDec, which makes only subtle refinements to the model structure during pretraining, significantly enhances model performance without increasing the fine-tuning cost, inference time, or serving budget.

An Improved Mask Approach Based on Pointer Network for Domain Adaptation of BERT

Adapt-and-Distill: Developing Small, Fast and Effective Pretrained Language Models for Domains

Point Cloud Domain Adaptation Via Masked Local 3D Structure Prediction

Domain-oriented Language Pre-training with Adaptive Hybrid Masking and Optimal Transport Alignment

Using Selective Masking as a Bridge between Pre-training and Fine-tuning

Learning to share by masking the non-shared for multi-domain sentiment classification

Boosting Point-BERT by Multi-choice Tokens

Pre-Training with Whole Word Masking for Chinese BERT

DA-BERT: Enhancing Knowledge Selection in Dialog via Domain Adapted BERT with Dynamic Masking Probability

BPDec: Unveiling the Potential of Masked Language Modeling Decoder in BERT pretraining

POS-BERT: Point Cloud One-Stage BERT Pre-Training

Advancing Domain Adaptation of BERT by Learning Domain Term Semantics

Mask More and Mask Later: Efficient Pre-training of Masked Language Models by Disentangling the [MASK] Token

MPNet: Masked and Permuted Pre-training for Language Understanding

NarrowBERT: Accelerating Masked Language Model Pretraining and Inference

A Joint Domain-Specific Pre-Training Method Based on Data Enhancement

Improving Adversarial Robustness of Masked Autoencoders via Test-time Frequency-domain Prompting

Protum: A New Method For Prompt Tuning Based on "[MASK]"

Improving BERT-Based Text Classification With Auxiliary Sentence and Domain Knowledge

KGNER: Improving Chinese Named Entity Recognition by BERT Infused with the Knowledge Graph

Train No Evil: Selective Masking for Task-Guided Pre-Training