Abstract: Sequence-to-sequence (seq2seq) learning is a popular paradigm for large-scale language model pretraining. However, prior seq2seq pretraining models generally focus on reconstructive objectives on the decoder side and neglect the effect of encoder-side supervision, which we argue may lead to sub-optimal performance. To verify our hypothesis, we first empirically study the functionalities of the encoder and decoder in seq2seq pretrained language models, and find that the encoder plays an important but under-exploited role compared with the decoder in terms of downstream performance and neuron activation. Therefore, we propose an encoding-enhanced seq2seq pretraining strategy, namely E2S2, which improves seq2seq models by integrating more efficient self-supervised information into the encoders. Specifically, E2S2 adopts two self-supervised objectives on the encoder side: 1) locally denoising the corrupted sentence (denoising objective); and 2) globally learning better sentence representations (contrastive objective). With the help of both objectives, the encoder can effectively distinguish noise tokens and capture high-level (i.e., syntactic and semantic) knowledge, thus strengthening the seq2seq model's ability to perform conditional generation accurately. Across a wide range of downstream natural language understanding and generation tasks, E2S2 substantially improves the performance of its powerful backbone models, e.g., BART and T5. For example, with the BART backbone, we achieve an average gain of +1.1% on the General Language Understanding Evaluation (GLUE) benchmark and a +1.75% F_0.5 score improvement on the CoNLL2014 dataset. We also provide in-depth analyses showing that the improvement stems from better linguistic representation. We hope that our work will foster future self-supervision research on seq2seq language model pretraining.
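
The abstract names two encoder-side objectives but gives no formulas, so the following is only a minimal sketch of how such objectives are commonly combined, not the authors' implementation. It assumes a token-level cross-entropy for the denoising objective, an InfoNCE-style loss with in-batch negatives for the contrastive objective, and a weighted sum with the usual decoder-side loss. All function names, the temperature, and the weights are hypothetical.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax over a list of floats."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def denoising_loss(token_logits, original_ids, corrupted_positions):
    """Assumed local objective: mean cross-entropy for recovering the
    original token at each corrupted position of the encoder input."""
    losses = [-log_softmax(token_logits[i])[original_ids[i]]
              for i in corrupted_positions]
    return sum(losses) / len(losses)

def contrastive_loss(anchors, positives, temperature=0.1):
    """Assumed global objective: InfoNCE over sentence representations.
    anchors[i] should match positives[i]; every other row in the batch
    serves as a negative."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        sims = [cosine(anchor, p) / temperature for p in positives]
        total += -log_softmax(sims)[i]
    return total / len(anchors)

def encoder_enhanced_loss(decoder_loss, l_denoise, l_contrast,
                          w_denoise=1.0, w_contrast=1.0):
    """Hypothetical weighted sum of the decoder-side reconstruction loss
    and the two encoder-side losses."""
    return decoder_loss + w_denoise * l_denoise + w_contrast * l_contrast
```

In this sketch, the denoising term only sees positions marked as corrupted, while the contrastive term operates on one vector per sentence, which mirrors the local versus global split described in the abstract.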

PARADISE: Exploiting Parallel Data for Multilingual Sequence-to-Sequence Pretraining

PEACH: Pre-Training Sequence-to-Sequence Multilingual Models for Translation with Semi-Supervised Pseudo-Parallel Document Generation

Iterative Task-adaptive Pretraining for Unsupervised Word Alignment

Multilingual Denoising Pre-training for Neural Machine Translation

Multimodal Pretraining from Monolingual to Multilingual

Bridging Cross-Lingual Gaps During Leveraging the Multilingual Sequence-to-Sequence Pretraining for Text Generation and Understanding

Enhancing Translation Accuracy of Large Language Models through Continual Pre-Training on Parallel Data

A Recipe of Parallel Corpora Exploitation for Multilingual Large Language Models

Self-Supervised Audio-and-Text Pre-training with Extremely Low-Resource Parallel Data

Understanding and Improving Sequence-to-Sequence Pretraining for Neural Machine Translation

Incorporating BERT into Parallel Sequence Decoding with Adapters

Pre-training via Leveraging Assisting Languages and Data Selection for Neural Machine Translation

Boosting Unsupervised Machine Translation with Pseudo-Parallel Data

DeltaLM: Encoder-Decoder Pre-training for Language Generation and Translation by Augmenting Pretrained Multilingual Encoders

E2S2: Encoding-Enhanced Sequence-to-Sequence Pretraining for Language Understanding and Generation

On the Complementarity between Pre-Training and Back-Translation for Neural Machine Translation

Word Translation Without Parallel Data

DEPT: Decoupled Embeddings for Pre-training Language Models

Bilingual Dictionary-based Language Model Pretraining for Neural Machine Translation

Improving Neural Machine Translation Models with Monolingual Data

Improving Language Transfer Capability of Decoder-only Architecture in Multilingual Neural Machine Translation