Abstract: Randomness is widely introduced in neural network training to simplify model optimization or to avoid over-fitting. Among such techniques, dropout and its variants, which apply to different aspects of training (e.g., data, model structure), are prevalent for regularizing deep neural networks. Though effective, the randomness introduced by these dropout-based methods causes a non-negligible inconsistency between training and inference. In this paper, we introduce a simple consistency training strategy to regularize such randomness, namely R-Drop, which forces the two output distributions sampled under each type of randomness to be consistent. Specifically, for each training sample, R-Drop minimizes the bidirectional KL-divergence between two output distributions produced by dropout-based randomness. Theoretical analysis reveals that R-Drop can reduce the above inconsistency by reducing the inconsistency among the sampled sub-structures and bridging the gap between the losses calculated by the full model and by the sub-structures. Experiments on 7 widely used deep learning tasks (23 datasets in total) demonstrate that R-Drop is universally effective for different types of neural networks (i.e., feed-forward, recurrent, and graph neural networks) and different learning paradigms (supervised, parameter-efficient, and semi-supervised). In particular, it achieves state-of-the-art performance with the vanilla Transformer model on WMT14 English → German translation (30.91 BLEU) and WMT14 English → French translation (43.95 BLEU), even surpassing models trained with extra large-scale data and expert-designed advanced variants of the Transformer. Our code is available on GitHub at https://github.com/dropreg/R-Drop.
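The consistency objective described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation (that lives in the linked repository). It assumes a toy linear classifier with dropout applied to the input features. The names `forward_with_dropout` and `r_drop_loss`, the weight `alpha`, and the choice to average the two KL directions are all hypothetical choices made for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps))
               for pi, qi in zip(p, q))


def forward_with_dropout(x, weights, drop_prob, rng):
    """One stochastic forward pass: inverted dropout on the inputs,
    then a linear layer and softmax (toy model, assumed for illustration)."""
    keep = 1.0 - drop_prob
    dropped = [xi / keep if rng.random() < keep else 0.0 for xi in x]
    logits = [sum(w * xi for w, xi in zip(row, dropped)) for row in weights]
    return softmax(logits)


def r_drop_loss(x, label, weights, drop_prob=0.3, alpha=1.0, rng=None):
    """Pass the same sample through the model twice with independent
    dropout masks, then combine the negative log-likelihood of both
    passes with the bidirectional KL between their output distributions.
    `alpha` and the 0.5 averaging are assumptions of this sketch."""
    rng = rng or random.Random()
    p1 = forward_with_dropout(x, weights, drop_prob, rng)
    p2 = forward_with_dropout(x, weights, drop_prob, rng)
    nll = -(math.log(p1[label] + 1e-12) + math.log(p2[label] + 1e-12))
    kl = 0.5 * (kl_divergence(p1, p2) + kl_divergence(p2, p1))
    return nll + alpha * kl, nll, kl


if __name__ == "__main__":
    weights = [[0.2, -0.1, 0.4], [-0.3, 0.5, 0.1]]  # 2 classes, 3 features
    total, nll, kl = r_drop_loss([1.0, 2.0, -1.0], 1, weights,
                                 rng=random.Random(0))
    print(f"total={total:.4f} nll={nll:.4f} kl={kl:.4f}")
```

With `drop_prob=0.0` the two passes coincide and the KL term vanishes, so the regularizer only acts when dropout makes the two sampled sub-structures disagree.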
