Abstractive Summarization. We evaluate MINILM on two abstractive summarization datasets: XSum [22] and the non-anonymized version of CNN/DailyMail [30]. The generation task is to condense a document into a concise and fluent summary while conveying its key information. We report ROUGE scores [18] on both datasets. Table 3 presents the results of MINILM, the baseline, several state-of-the-art models, and pre-trained Transformer models. Our 12x384 model outperforms the BERT-based method BERTSUMABS [19] and the pre-trained sequence-to-sequence model MASSBASE [31] with far fewer parameters. Moreover, our 6x384 MINILM also achieves competitive performance.

∗ Contact person.
skylion007.github.io/OpenWebTextCorpus
34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

Table 1: The results of MINILM distilled from an in-house pre-trained Transformer model (BERTBASE size, 12-layer Transformer, 768 hidden size, and 12 self-attention heads) on SQuAD 2.0 and the GLUE benchmark. We report the results of our 12-layer and 6-layer models with 384 hidden size. The fine-tuning results are averaged over 4 runs.

| Model | #Param | SQuAD2 | MNLI-m | SST-2 | QNLI | CoLA | RTE | MRPC | QQP | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| BERTBASE | 109M | 76.8 | 84.5 | 93.2 | 91.7 | 58.9 | 68.6 | 87.3 | 91.3 | 81.5 |
| MINILM | 33M | 81.7 | 85.7 | 93.0 | 91.5 | 58.5 | 73.3 | 89.5 | 91.3 | 83.1 |
| MINILM (w/ TA) | 22M | 75.6 | 83.3 | 91.5 | 90.5 | 47.5 | 68.8 | 88.9 | 90.6 | 79.6 |

Table 2: Question generation results of our 12-layer and 6-layer models with 384 hidden size on SQuAD 1.1. The first block follows the data split in Du and Cardie [7], while the second block is the same as in Zhao et al. [44].

First block (data split of Du and Cardie [7]):

| Model | #Param | BLEU-4 | METEOR | ROUGE-L |
|---|---|---|---|---|
| Du and Cardie [7] | - | 15.16 | 19.12 | - |
| Zhang and Bansal [43] | - | 18.37 | 22.65 | 46.68 |
| UNILMLARGE | 340M | 22.78 | 25.49 | 51.57 |
| MINILM | 33M | 21.07 | 24.09 | 49.14 |
| MINILM (w/ TA) | 22M | 20.31 | 23.43 | 48.21 |

Second block (data split of Zhao et al. [44]):

| Model | #Param | BLEU-4 | METEOR | ROUGE-L |
|---|---|---|---|---|
| Zhao et al. [44] | - | 16.38 | 20.25 | 44.48 |
| Zhang and Bansal [43] | - | 20.76 | 24.20 | 48.91 |
| UNILMLARGE | 340M | 24.32 | 26.10 | 52.69 |
| MINILM | 33M | 23.27 | 25.15 | 50.60 |
| MINILM (w/ TA) | 22M | 22.01 | 24.24 | 49.51 |

1.2 Multilingual MINILM

We present the number of Transformer and embedding parameters for different multilingual pre-trained models and our distilled models in Table 4. We also report per-language XNLI results in Table 5 and per-language MLQA results in Table 6.

1.3 Supplementary Ablation Studies

Table 7 compares transferring the value relation with transferring hidden states using mean squared error (MSE). For the latter (Hidden-MSE), we use the self-attention distributions and hidden states of the teacher's last Transformer layer to guide the training of the student model. A parameter matrix is introduced to transform the student hidden states to the same size as the teacher hidden states. Using the value relation performs better than transferring hidden states: it avoids the additional transformation and introduces more knowledge of word dependencies. We also tried transferring the relation between hidden states instead of directly transferring the vectors, but found that the performance of student models is unstable across different teacher models.

To study the rationale behind transferring the teacher's last Transformer layer, we compare more strategies for mapping teacher and student layers. We conduct experiments using a 3-layer student model with 384 hidden size. Besides transferring the teacher's knowledge to the last student layer only and to all three student layers (adopting a uniform strategy to map teacher layers to student layers), we use the uniform strategy to determine the layer mapping but transfer the knowledge of the corresponding teacher layers to only two student layers: either the last two layers, or the first and last layers. Table 8 shows the results of the different strategies. Transferring the last layer performs better than the strategies using two layers.
Transferring two layers achieves better performance than transferring all three layers. Relaxing the restrictions on layer mapping thus improves performance. Because the student always has fewer layers than the teacher, the knowledge ideally learned at each student layer may differ from the knowledge of the corresponding teacher layer. Transferring only the teacher's last layer gives the student more flexibility in learning that knowledge.
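The layer-mapping strategies compared above can be illustrated with a minimal sketch. It assumes a 12-layer teacher (the teacher depth for this ablation is not stated in this section) and reads the two-layer variants as "last two layers" and "first and last layers". The function names `uniform_layer_map` and `select_transfer_pairs` are hypothetical helpers introduced only for illustration, not part of any released code.

```python
from typing import Dict, List, Set, Tuple


def uniform_layer_map(num_teacher_layers: int, num_student_layers: int) -> Dict[int, int]:
    """Map each student layer (1-indexed) to a teacher layer at a uniform stride."""
    if num_teacher_layers % num_student_layers != 0:
        raise ValueError("this sketch assumes the teacher depth is a multiple of the student depth")
    stride = num_teacher_layers // num_student_layers
    return {s: s * stride for s in range(1, num_student_layers + 1)}


def select_transfer_pairs(layer_map: Dict[int, int], student_layers: Set[int]) -> List[Tuple[int, int]]:
    """Keep only the (student layer, teacher layer) pairs that receive supervision."""
    return [(s, layer_map[s]) for s in sorted(student_layers)]


# Assumed setup: 12-layer teacher, 3-layer student.
layer_map = uniform_layer_map(12, 3)

# The mapping is always uniform; the strategies differ only in which
# student layers are actually supervised by their mapped teacher layer.
strategies = {
    "last layer": select_transfer_pairs(layer_map, {3}),
    "last two layers": select_transfer_pairs(layer_map, {2, 3}),
    "first and last layers": select_transfer_pairs(layer_map, {1, 3}),
    "all three layers": select_transfer_pairs(layer_map, {1, 2, 3}),
}

for name, pairs in strategies.items():
    print(name, pairs)
```

In this sketch, supervising fewer student layers leaves the unsupervised layers free of any per-layer constraint, which is the sense in which the last-layer strategy relaxes the mapping.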
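The contrast between value-relation transfer and Hidden-MSE drawn earlier in this section can also be sketched in code. The sketch assumes the value relation is the row-wise softmax of the scaled dot product between value vectors, compared with KL divergence (teacher first), following the main paper; this section does not restate that definition. All dimensions are toy values, the inputs are random, and every function name is a hypothetical helper. The point is only the shapes: the relation matrices are both `seq_len x seq_len` regardless of head size, whereas Hidden-MSE needs an extra projection matrix.

```python
import math
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def value_relation(values):
    """Row-wise softmax of V V^T / sqrt(d_k) for one head; shape (seq_len, seq_len)."""
    d_k = len(values[0])
    transposed = [list(col) for col in zip(*values)]
    scores = matmul(values, transposed)
    return [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]


def kl_divergence(p_rows, q_rows):
    """Mean over positions of KL(p || q) between matching rows."""
    total = 0.0
    for p, q in zip(p_rows, q_rows):
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return total / len(p_rows)


def hidden_mse(student_hidden, teacher_hidden, projection):
    """MSE after projecting student hidden states to the teacher's hidden size."""
    projected = matmul(student_hidden, projection)
    diffs = [(s - t) ** 2
             for s_row, t_row in zip(projected, teacher_hidden)
             for s, t in zip(s_row, t_row)]
    return sum(diffs) / len(diffs)


def random_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
seq_len = 5
teacher_head_dim, student_head_dim = 8, 4        # toy per-head sizes (assumed)
teacher_hidden_dim, student_hidden_dim = 16, 8   # toy hidden sizes (assumed)

# Value relation: teacher and student matrices already have matching shapes.
teacher_rel = value_relation(random_matrix(seq_len, teacher_head_dim))
student_rel = value_relation(random_matrix(seq_len, student_head_dim))
relation_loss = kl_divergence(teacher_rel, student_rel)

# Hidden-MSE: an additional parameter matrix bridges the size mismatch.
teacher_hidden = random_matrix(seq_len, teacher_hidden_dim)
student_hidden = random_matrix(seq_len, student_hidden_dim)
projection = random_matrix(student_hidden_dim, teacher_hidden_dim)
mse_loss = hidden_mse(student_hidden, teacher_hidden, projection)

print("value-relation KL:", relation_loss)
print("hidden-state MSE:", mse_loss)
```

In a real training setup the projection matrix would be learned jointly with the student and the losses would be averaged over heads and batches; both details are omitted here.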

Supplementary Material: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers

