Abstract: Massively multilingual Transformers (MMTs), such as mBERT and XLM-R, are widely used for cross-lingual transfer learning. While these are pretrained to represent hundreds of languages, end users of NLP systems are often interested only in individual languages. For such purposes, the MMTs' language coverage makes them unnecessarily expensive to deploy in terms of model size, inference time, energy, and hardware cost. We thus propose to extract compressed, language-specific models from MMTs which retain the capacity of the original MMTs for cross-lingual transfer. This is achieved by distilling the MMT bilingually, i.e., using data from only the source and target language of interest. Specifically, we use a two-phase distillation approach, termed BiStil: (i) the first phase distils a general bilingual model from the MMT, while (ii) the second, task-specific phase sparsely fine-tunes the bilingual "student" model using a task-tuned variant of the original MMT as its "teacher". We evaluate this distillation technique in zero-shot cross-lingual transfer across a number of standard cross-lingual benchmarks. The key results indicate that the distilled models exhibit minimal degradation in target language performance relative to the base MMT despite being significantly smaller and faster. Furthermore, we find that they outperform multilingually distilled models such as DistilmBERT and MiniLMv2 while having a very modest training budget in comparison, even on a per-language basis. We also show that bilingual models distilled from MMTs greatly outperform bilingual models trained from scratch. Our code and models are available at https://github.com/AlanAnsell/bistil.

What problem does this paper attempt to address?

This paper addresses the efficiency and performance problems of massively multilingual Transformers (MMTs) in cross-lingual transfer learning. Specifically, the authors focus on two main issues:

1. **Resource Consumption and Deployment Costs**:
   - Although MMTs can represent hundreds of languages, this broad coverage makes them unnecessarily large and expensive to deploy when only individual languages are needed, in terms of model size, inference time, energy consumption, and hardware cost. The burden falls hardest on low-resource language communities, which usually have limited computing resources.
2. **Performance Degradation Caused by Multilingualism**:
   - MMTs may suffer negative interference between the many languages they cover, the so-called "curse of multilinguality". This degrades performance in cross-lingual transfer tasks, most visibly for low-resource languages.

To solve these problems, the authors propose a method named **BISTILLATION**, which extracts compressed, language-specific models from MMTs through bilingual distillation, i.e., using data from only the source and target language of interest. The method significantly reduces the number of model parameters and the inference time, at the cost of only a slight drop in target-language performance, while retaining the cross-lingual transfer ability of the original MMT.

### The Core Idea of BISTILLATION

The BISTILLATION method consists of two stages:

1. **General Bilingual Distillation**:
   - Use data in the source language and the target language to distill a smaller student model from the MMT, so that it retains the cross-lingual transfer ability of the original MMT.
2. **Task-Specific Distillation**:
   - Building on the first stage, sparsely fine-tune the bilingual student model for a specific downstream task. In this stage, a task-tuned variant of the original MMT serves as the teacher.
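The two stages above can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation: the summary does not state which distillation objective or which sparse fine-tuning procedure is used, so the soft-target KL loss with a temperature, the boolean update mask, and the names `softmax`, `distillation_loss`, and `sparse_update` are all assumptions chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    Assumption: a generic soft-target objective, used here only to show
    how a student can be trained to match a teacher's predictions.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)
    # Scaling by T^2 keeps gradient magnitudes comparable across temperatures.
    return kl * temperature ** 2


def sparse_update(params, grads, mask, lr=0.1):
    """One gradient step that changes only the parameters selected by mask.

    Assumption: a fixed boolean mask stands in for whatever sparsity
    pattern the task-specific stage actually uses.
    """
    return [w - lr * g if keep else w for w, g, keep in zip(params, grads, mask)]


if __name__ == "__main__":
    teacher = [2.0, 0.5, -1.0]  # hypothetical logits from the task-tuned teacher
    student = [1.0, 1.0, 0.0]   # hypothetical logits from the bilingual student
    print("distillation loss:", distillation_loss(teacher, student))
    print("sparse step:", sparse_update([1.0, 2.0], [1.0, 1.0], [True, False], lr=0.5))
```

In a real setup the logits would come from Transformer models run on source- and target-language text, and the loss would be minimized with an automatic-differentiation framework. The sketch only shows the shape of the two ingredients: a teacher-matching loss and an update restricted to a subset of the student's parameters.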
Taken together, the two stages let BISTILLATION substantially reduce model size and inference time while keeping the drop in target-language performance small. The experimental results also show that the distilled models outperform multilingually distilled models such as DistilmBERT and MiniLMv2 on standard cross-lingual benchmarks, despite a much more modest training budget.

### Summary

The core problem of the paper is to improve the time and space efficiency of MMTs in cross-lingual transfer tasks while preserving their performance. BISTILLATION achieves this through bilingual distillation, offering a more efficient and economical way to deploy NLP systems, especially for low-resource language communities.

Distilling Efficient Language-Specific Models for Cross-Lingual Transfer
