Fast Vocabulary Transfer for Language Model Compression

Leonidas Gee, Andrea Zugarini, Leonardo Rigutini, Paolo Torroni
DOI: https://doi.org/10.18653/v1/2022.emnlp-industry.41
2024-02-15
Abstract: Real-world business applications require a trade-off between language model performance and size. We propose a new method for model compression that relies on vocabulary transfer. We evaluate the method on various vertical domains and downstream tasks. Our results indicate that vocabulary transfer can be effectively used in combination with other compression techniques, yielding a significant reduction in model size and inference time while marginally compromising on performance.
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper addresses the high computational cost, large memory footprint, and long inference time of large pre-trained language models (LMs) in practical applications. It proposes a compression method called Vocabulary Transfer (VT). The goal is to reduce model size and speed up inference by compressing the model's vocabulary while keeping performance degradation to a minimum. The main contributions can be summarized as follows:

1. **Proposing the Vocabulary Transfer technique**: A custom tokenizer is trained on the downstream task domain so that it fits the language distribution of a specific vertical domain or topic, which reduces the vocabulary size. Because in-domain text is split into fewer tokens, sequences become shorter, which lowers the computational cost of the attention layers and improves the model's efficiency.
2. **Fast Vocabulary Transfer (FVT)**: To transfer knowledge from a general pre-trained model to a smaller, domain-specific model, the FVT algorithm initializes the embedding representations of the new vocabulary from the embeddings of the general model, using a simple and effective procedure.
3. **Combining with Knowledge Distillation**: The paper shows that VT can be combined with other model compression techniques, such as Knowledge Distillation (KD), to further reduce model size and speed up inference. Experiments show that this combination yields a significantly higher compression rate and faster inference, with only a slight loss in performance.

In summary, this research explores how to reduce the size of pre-trained language models and improve their efficiency while maintaining high performance, which matters for practical business applications.
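The embedding initialization described under Fast Vocabulary Transfer can be illustrated with a minimal sketch. The summary above does not spell out the rule, so the version here rests on an assumption: a new token that already exists in the general vocabulary copies its embedding, and any other new token receives the average of the embeddings of the pieces the general tokenizer splits it into. The names `fvt_init` and `make_greedy_tokenizer`, the toy vocabulary, and the zero-vector fallback are all hypothetical and chosen for illustration only.

```python
from typing import Callable, Dict, List

Vector = List[float]


def make_greedy_tokenizer(vocab) -> Callable[[str], List[str]]:
    """Hypothetical stand-in for the general tokenizer: greedy longest match."""
    def tokenize(text: str) -> List[str]:
        pieces, start = [], 0
        while start < len(text):
            # Try the longest candidate substring first.
            for end in range(len(text), start, -1):
                if text[start:end] in vocab:
                    pieces.append(text[start:end])
                    start = end
                    break
            else:
                start += 1  # no piece matches here; skip one character
        return pieces
    return tokenize


def fvt_init(
    old_emb: Dict[str, Vector],
    new_vocab: List[str],
    old_tokenize: Callable[[str], List[str]],
) -> Dict[str, Vector]:
    """Initialize embeddings for new_vocab from old_emb (assumed FVT-style rule)."""
    dim = len(next(iter(old_emb.values())))
    new_emb: Dict[str, Vector] = {}
    for token in new_vocab:
        if token in old_emb:
            # Token shared by both vocabularies: copy the embedding.
            new_emb[token] = list(old_emb[token])
            continue
        pieces = [p for p in old_tokenize(token) if p in old_emb]
        if not pieces:
            # Assumption: fall back to zeros; random init is another option.
            new_emb[token] = [0.0] * dim
            continue
        # New token: average the embeddings of its general-vocabulary pieces.
        new_emb[token] = [
            sum(old_emb[p][i] for p in pieces) / len(pieces) for i in range(dim)
        ]
    return new_emb


# Toy general-domain embeddings (made-up values, 2 dimensions).
old_emb = {"card": [1.0, 0.0], "io": [0.0, 1.0], "logy": [2.0, 2.0]}
tokenizer = make_greedy_tokenizer(old_emb)
new_emb = fvt_init(old_emb, ["card", "cardiology", "xyz"], tokenizer)
```

In this toy setup the in-domain token `cardiology` is a single entry in the new vocabulary, whereas the general tokenizer needs three pieces (`card`, `io`, `logy`) for it, which illustrates how a domain-specific vocabulary can shorten sequences. In a real model the same rule would be applied to the rows of the embedding matrix rather than to Python lists.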