Fast Vocabulary Transfer for Language Model Compression

Leonidas Gee, Andrea Zugarini, Leonardo Rigutini, Paolo Torroni

DOI: https://doi.org/10.18653/v1/2022.emnlp-industry.41

2024-02-15

Abstract: Real-world business applications require a trade-off between language model performance and size. We propose a new method for model compression that relies on vocabulary transfer. We evaluate the method on various vertical domains and downstream tasks. Our results indicate that vocabulary transfer can be effectively used in combination with other compression techniques, yielding a significant reduction in model size and inference time while marginally compromising on performance.

Computation and Language, Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

The paper addresses the high computational cost, large memory footprint, and long inference time of large pre-trained language models (LMs) in practical applications. It proposes a compression method called Vocabulary Transfer (VT), which aims to shrink the model and speed up inference by compressing the model's vocabulary while keeping performance degradation to a minimum. The main contributions can be summarized as follows:

1. **Vocabulary Transfer (VT)**: A custom tokenizer is trained on the downstream task domain so that it matches the language distribution of a specific vertical domain or topic, which reduces the vocabulary size. The domain tokenizer can also shorten tokenized sequences, which lowers the computational cost of the attention layers and improves the model's efficiency.
2. **Fast Vocabulary Transfer (FVT)**: To carry knowledge from a general pre-trained model over to a smaller domain-specific model, the FVT algorithm initializes the embeddings of the new vocabulary from those of the general model. The method is simple and effective.
3. **Combination with Knowledge Distillation (KD)**: The paper shows that VT can be combined with other compression techniques such as KD to reduce model size and inference time further. Experiments show that the combination significantly improves the compression rate and inference speed, with only a slight loss in performance.

In summary, the research explores how to reduce the size of pre-trained language models and improve their efficiency while maintaining high performance, which matters for practical business applications.
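The summary above does not spell out how FVT initializes the new embeddings. The following is a minimal sketch under an assumed rule: a token that also exists in the old vocabulary keeps its old embedding, and any other token receives the average of the old embeddings of the pieces the old tokenizer splits it into. The names `fvt_init` and `greedy_tokenize`, the toy vocabularies, and the greedy longest-match tokenizer are all hypothetical stand-ins chosen for illustration, not the authors' code.

```python
from typing import Dict, List


def greedy_tokenize(word: str, vocab: Dict[str, int]) -> List[str]:
    """Toy stand-in for the old tokenizer: greedy longest-match split."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate span until it matches a known piece.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            raise ValueError(f"cannot tokenize {word!r} with the old vocabulary")
        pieces.append(word[start:end])
        start = end
    return pieces


def fvt_init(
    new_vocab: List[str],
    old_vocab: Dict[str, int],
    old_emb: List[List[float]],
) -> List[List[float]]:
    """Assumed rule: copy shared tokens, average old pieces for new tokens."""
    dim = len(old_emb[0])
    new_emb = []
    for token in new_vocab:
        if token in old_vocab:
            # Token exists in both vocabularies: reuse its embedding.
            new_emb.append(list(old_emb[old_vocab[token]]))
        else:
            # New token: average the embeddings of its old-tokenizer pieces.
            ids = [old_vocab[p] for p in greedy_tokenize(token, old_vocab)]
            new_emb.append(
                [sum(old_emb[i][d] for i in ids) / len(ids) for d in range(dim)]
            )
    return new_emb


# Hypothetical general-purpose vocabulary with 2-dimensional embeddings.
old_vocab = {"in": 0, "flam": 1, "mation": 2, "the": 3}
old_emb = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 4.0]]

# Hypothetical domain vocabulary: one shared token, one new in-domain token.
new_vocab = ["the", "inflammation"]
new_emb = fvt_init(new_vocab, old_vocab, old_emb)
print(new_emb)
```

In this toy setup the old tokenizer needs three pieces for "inflammation" while the domain vocabulary covers it with one token, which illustrates how a domain tokenizer can shorten sequences. With a real model, the old tokenizer and embedding matrix would come from the pre-trained checkpoint instead of these toy values.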

Multi-Dimension Compression of Feed-Forward Network in Vision Transformers

Adapt-and-Distill: Developing Small, Fast and Effective Pretrained Language Models for Domains

Language Modeling Is Compression

An Efficient Multilingual Language Model Compression through Vocabulary Trimming

Evaluating Large Language Models for Generalization and Robustness via Data Compression

Retrieval-based Knowledge Transfer: An Effective Approach for Extreme Large Language Model Compression

Extreme Model Compression for On-device Natural Language Understanding

Extending Context Window of Large Language Models via Semantic Compression

Model Compression and Efficient Inference for Large Language Models: A Survey

Adapting Language Models to Compress Contexts

Efficient Large Multi-modal Models via Visual Context Compression

Enhancing Inference Efficiency of Large Language Models: Investigating Optimization Strategies and Architectural Innovations

Variator: Accelerating Pre-trained Models with Plug-and-Play Compression Modules

Joint Goal for Word Embedding Compression Based on Word Frequency

Fast data-free model compression via dictionary-pair reconstruction

VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration

VidCompress: Memory-Enhanced Temporal Compression for Video Understanding in Large Language Models

LLM Vocabulary Compression for Low-Compute Environments

TCRA-LLM: Token Compression Retrieval Augmented Large Language Model for Inference Cost Reduction

Comprehensive Study on Performance Evaluation and Optimization of Model Compression: Bridging Traditional Deep Learning and Large Language Models