Abstract: Large Language Models have proven highly successful at modelling a variety of tasks. However, this comes at a steep computational cost that hinders wider industrial uptake. In this paper, we present MWT: a Multi-Word Tokenizer that goes beyond word boundaries by representing frequent multi-word expressions as single tokens. MWTs produce a more compact and efficient tokenization that yields two benefits: (1) Increase in performance due to a greater coverage of input data given a fixed sequence length budget; (2) Faster and lighter inference due to the ability to reduce the sequence length with negligible drops in performance. Our results show that MWT is more robust across shorter sequence lengths, thus allowing for major speedups via early sequence truncation.
What problem does this paper attempt to address?
This paper addresses the computational cost of large language models (LLMs) by compressing the input sequence with multi-word tokenization (MWT), which speeds up inference. Established model compression methods such as knowledge distillation, pruning, and quantization focus on shrinking the model architecture or reducing the number of floating-point operations. The length of the input sequence, however, is also an important driver of computation cost, and these methods leave it unchanged.
The paper proposes MWT, a tokenizer that goes beyond word boundaries by representing frequent multi-word expressions as single tokens, which yields a more compact and efficient tokenization. More of the input then fits within a fixed sequence length budget, and inference can be sped up by truncating sequences early. The experiments indicate that MWT is compatible with existing compression methods and maintains good performance when combined with distilled models.
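The core idea can be illustrated with a minimal sketch. It assumes whitespace-split words and frequent bigrams only, and the helper names `top_bigrams` and `merge_bigrams` are hypothetical; it is not the authors' implementation, which operates on the vocabulary of a pretrained tokenizer.

```python
from collections import Counter


def top_bigrams(corpus, k):
    """Return the k most frequent adjacent word pairs in a list of word lists."""
    counts = Counter()
    for words in corpus:
        counts.update(zip(words, words[1:]))
    return {pair for pair, _ in counts.most_common(k)}


def merge_bigrams(words, bigrams):
    """Greedily merge selected bigrams, left to right, into single tokens."""
    out, i = [], 0
    while i < len(words):
        if i + 1 < len(words) and (words[i], words[i + 1]) in bigrams:
            # Assumption: an underscore joins the two words into one token.
            out.append(words[i] + "_" + words[i + 1])
            i += 2
        else:
            out.append(words[i])
            i += 1
    return out


# Toy corpus, invented for illustration only.
corpus = [
    "the patient has chest pain".split(),
    "chest pain is a common symptom".split(),
    "the patient reports chest pain".split(),
]
bigrams = top_bigrams(corpus, k=2)
merged = merge_bigrams(corpus[0], bigrams)
print(len(corpus[0]), "->", len(merged), merged)
```

In this toy case the first sentence shrinks from five tokens to three, so a budget of three tokens covers the whole sentence instead of only its first three words. That is the coverage effect the paper relies on when truncating sequences early.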
The researchers evaluate MWT on three text classification datasets covering the medical, legal, and technical domains. The results show that MWT significantly shortens sequences and speeds up model processing with minimal impact on performance. In some cases, performance hardly decreases even when the sequence length is halved. Furthermore, combining MWT with domain-specific tokenizers, especially in the medical and legal domains, yields further improvements in inference speed.
The paper also combines MWT with knowledge distillation and finds that the combination delivers further gains in inference speed while maintaining similar performance. Although MWT is less effective on certain tasks (such as patent classification), overall it provides an effective way to reduce the computation cost of LLMs.