Abstract: Large Language Models have proven highly successful at modelling a variety of tasks. However, this comes at a steep computational cost that hinders wider industrial uptake. In this paper, we present MWT: a Multi-Word Tokenizer that goes beyond word boundaries by representing frequent multi-word expressions as single tokens. MWTs produce a more compact and efficient tokenization that yields two benefits: (1) Increase in performance due to a greater coverage of input data given a fixed sequence length budget; (2) Faster and lighter inference due to the ability to reduce the sequence length with negligible drops in performance. Our results show that MWT is more robust across shorter sequence lengths, thus allowing for major speedups via early sequence truncation.

What problem does this paper attempt to address?

This paper addresses how to shorten the input sequences of large language models (LLMs) through multi-word tokenization (MWT), thereby reducing computational cost and speeding up inference. Traditional model compression methods such as knowledge distillation, pruning, and quantization focus on shrinking the model architecture or reducing the number of floating-point operations. For LLMs, however, the length of the input sequence is also an important factor in computational cost.

The paper proposes MWT, a tokenizer that goes beyond word boundaries by representing frequent multi-word expressions as single tokens, which yields a more compact and efficient tokenization. More information then fits within a fixed sequence length, and inference can be sped up by truncating sequences early.

The researchers evaluate MWT on three text classification datasets covering the medical, legal, and technical domains. The results show that MWT significantly shortens sequences and speeds up processing with minimal impact on performance. In some cases, performance hardly decreases even when the sequence length is halved. Combining MWT with domain-specific tokenizers, especially in the medical and legal domains, yields further gains in inference speed.

The experiments also show that MWT is compatible with traditional compression methods: combined with distilled models, it delivers further improvements in inference speed while maintaining similar performance. Although the benefit of MWT is less pronounced on certain tasks (such as patent classification), overall it offers an effective way to reduce the computational cost of LLMs.
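The core idea of representing frequent multi-word expressions as single tokens can be illustrated with a minimal sketch. It is not the paper's implementation: it works on whitespace-split words rather than on the output of a real subword tokenizer, and the helper names `top_ngrams` and `merge_ngrams`, the underscore joiner, and the toy corpus are all assumptions made for illustration.

```python
from collections import Counter


def top_ngrams(corpus, n=2, k=1):
    """Return the k most frequent word n-grams in the corpus as a set of tuples."""
    counts = Counter()
    for sentence in corpus:
        words = sentence.lower().split()
        # zip over n shifted copies of the word list yields consecutive n-grams
        counts.update(zip(*(words[i:] for i in range(n))))
    return {gram for gram, _ in counts.most_common(k)}


def merge_ngrams(words, ngrams, n=2):
    """Greedily replace each selected n-gram with one joined token, left to right."""
    out, i = [], 0
    while i < len(words):
        gram = tuple(words[i:i + n])
        if len(gram) == n and gram in ngrams:
            out.append("_".join(gram))  # hypothetical joiner for the merged token
            i += n
        else:
            out.append(words[i])
            i += 1
    return out


# Toy corpus (illustrative only, not taken from the paper's datasets).
corpus = [
    "adverse drug reaction was reported",
    "no adverse drug reaction observed",
    "the drug reaction was mild",
]
frequent = top_ngrams(corpus, n=2, k=1)
words = corpus[0].split()
compressed = merge_ngrams(words, frequent, n=2)
print(len(words), len(compressed), compressed)
```

In this toy case the most frequent bigram is merged into one token, so the same sentence occupies fewer positions. Under a fixed sequence length budget, that leaves room for more of the input before truncation.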

Multi-word Tokenization for Sequence Compression

MultiTok: Variable-Length Tokenization for Efficient LLMs Adapted from LZW Compression

Learn Your Tokens: Word-Pooled Tokenization for Language Modeling

An Efficient Multilingual Language Model Compression through Vocabulary Trimming

Efficient Time Series Processing for Transformers and State-Space Models through Token Merging

TCRA-LLM: Token Compression Retrieval Augmented Large Language Model for Inference Cost Reduction

Rethinking Tokenization: Crafting Better Tokenizers for Large Language Models

Extending Context Window of Large Language Models via Semantic Compression

Efficient Multi-modal Large Language Models via Visual Token Grouping

Tokenization Is More Than Compression

Unpacking Tokenization: Evaluating Text Compression and its Correlation with Model Performance

Retrofitting (Large) Language Models with Dynamic Tokenization

TensorGPT: Efficient Compression of Large Language Models based on Tensor-Train Decomposition

Joint Goal for Word Embedding Compression Based on Word Frequency

Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding

Fast Vocabulary Transfer for Language Model Compression

Unlocking Tokens as Data Points for Generalization Bounds on Larger Language Models

Better & Faster Large Language Models via Multi-token Prediction

ReTok: Replacing Tokenizer to Enhance Representation Efficiency in Large Language Model

West: Word Encoded Sequence Transducers

Tokenization Falling Short: On Subword Robustness in Large Language Models