Abstract: The recent success of Large Language Models (LLMs) has been predominantly driven by curating the training dataset composition, scaling of model architectures and dataset sizes, and advancements in pretraining objectives, leaving tokenizer influence as a blind spot. Shedding light on this underexplored area, we conduct a comprehensive study on the influence of tokenizer choice on LLM downstream performance by training 24 mono- and multilingual LLMs at a 2.6B parameter scale, ablating different tokenizer algorithms and parameterizations. Our studies highlight that the tokenizer choice can significantly impact the model's downstream performance and training costs. In particular, we find that the common tokenizer evaluation metrics fertility and parity are not always predictive of model downstream performance, rendering these metrics a questionable proxy for the model's downstream performance. Furthermore, we show that multilingual tokenizers trained on the five most frequent European languages require vocabulary size increases of factor three in comparison to English. While English-centric tokenizers have been applied to the training of multilingual LLMs in the past, we find that this approach results in a severe downstream performance degradation and additional training costs of up to 68%, due to an inefficient tokenization vocabulary.

What problem does this paper attempt to address?

The paper examines the role of tokenizers in training large language models (LLMs) and their effect on downstream task performance. Specifically, it addresses the following key issues:

1. **The Importance of Tokenizer Choice**: Investigating how different tokenization algorithms and parameter settings affect LLM downstream performance.
2. **The Correlation Between Intrinsic and Extrinsic Evaluation**: Exploring whether intrinsic tokenizer quality metrics (such as fertility and parity) predict downstream model performance.
3. **Challenges in Multilingual Contexts**: Analyzing the problems that arise when English-centric tokenizers are applied to multilingual LLMs, and the need to train tokenizers on multilingual data.

To study these issues, the authors trained 24 mono- and multilingual LLMs at a 2.6B parameter scale, ablating tokenizer algorithms, parameter configurations, and vocabulary sizes. They found that tokenizer choice significantly affects both downstream performance and training costs, and that the common intrinsic metrics are not always predictive of downstream performance. Multilingual tokenizers trained on the five most frequent European languages required roughly three times the vocabulary size of English tokenizers. Applying an English-centric tokenizer to multilingual training led to severe downstream performance degradation and additional training costs of up to 68%, which the authors attribute to an inefficient tokenization vocabulary.

Overall, the paper aims to fill the research gap regarding the impact of tokenizers on downstream model performance and offers guidance on selecting and configuring tokenizers.
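
To make the two intrinsic metrics named in the abstract concrete, here is a minimal sketch of how fertility and parity are commonly computed. It assumes fertility means the average number of tokens per whitespace-separated word, and parity means the ratio of token counts over parallel sentences in two languages (1.0 means both are tokenized equally compactly). The `toy_tokenize` function is a hypothetical stand-in, not any tokenizer used in the paper, and the exact metric definitions in the paper may differ in detail.

```python
from typing import Callable, List, Sequence

Tokenizer = Callable[[str], List[str]]


def toy_tokenize(text: str, max_piece_len: int = 4) -> List[str]:
    """Hypothetical stand-in tokenizer: cut each word into pieces of at most
    `max_piece_len` characters. A real study would use BPE or Unigram."""
    pieces: List[str] = []
    for word in text.split():
        for start in range(0, len(word), max_piece_len):
            pieces.append(word[start:start + max_piece_len])
    return pieces


def fertility(texts: Sequence[str], tokenize: Tokenizer) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words


def parity(texts_a: Sequence[str], texts_b: Sequence[str],
           tokenize: Tokenizer) -> float:
    """Token-count ratio of language A to language B over parallel sentences."""
    tokens_a = sum(len(tokenize(t)) for t in texts_a)
    tokens_b = sum(len(tokenize(t)) for t in texts_b)
    return tokens_a / tokens_b


# Illustrative parallel sentences (made up for this sketch).
english = ["the cat sleeps"]
german = ["die Katze schlummert"]

print("fertility (en):", fertility(english, toy_tokenize))
print("fertility (de):", fertility(german, toy_tokenize))
print("parity (de/en):", parity(german, english, toy_tokenize))
```

A parity above 1.0 in this sketch means the German sentence needs more tokens than its English counterpart. That is the kind of inefficiency the abstract associates with English-centric vocabularies, although the paper's finding is that such intrinsic scores do not always predict downstream performance.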

Tokenizer Choice For LLM Training: Negligible or Crucial?
