Tokenizer Choice For LLM Training: Negligible or Crucial?

Mehdi Ali, Michael Fromm, Klaudia Thellmann, Richard Rutmann, Max Lübbering, Johannes Leveling, Katrin Klug, Jan Ebert, Niclas Doll, Jasper Schulze Buschhoff, Charvi Jain, Alexander Arno Weber, Lena Jurkschat, Hammam Abdelwahab, Chelsea John, Pedro Ortiz Suarez, Malte Ostendorff, Samuel Weinbach, Rafet Sifa, Stefan Kesselheim, Nicolas Flores-Herr
2024-03-17
Abstract: The recent success of Large Language Models (LLMs) has been predominantly driven by curating the training dataset composition, scaling of model architectures and dataset sizes and advancements in pretraining objectives, leaving tokenizer influence as a blind spot. Shedding light on this underexplored area, we conduct a comprehensive study on the influence of tokenizer choice on LLM downstream performance by training 24 mono- and multilingual LLMs at a 2.6B parameter scale, ablating different tokenizer algorithms and parameterizations. Our studies highlight that the tokenizer choice can significantly impact the model's downstream performance and training costs. In particular, we find that the common tokenizer evaluation metrics fertility and parity are not always predictive of model downstream performance, rendering these metrics a questionable proxy for the model's downstream performance. Furthermore, we show that multilingual tokenizers trained on the five most frequent European languages require vocabulary size increases of factor three in comparison to English. While English-centric tokenizers have been applied to the training of multi-lingual LLMs in the past, we find that this approach results in a severe downstream performance degradation and additional training costs of up to 68%, due to an inefficient tokenization vocabulary.
Machine Learning
What problem does this paper attempt to address?
The paper examines the role of the tokenizer in training large language models (LLMs) and its effect on downstream task performance. Specifically, it addresses the following questions:

1. **The Importance of Tokenizer Choice**: How do different tokenization algorithms and parameter settings affect LLM downstream performance?
2. **The Correlation Between Intrinsic and Extrinsic Evaluation**: Do intrinsic tokenizer metrics (fertility and parity) predict how well a model performs on downstream tasks?
3. **Challenges in Multilingual Contexts**: What goes wrong when an English-centric tokenizer is applied to a multilingual LLM, and how much does training the tokenizer on multilingual data help?

To answer these questions, the authors trained 24 monolingual and multilingual LLMs at a 2.6B parameter scale, ablating tokenizer algorithms and parameterizations, including vocabulary size. They find that tokenizer choice can significantly affect both downstream performance and training cost, and that fertility and parity do not always predict downstream performance. They also report that multilingual tokenizers trained on the five most frequent European languages need a vocabulary roughly three times larger than an English tokenizer, and that using an English-centric tokenizer for multilingual training causes severe downstream performance degradation and additional training costs of up to 68%.

Overall, the paper aims to fill a research gap on how tokenizers influence downstream model performance and offers guidance on selecting and configuring them.
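To make the two intrinsic metrics concrete, here is a minimal Python sketch. It assumes the commonly used definitions: fertility as the average number of tokens produced per whitespace-separated word, and parity as the ratio of token counts for the same sentence in two languages (a value near 1 means both languages are tokenized about equally efficiently). The `toy_tokenizer` function is a hypothetical stand-in that splits words into fixed-size chunks; it is not one of the tokenizers studied in the paper, and the paper's exact metric definitions may differ in detail.

```python
from typing import Callable, List, Sequence, Tuple

Tokenizer = Callable[[str], List[str]]


def toy_tokenizer(text: str, piece_len: int = 4) -> List[str]:
    """Hypothetical stand-in tokenizer: split each whitespace-separated
    word into chunks of at most `piece_len` characters."""
    tokens: List[str] = []
    for word in text.split():
        tokens.extend(word[i:i + piece_len] for i in range(0, len(word), piece_len))
    return tokens


def fertility(tokenize: Tokenizer, texts: Sequence[str]) -> float:
    """Average number of tokens per whitespace-separated word (assumed definition)."""
    n_words = sum(len(t.split()) for t in texts)
    if n_words == 0:
        raise ValueError("texts contain no words")
    n_tokens = sum(len(tokenize(t)) for t in texts)
    return n_tokens / n_words


def parity(tokenize: Tokenizer, pairs: Sequence[Tuple[str, str]]) -> float:
    """Mean ratio of token counts over parallel sentence pairs (a, b).
    Values near 1.0 mean both languages need a similar number of tokens."""
    ratios = []
    for a, b in pairs:
        len_b = len(tokenize(b))
        if len_b == 0:
            raise ValueError("reference sentence produced no tokens")
        ratios.append(len(tokenize(a)) / len_b)
    if not ratios:
        raise ValueError("no sentence pairs given")
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    print(fertility(toy_tokenizer, ["the cat sat", "tokenization"]))
    print(parity(toy_tokenizer, [("the house", "das Haus")]))
```

To score a real tokenizer, pass its encode function in place of `toy_tokenizer`. A higher fertility means longer token sequences for the same text, which is one way an ill-fitting vocabulary can raise training cost, although the paper cautions that these metrics do not always predict downstream performance.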