Abstract: Large Language Models (LLMs) based on transformer architectures have revolutionized a variety of domains, with tokenization playing a pivotal role in their pre-processing and fine-tuning stages. In multilingual models, particularly those tailored for Indic languages, effective tokenization is crucial for optimizing performance. This paper presents a comprehensive evaluation of tokenizers used by 12 LLMs across all 22 official languages of India, with a focus on comparing the efficiency of their tokenization processes. We employed the Normalized Sequence Length (NSL) as a key metric in our analysis. Our findings reveal that the SUTRA tokenizer outperforms all other models, including several Indic-specific models, excelling in 14 languages. Notable insights include the SUTRA tokenizer's superior handling of Indic languages, GPT-4o's advancement over its predecessor GPT-4 in processing Indian languages, and the limited performance of Project Indus in certain languages. This study underscores the critical importance of developing targeted tokenization strategies for multilingual and Indic-centric models, laying the groundwork for future improvements in tokenizer design to enhance linguistic coverage and model efficiency.

What problem does this paper attempt to address?

This paper evaluates how efficiently the tokenizers of large language models (LLMs) handle the official languages of India. Specifically, it compares the tokenization efficiency of 12 LLMs across all 22 official languages. Using Normalized Sequence Length (NSL) as the key metric, the study shows how the tokenizers differ on these languages and suggests improvements to future tokenizer design for multilingual and Indic-centric models.

### Main research questions:

1. **Tokenizer performance evaluation**: Evaluate the tokenizers of 12 large language models on the 22 official languages of India.
2. **Tokenizer efficiency comparison**: Use Normalized Sequence Length (NSL) as the main metric to compare the efficiency of different tokenizers.
3. **Specific model performance**: Analyze which tokenizers perform best in which languages, especially the SUTRA tokenizer.
4. **Improvement directions**: Explore how to develop more effective tokenization strategies for multilingual and Indic-centric models.

### Research background:

- **Large language models (LLMs)**: LLMs based on the Transformer architecture have made significant progress in multiple fields, and tokenization plays a crucial role in their pre-processing and fine-tuning stages.
- **Multilingual models**: For multilingual models, especially those targeting Indic languages, effective tokenization is essential for optimizing performance.
- **Evaluation metrics**: Normalized Sequence Length (NSL) is used as the key metric for evaluating tokenizer efficiency.

### Methodology:

1. **Example texts**: Collected example texts in the 22 official languages of India. Each language's text is written in its primary script, so that the evaluation reflects the tokenizer's ability to handle native scripts.
2. **Model selection**: Selected 12 models, including proprietary multilingual models and open-weight multilingual and Indic models.
3. **Evaluation metrics**: Used Normalized Sequence Length (NSL) as the evaluation metric, calculated as

   \[ c_{\lambda\beta} = \frac{\sum_{i = 1}^{N}\text{length}(T_\lambda(D_i))}{\sum_{i = 1}^{N}\text{length}(T_\beta(D_i))} \]

   where \( T_\lambda \) and \( T_\beta \) are two different tokenizers, \( D_i \) is the \( i \)-th example text, and \( N \) is the number of example texts.

### Results:

- **SUTRA tokenizer**: Performs best in 14 languages, showing its superior handling of Indic languages.
- **GPT-4o**: Improves significantly on its predecessor GPT-4 in handling Indian languages.
- **Project Indus**: Has limited performance in some languages, especially languages that are not written in the Devanagari script.

### Discussion:

- **Multilingual ability**: The SUTRA tokenizer demonstrates strong multilingual processing ability, especially for Indic languages.
- **Tokenizer efficiency**: An efficient tokenizer can reduce the demand for computing resources and improve the training speed and overall performance of the model.
- **Future directions**: Future research can focus on improving tokenizers to better handle languages with complex scripts or large dialect variations, thereby improving model performance on both high-resource and low-resource languages.

### Practical applications and future directions:

- **Multilingual model development**: The results are relevant to the development of multilingual models across Indian languages.
- **Tokenizer optimization**: Future research can further optimize tokenizers to improve their efficiency in handling complex language structures and multilingual contexts.

Through this evaluation, the paper provides reference points and improvement suggestions for the design of tokenizers in multilingual and Indic-centric models.
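The NSL formula from the methodology can be sketched in a few lines of Python. This is a minimal illustration, not the paper's evaluation code: the function name `normalized_sequence_length` and the two toy tokenizers (`whitespace_tokenizer`, `character_tokenizer`) are hypothetical stand-ins for the real LLM tokenizers, and the sample texts are made up for the example.

```python
from typing import Callable, List, Sequence

# A tokenizer is modeled as any function mapping text to a list of tokens.
Tokenizer = Callable[[str], List[str]]


def normalized_sequence_length(
    tokenize: Tokenizer, baseline: Tokenizer, docs: Sequence[str]
) -> float:
    """Total tokens from `tokenize` divided by total tokens from `baseline`.

    Mirrors c_{lambda,beta} = sum_i len(T_lambda(D_i)) / sum_i len(T_beta(D_i)).
    """
    numerator = sum(len(tokenize(doc)) for doc in docs)
    denominator = sum(len(baseline(doc)) for doc in docs)
    if denominator == 0:
        raise ValueError("baseline tokenizer produced no tokens")
    return numerator / denominator


# Hypothetical toy tokenizers, used only to make the sketch runnable.
def whitespace_tokenizer(text: str) -> List[str]:
    """Split on whitespace: one token per word."""
    return text.split()


def character_tokenizer(text: str) -> List[str]:
    """One token per non-whitespace code point (a deliberately inefficient baseline)."""
    return [ch for ch in text if not ch.isspace()]


if __name__ == "__main__":
    # Made-up sample texts; the study uses texts in each language's own script.
    sample_docs = ["नमस्ते दुनिया", "tokenizers matter"]
    score = normalized_sequence_length(
        whitespace_tokenizer, character_tokenizer, sample_docs
    )
    print(f"NSL of whitespace vs. character tokenizer: {score:.3f}")
```

Under this definition, a value below 1 means the first tokenizer produces fewer tokens than the baseline on the same texts, and a value of exactly 1 means both produce the same total. To compare real models, each toy function would be replaced by a call into the corresponding model's tokenizer.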

Evaluating Tokenizer Performance of Large Language Models Across Official Indian Languages
