Abstract: In the development of Large Language Models (LLMs), considerable attention has been given to the quality of training datasets. However, the role of tokenizers in the LLM training pipeline, particularly for multilingual models, has received less focus. The quality of tokenization can significantly impact a model's ability to handle diverse languages effectively. We introduce Qtok, a tool designed to assess tokenizer quality, with a specific emphasis on performance in multilingual contexts. Our research proposes a set of metrics for evaluating tokenizer quality, including measures of language coverage, token completeness, and distribution across languages and linguistic categories. Qtok applies these metrics to evaluate 13 distinct tokenizers from 58 publicly available models, analyzing their output across different linguistic contexts. Our analysis revealed significant variations in token distribution across languages and categories, highlighting potential biases and areas for improvement in current tokenization strategies. This research contributes to the field of tokenizer evaluation within multilingual LLM development by providing a systematic approach to assessing tokenizer quality. Our findings highlight the critical role of tokenization in multilingual LLM capability. The Qtok tool and our analysis methodology offer practical means for researchers to evaluate and improve tokenization strategies for multilingual applications. We offer a method to compare tokenizer quality across these metrics, which may be useful when selecting or adjusting tokenizers for specific multilingual LLM applications.

A Statistical Method for Uyghur Tokenization

Uyghur-Chinese statistical machine translation by incorporating morphological information

Polygon-Location Method Based on Uyghur Text Regional Rules

Error Analysis of Uyghur Name Tagging: Language-specific Techniques and Remaining Challenges

Text Filtering through Multi-Pattern Matching: A Case Study of Wu–Manber–Uy on the Language of Uyghur

Design and implementation of prototype system for online handwritten Uyghur character recognition

Man-Machine Speech Communication

Multi-font Multi-Size Printed Uyghur Character Recognition

A Multilingual Language Processing Tool for Uyghur, Kazak and Kirghiz

Learning Distributed Representations Of Uyghur Words And Morphemes

Leveraging Phone Mask Training for Phonetic-Reduction-Robust E2E Uyghur Speech Recognition

Improving Uyghur ASR systems with decoders using morpheme-based language models

Word Level Script Recognition for Uighur Document Mixed with English Script

Uyghur Morphological Segmentation with Bidirectional GRU Neural Networks

The Foundations of Tokenization: Statistical and Computational Concerns

An open/free database and Benchmark for Uyghur speaker recognition

Exploring the Benefits of Tokenization of Discrete Acoustic Units

To token or not to token: A Comparative Study of Text Representations for Cross-Lingual Transfer

Uyghur Character Models with Shared Structure Information for Segmentation-free Recognition under Low Data Resource Conditions

Qtok: A Comprehensive Framework for Evaluating Multilingual Tokenizer Quality in Large Language Models

Towards Universal Speech Discrete Tokens: A Case Study for ASR and TTS