Abstract: A major consideration in multilingual language modeling is how to best represent languages with diverse vocabularies and scripts. Although contemporary text encoding methods cover most of the world's writing systems, they exhibit bias towards the high-resource languages of the Global West. As a result, texts of underrepresented languages tend to be segmented into long sequences of linguistically meaningless units. To address the disparities, we introduce a new paradigm that encodes the same information with segments of consistent size across diverse languages. Our encoding convention (MYTE) is based on morphemes, as their inventories are more balanced across languages than characters, which are used in previous methods. We show that MYTE produces shorter encodings for all 99 analyzed languages, with the most notable improvements for non-European languages and non-Latin scripts. This, in turn, improves multilingual LM performance and diminishes the perplexity gap throughout diverse languages.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper addresses text representation in multilingual language models, in particular how to better represent languages with different vocabularies and writing systems. The main concerns are:

1. **Bias in Existing Encoding Methods**: Current text encoding methods (such as UTF-8) support most writing systems but are biased towards high-resource Western languages, so text in low-resource languages or non-Latin scripts is split into long sequences of linguistically meaningless units.
2. **Overly Long Encoding Sequences**: For many non-Latin languages, UTF-8 often produces excessively long byte sequences, which increases the cost of model training and inference and reduces sample efficiency.
3. **Fairness Issues**: Significant differences in encoding length between languages affect the performance of multilingual models and put users of some languages at a disadvantage when it comes to billing for APIs such as ChatGPT.

To address these issues, the authors propose a new encoding method called MYTE (Morphology-Driven Byte Encoding), which aims for fair representation across languages and writing systems through morphology-based byte encoding. Specifically, MYTE improves text representation in the following ways:

- **Morphology-Driven Encoding**: Replaces character-based encoding with a morpheme-based approach, since morpheme inventories are more balanced across languages than character inventories.
- **Balanced Segmentation Granularity**: Produces more consistent encoding lengths across languages and writing systems, thereby improving the performance of multilingual language models and reducing computational costs.
- **Shorter Encoding Sequences**: Experiments show that MYTE shortens the encodings of all 99 analyzed languages, with the most notable gains for non-European languages and non-Latin scripts.
In summary, by proposing the MYTE encoding scheme, this paper aims to achieve a fairer and more efficient method for multilingual text representation.
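As a rough illustration of the sequence-length disparity described above, the sketch below counts UTF-8 bytes per Unicode code point for a few sample words. It is not the paper's evaluation and does not implement MYTE; the function name `utf8_bytes_per_char` and the sample words (each meaning roughly "language") are assumptions made for this example.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per Unicode code point in `text`."""
    if not text:
        return 0.0
    return len(text.encode("utf-8")) / len(text)


# Hypothetical sample words, one per script; not taken from the paper.
samples = {
    "English (Latin)": "language",
    "Russian (Cyrillic)": "язык",
    "Hindi (Devanagari)": "भाषा",
    "Chinese (Han)": "语言",
}

for name, word in samples.items():
    n_bytes = len(word.encode("utf-8"))
    ratio = utf8_bytes_per_char(word)
    # A byte-level model sees one input unit per byte, so a higher ratio
    # means a longer input sequence for the same number of code points.
    print(f"{name}: {len(word)} code points, {n_bytes} bytes, {ratio:.1f} bytes/char")
```

Basic Latin letters take one byte each in UTF-8, Cyrillic letters take two, and Devanagari and Han characters take three, so a byte-level model processes longer sequences for the non-Latin scripts even when the words are comparable in meaning.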

MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling
