MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling

Tomasz Limisiewicz, Terra Blevins, Hila Gonen, Orevaoghene Ahia, Luke Zettlemoyer
2024-03-16
Abstract: A major consideration in multilingual language modeling is how to best represent languages with diverse vocabularies and scripts. Although contemporary text encoding methods cover most of the world's writing systems, they exhibit bias towards the high-resource languages of the Global West. As a result, texts of underrepresented languages tend to be segmented into long sequences of linguistically meaningless units. To address the disparities, we introduce a new paradigm that encodes the same information with segments of consistent size across diverse languages. Our encoding convention (MYTE) is based on morphemes, as their inventories are more balanced across languages than characters, which are used in previous methods. We show that MYTE produces shorter encodings for all 99 analyzed languages, with the most notable improvements for non-European languages and non-Latin scripts. This, in turn, improves multilingual LM performance and diminishes the perplexity gap throughout diverse languages.
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper addresses text representation in multilingual language models, in particular how to represent languages with different vocabularies and writing systems. It focuses on three issues:

1. **Bias in Existing Encoding Methods**: Current text encoding methods (such as UTF-8) support most writing systems but are biased towards high-resource Western languages, so text in low-resource languages or non-Latin scripts is split into long sequences of linguistically meaningless units.
2. **Overly Long Encoding Sequences**: For many languages written in non-Latin scripts, UTF-8 produces long byte sequences, which raises the cost of model training and inference and reduces sample efficiency.
3. **Fairness Issues**: Large differences in encoding length across languages affect the performance of multilingual models and put users of some languages at a disadvantage in the billing of certain APIs (such as ChatGPT).

To address these issues, the authors propose a new encoding method called MYTE (Morphology-Driven Byte Encoding), which aims for fair representation across languages and writing systems through morphology-based byte encoding. Specifically, MYTE changes text representation in the following ways:

- **Morphology-Driven Encoding**: Replaces character-based encoding with a morpheme-based one, since morpheme inventories are more balanced across languages than character inventories.
- **Balanced Segmentation Granularity**: Keeps encoding lengths more consistent across languages and writing systems, which improves multilingual language model performance and reduces computational cost.
- **Shorter Encoding Sequences**: Experiments show that MYTE produces shorter encodings for all 99 analyzed languages, with the most notable improvements for non-European languages and non-Latin scripts.
In summary, by proposing the MYTE encoding scheme, this paper aims to achieve a fairer and more efficient method for multilingual text representation.
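The length disparity that motivates the paper can be seen directly in plain UTF-8. The sketch below is an illustration only and is not part of MYTE: it counts code points and UTF-8 bytes for one word in each of four scripts. The sample words and the helper name `utf8_stats` are assumptions chosen for this example, not data or code from the paper.

```python
# Illustrative only: the sample words are arbitrary choices, not data from the paper.
SAMPLES = {
    "English (Latin)": "language",
    "Russian (Cyrillic)": "язык",
    "Hindi (Devanagari)": "भाषा",
    "Chinese (Han)": "语言",
}


def utf8_stats(text: str) -> tuple[int, int, float]:
    """Return (code points, UTF-8 bytes, bytes per code point) for `text`."""
    n_chars = len(text)                  # number of Unicode code points
    n_bytes = len(text.encode("utf-8"))  # length of the UTF-8 byte sequence
    return n_chars, n_bytes, n_bytes / n_chars


for name, word in SAMPLES.items():
    chars, nbytes, ratio = utf8_stats(word)
    print(f"{name:20} chars={chars:2} bytes={nbytes:2} bytes/char={ratio:.1f}")
```

In these samples, each Latin letter takes one byte, each Cyrillic letter two, and each Devanagari or Han character three. This shows only the per-character cost of UTF-8; the paper's concern is the byte length needed to express the same information in different languages, which it measures across 99 languages.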