MC$^2$: Towards Transparent and Culturally-Aware NLP for Minority Languages in China

Chen Zhang, Mingxu Tao, Quzhe Huang, Jiuheng Lin, Zhibin Chen, Yansong Feng

2024-06-13

Abstract: Current large language models demonstrate deficiencies in understanding low-resource languages, particularly the minority languages in China. This limitation stems from the scarcity of available pre-training data. To address this accessibility challenge, we present MC$^2$, a Multilingual Corpus of Minority Languages in China, which is the largest open-source corpus of its kind so far. MC$^2$ includes four underrepresented languages: Tibetan, Uyghur, Kazakh, and Mongolian. Notably, we focus on the less common writing systems of Kazakh and Mongolian, i.e., Kazakh Arabic script and traditional Mongolian script, respectively, which have long been neglected in previous corpus construction efforts. Recognizing the prevalence of language contamination within existing corpora, we adopt a quality-centric solution for collecting MC$^2$, prioritizing accuracy while enhancing diversity. Furthermore, we underscore the importance of attending to the multiplicity of writing systems, which is closely related to the cultural awareness of the resulting models. The MC$^2$ corpus and related models are made public to the community.

Computation and Language

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper aims to address the shortcomings of large language models in understanding low-resource languages, particularly the minority languages in China. Specifically, it focuses on the following points:

1. **Data Scarcity**: Current large language models rely primarily on data from high-resource languages during pre-training, leaving minority languages severely underrepresented.
2. **Corpus Quality Issues**: Existing multilingual corpora have significant quality problems for low-resource languages, including language identification errors and insufficient data cleaning.
3. **Diversity of Writing Systems**: For languages written in more than one script (such as Kazakh and Mongolian), existing datasets usually cover only the more common writing system and neglect the less common one.

To address these issues, the authors propose MC$^2$ (Multilingual Corpus of Minority Languages in China), which they describe as the largest open-source corpus of its kind to date. The corpus covers four minority languages: Tibetan, Uyghur, Kazakh, and Mongolian, with a particular focus on the less common writing systems, namely Kazakh Arabic script and traditional Mongolian script. The paper also explores the technical challenges and cultural differences associated with different writing systems, emphasizing the importance of collecting high-quality and diverse corpora.
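To make the language-identification and writing-system points concrete, here is a minimal sketch of a script-level check based on Unicode block ranges. It is an illustration only and not the authors' pipeline: the names `SCRIPT_RANGES`, `script_profile`, and `dominant_script`, and the `min_share` threshold, are all assumptions introduced for this example. A check like this can separate traditional Mongolian script from Cyrillic, or flag a document that mixes scripts, but it cannot tell Uyghur from Kazakh Arabic script, since both fall in the Arabic block; that distinction would need an actual language-identification model.

```python
from collections import Counter

# Coarse Unicode block ranges (assumption: block membership is a good enough
# proxy for script in this sketch; extended blocks are deliberately omitted).
SCRIPT_RANGES = {
    "tibetan": [(0x0F00, 0x0FFF)],
    "mongolian_traditional": [(0x1800, 0x18AF)],
    "arabic": [(0x0600, 0x06FF)],    # shared by Uyghur and Kazakh Arabic script
    "cyrillic": [(0x0400, 0x04FF)],
    "han": [(0x4E00, 0x9FFF)],
}


def script_of(ch):
    """Return the script name for one character, or None if it is not covered."""
    cp = ord(ch)
    for name, ranges in SCRIPT_RANGES.items():
        if any(lo <= cp <= hi for lo, hi in ranges):
            return name
    return None


def script_profile(text):
    """Count characters per script, ignoring characters outside SCRIPT_RANGES."""
    counts = Counter()
    for ch in text:
        name = script_of(ch)
        if name is not None:
            counts[name] += 1
    return counts


def dominant_script(text, min_share=0.8):
    """Return the script covering at least min_share of counted characters.

    Returns None for empty or mixed-script text, which a cleaning step
    could treat as a candidate for manual review or removal.
    """
    counts = script_profile(text)
    total = sum(counts.values())
    if total == 0:
        return None
    name, count = counts.most_common(1)[0]
    return name if count / total >= min_share else None


if __name__ == "__main__":
    samples = {
        "tibetan sample": "\u0f56\u0f7c\u0f51\u0f0b\u0f61\u0f72\u0f42",
        "traditional mongolian sample": "\u182e\u1823\u1829\u182d\u1823\u182f",
        "cyrillic sample": "\u049b\u0430\u0437\u0430\u049b",
        "mixed tibetan and han": "\u0f56\u0f51\u4e2d\u6587",
    }
    for label, text in samples.items():
        print(label, "->", dominant_script(text))
```

The threshold is a design choice: a lower `min_share` tolerates more embedded foreign-script text (numbers and Latin characters are already ignored here), while a higher one is stricter about contamination.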
