MC$^2$: Towards Transparent and Culturally-Aware NLP for Minority Languages in China

Chen Zhang, Mingxu Tao, Quzhe Huang, Jiuheng Lin, Zhibin Chen, Yansong Feng
2024-06-13
Abstract: Current large language models demonstrate deficiencies in understanding low-resource languages, particularly the minority languages in China. This limitation stems from the scarcity of available pre-training data. To address this accessibility challenge, we present MC$^2$, a Multilingual Corpus of Minority Languages in China, which is the largest open-source corpus of its kind so far. MC$^2$ includes four underrepresented languages: Tibetan, Uyghur, Kazakh, and Mongolian. Notably, we focus on the less common writing systems of Kazakh and Mongolian, i.e., Kazakh Arabic script and traditional Mongolian script, respectively, which have been long neglected in previous corpus construction efforts. Recognizing the prevalence of language contamination within existing corpora, we adopt a quality-centric solution for collecting MC$^2$, prioritizing accuracy while enhancing diversity. Furthermore, we underscore the importance of attending to the multiplicity of writing systems, which is closely related to the cultural awareness of the resulting models. The MC$^2$ corpus and related models are made public to the community.
Computation and Language
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper aims to address the shortcomings of large language models in understanding low-resource languages, particularly the minority languages in China. Specifically, it focuses on the following points:

1. **Data Scarcity**: Current large language models rely primarily on data from high-resource languages during pre-training, leaving minority languages with a severe lack of data.
2. **Corpus Quality Issues**: Existing multilingual corpora have significant quality issues for low-resource languages, including language identification errors and insufficient data cleaning.
3. **Diversity of Writing Systems**: For languages written in more than one script (such as Kazakh and Mongolian), existing datasets usually cover only the more common writing system and neglect the less common ones.

To address these issues, the authors propose MC² (Multilingual Corpus of Minority Languages in China), which is the largest open-source corpus of minority languages to date. The corpus covers four minority languages: Tibetan, Uyghur, Kazakh, and Mongolian, with a particular focus on two less common writing systems, Kazakh Arabic script and traditional Mongolian script. The paper also explores the technical challenges and cultural differences associated with different writing systems, emphasizing the importance of collecting high-quality and diverse corpora.
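
To make the writing-system and language-contamination points above more concrete, here is a minimal Python sketch of a script-level check based on Unicode block ranges. It is an illustration only and not the authors' collection pipeline: the names `SCRIPT_RANGES`, `script_ratios`, and `dominant_script`, as well as the 0.8 threshold, are assumptions made for this example. Note that a script check alone cannot separate Uyghur from Kazakh written in Arabic script, since both fall in the same Unicode blocks; it can only flag text whose script does not match the expected one (for example, Cyrillic text in a corpus meant to hold traditional Mongolian script).

```python
from collections import Counter

# Unicode block ranges for the scripts discussed above (hypothetical helper table).
SCRIPT_RANGES = {
    "Tibetan": [(0x0F00, 0x0FFF)],
    "Mongolian": [(0x1800, 0x18AF)],  # traditional Mongolian script
    "Arabic": [(0x0600, 0x06FF), (0x0750, 0x077F),
               (0xFB50, 0xFDFF), (0xFE70, 0xFEFF)],
    "Cyrillic": [(0x0400, 0x04FF)],
}


def script_of(ch):
    """Return the script name for a character, or None if it is not in the table."""
    cp = ord(ch)
    for name, ranges in SCRIPT_RANGES.items():
        if any(lo <= cp <= hi for lo, hi in ranges):
            return name
    return None


def script_ratios(text):
    """Return the share of each script among script-bearing characters."""
    counts = Counter()
    for ch in text:
        name = script_of(ch)
        if name is None and ch.isalpha():
            name = "Other"  # letters from scripts outside the table, e.g. Latin
        if name is not None:
            counts[name] += 1
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()} if total else {}


def dominant_script(text, threshold=0.8):
    """Return the script covering at least `threshold` of the text, else None."""
    ratios = script_ratios(text)
    if not ratios:
        return None
    name, ratio = max(ratios.items(), key=lambda item: item[1])
    return name if ratio >= threshold else None


if __name__ == "__main__":
    for sample in ["བོད་ཡིག", "ᠮᠣᠩᠭᠣᠯ", "قازاق", "Қазақ"]:
        print(sample, dominant_script(sample))
```

In a filtering setting, documents whose dominant script differs from the expected one, or that have no dominant script at all, could be set aside for closer inspection rather than discarded outright.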