Constructing and Expanding Low-Resource and Underrepresented Parallel Datasets for Indonesian Local Languages

Joanito Agili Lopo, Radius Tanone

2024-04-01

Abstract: In Indonesia, local languages play an integral role in the culture. However, the language resources available for them remain limited in the field of Natural Language Processing (NLP), which becomes problematic when building NLP models for these languages. To address this gap, we introduce Bhinneka Korpus, a multilingual parallel corpus featuring five Indonesian local languages. Our goal is to enhance access to and utilization of these resources, extending their reach within the country. We explain in detail the dataset collection process and the associated challenges. Additionally, because of data constraints, we experimented with a translation task using IBM Model 1. The results indicate that performance for each language is already promising for further development. Challenges such as lexical variation, smoothing effects, and cross-linguistic variability are discussed. We intend to evaluate the corpus using advanced NLP techniques for low-resource languages, paving the way for multilingual translation models.

Computation and Language

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper aims to address the issue of resource scarcity for local languages in Indonesia within the field of Natural Language Processing (NLP). Specifically:

1. **Data Scarcity**: Indonesia has a rich linguistic diversity, but the available language resources are still limited, especially in the field of NLP. This makes it difficult to build effective NLP models for these languages.
2. **Construction of Multilingual Parallel Corpora**: To address this shortfall, the authors introduce the "Bhinneka Korpus," a multilingual parallel corpus that includes five local languages of Indonesia.
3. **Research on Low-Resource Languages**: The paper particularly focuses on the languages of Central and Eastern Indonesia and provides the first bilingual dictionary for the under-documented local language (Beaye) of West Kalimantan.
4. **Challenges in Data Collection**: It details the process of collecting datasets for low-resource languages and the associated challenges, such as lexical variation, smoothing effects, and cross-linguistic variability.

Through these efforts, the authors hope to enhance access to and utilization of these language resources, thereby promoting the development of multilingual translation models.
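The abstract reports a translation experiment with IBM Model 1. For readers unfamiliar with that model, the following is a minimal sketch of its EM training loop, not the authors' implementation. The function name `train_ibm_model1`, the `<null>` token, and the three Indonesian-English toy sentence pairs are illustrative assumptions and are not taken from Bhinneka Korpus.

```python
from collections import defaultdict

NULL = "<null>"  # hypothetical placeholder letting target words align to nothing


def train_ibm_model1(pairs, iterations=10):
    """Estimate lexical translation probabilities t(f | e) with EM.

    pairs: list of (source_tokens, target_tokens) sentence pairs.
    Returns a dict mapping (target_word, source_word) -> probability.
    """
    # Prepend the NULL token to every source sentence.
    pairs = [([NULL] + list(src), list(tgt)) for src, tgt in pairs]
    target_vocab = {f for _, tgt in pairs for f in tgt}

    # Start from a uniform distribution over the target vocabulary.
    uniform = 1.0 / len(target_vocab)
    t = defaultdict(lambda: uniform)

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (f, e) alignments
        total = defaultdict(float)  # expected count of e aligning to anything

        # E-step: distribute each target word over the source words.
        for src, tgt in pairs:
            for f in tgt:
                norm = sum(t[(f, e)] for e in src)
                for e in src:
                    delta = t[(f, e)] / norm
                    count[(f, e)] += delta
                    total[e] += delta

        # M-step: renormalize expected counts per source word.
        t = defaultdict(float, {fe: c / total[fe[1]] for fe, c in count.items()})

    return dict(t)


# Toy data for illustration only (assumed, not from the paper's corpus).
toy_pairs = [
    (["rumah", "itu"], ["that", "house"]),
    (["buku", "itu"], ["that", "book"]),
    (["sebuah", "buku"], ["a", "book"]),
]

probs = train_ibm_model1(toy_pairs)
best_for_buku = max(
    (f for (f, e) in probs if e == "buku"), key=lambda f: probs[(f, "buku")]
)
print("most likely translation of 'buku':", best_for_buku)
```

Because IBM Model 1 only learns word-level co-occurrence statistics and ignores word order, it can be trained on small parallel corpora, which is consistent with the data constraints the abstract cites as the reason for choosing it.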

NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages

NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages

One Country, 700+ Languages: NLP Challenges for Underrepresented Languages and Dialects in Indonesia

Replicable Benchmarking of Neural Machine Translation (NMT) on Low-Resource Local Languages in Indonesia

IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding

IndoNLG: Benchmark and Resources for Evaluating Indonesian Natural Language Generation

DriveThru: a Document Extraction Platform and Benchmark Datasets for Indonesian Local Language Archives

Building Dialogue Understanding Models for Low-resource Language Indonesian from Scratch

IndoRobusta: Towards Robustness Against Diverse Code-Mixed Indonesian Local Languages

Location-based Twitter Filtering for the Creation of Low-Resource Language Datasets in Indonesian Local Languages

NusaCrowd: A Call for Open and Reproducible NLP Research in Indonesian Languages

NusaBERT: Teaching IndoBERT to be Multilingual and Multicultural

COPAL-ID: Indonesian Language Reasoning with Local Culture and Nuances

IndoLEM and IndoBERT: A Benchmark Dataset and Pre-trained Language Model for Indonesian NLP

IndoNLI: A Natural Language Inference Dataset for Indonesian

IndoToD: A Multi-Domain Indonesian Benchmark For End-to-End Task-Oriented Dialogue Systems

Komodo: A Linguistic Expedition into Indonesia's Regional Languages

First Attempt at Building Parallel Corpora for Machine Translation of Northeast India's Very Low-Resource Languages

Utilizing Weak Supervision To Generate Indonesian Conservation Dataset

Cross-Lingual Transfer for Distantly Supervised and Low-resources Indonesian NER