MURI: High-Quality Instruction Tuning Datasets for Low-Resource Languages via Reverse Instructions

Abdullatif Köksal, Marion Thaler, Ayyoob Imani, Ahmet Üstün, Anna Korhonen, Hinrich Schütze
2024-09-20
Abstract: Instruction tuning enhances large language models (LLMs) by aligning them with human preferences across diverse tasks. Traditional approaches to create instruction tuning datasets face serious challenges for low-resource languages due to their dependence on data annotation. This work introduces a novel method, Multilingual Reverse Instructions (MURI), which generates high-quality instruction tuning datasets for low-resource languages without requiring human annotators or pre-existing multilingual models. Utilizing reverse instructions and a translation pipeline, MURI produces instruction-output pairs from existing human-written texts in low-resource languages. This method ensures cultural relevance and diversity by sourcing texts from different native domains and applying filters to eliminate inappropriate content. Our dataset, MURI-IT, includes more than 2 million instruction-output pairs across 200 languages. Evaluation by native speakers and fine-tuning experiments with mT5 models demonstrate the approach's effectiveness for both NLU and open-ended generation. We publicly release datasets and models at https://github.com/akoksal/muri.
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper addresses the difficulty of creating instruction-tuning datasets for low-resource languages. According to the authors, existing approaches face three challenges:

1. **High cost of manual annotation**: For low-resource languages, finding enough native-speaker annotators is difficult and costly.
2. **Limitations of templated tasks**: Datasets built by templating existing task data are usually restricted to specific structures and domains, generalize poorly, and depend on annotated task data that low-resource languages often lack.
3. **Limitations of synthetic data generation**: Existing models support only a limited number of languages, and the data they generate may be inauthentic and lack creativity.

To address these problems, the authors propose **Multilingual Reverse Instructions (MURI)**. MURI generates high-quality instruction-tuning datasets without human annotators, annotated task data, or pre-existing multilingual models. Using reverse instructions and a translation pipeline, it derives instruction-output pairs from existing human-written texts; sourcing texts from different native domains and filtering out inappropriate content is intended to ensure cultural relevance and diversity. The resulting dataset, MURI-IT, contains more than 2 million instruction-output pairs across 200 languages. The authors validate the approach through evaluation by native speakers and fine-tuning experiments with mT5 models.
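The reverse-instruction idea summarized above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the summary does not spell out the individual steps, so the ordering used here (filter the text, translate it to English, generate an instruction for it, translate the instruction back) is an assumption, and every name in the block (`InstructionPair`, `reverse_instruction_pair`, `toy_translate`, `toy_generate_instruction`) is hypothetical. The translator and instruction generator are passed in as plain callables, and toy stand-ins are used so the example runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class InstructionPair:
    instruction: str  # instruction in the source language
    output: str       # the original human-written text, left unchanged
    language: str     # language code of the source text


def reverse_instruction_pair(
    text: str,
    language: str,
    translate: Callable[[str, str, str], str],
    generate_instruction: Callable[[str], str],
    is_appropriate: Callable[[str], bool] = lambda _text: True,
) -> Optional[InstructionPair]:
    """Build one instruction-output pair from an existing text.

    Assumed pipeline (hypothetical, for illustration only):
      1. drop texts rejected by the content filter,
      2. translate the text into English,
      3. ask a generator for an instruction that the text would answer,
      4. translate that instruction back into the source language.
    The original text itself is kept as the output.
    """
    if not is_appropriate(text):
        return None
    english_text = translate(text, language, "en")
    english_instruction = generate_instruction(english_text)
    instruction = translate(english_instruction, "en", language)
    return InstructionPair(instruction=instruction, output=text, language=language)


# Toy stand-ins so the sketch runs without a translation system or an LLM.
def toy_translate(text: str, src: str, tgt: str) -> str:
    # Marks the translation direction instead of actually translating.
    return f"[{src}->{tgt}] {text}"


def toy_generate_instruction(english_text: str) -> str:
    # A real system would prompt a language model here.
    return f"Write a text like this one: {english_text}"


if __name__ == "__main__":
    pair = reverse_instruction_pair(
        "Merhaba dünya", "tr", toy_translate, toy_generate_instruction
    )
    print(pair)
```

The design point this sketch tries to capture is that the output side of each pair is never machine-generated: only the instruction passes through translation and generation, while the human-written text in the low-resource language is kept as-is.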