MURI: High-Quality Instruction Tuning Datasets for Low-Resource Languages via Reverse Instructions

Abdullatif Köksal, Marion Thaler, Ayyoob Imani, Ahmet Üstün, Anna Korhonen, Hinrich Schütze

2024-09-20

Abstract: Instruction tuning enhances large language models (LLMs) by aligning them with human preferences across diverse tasks. Traditional approaches to create instruction tuning datasets face serious challenges for low-resource languages due to their dependence on data annotation. This work introduces a novel method, Multilingual Reverse Instructions (MURI), which generates high-quality instruction tuning datasets for low-resource languages without requiring human annotators or pre-existing multilingual models. Utilizing reverse instructions and a translation pipeline, MURI produces instruction-output pairs from existing human-written texts in low-resource languages. This method ensures cultural relevance and diversity by sourcing texts from different native domains and applying filters to eliminate inappropriate content. Our dataset, MURI-IT, includes more than 2 million instruction-output pairs across 200 languages. Evaluation by native speakers and fine-tuning experiments with mT5 models demonstrate the approach's effectiveness for both NLU and open-ended generation. We publicly release datasets and models at https://github.com/akoksal/muri.

Computation and Language, Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

The paper addresses the difficulty of creating instruction-tuning datasets for low-resource languages. Existing approaches face three challenges:

1. **High cost of manual annotation**: Finding enough native-speaker annotators for low-resource languages is difficult and expensive.
2. **Limitations of templated tasks**: Datasets built by templating existing tasks are usually restricted to specific structures and domains, generalize poorly, and depend on task-annotated data that low-resource languages often lack.
3. **Limited synthetic data generation**: Existing models support only a limited number of languages, and the data they generate may lack authenticity and creativity.

To solve these problems, the authors propose **Multilingual Reverse Instructions (MURI)**. MURI generates high-quality instruction-tuning datasets without human annotators, task-annotated data, or pre-existing multilingual models. Using a translation pipeline and reverse instruction generation, it derives instruction-output pairs from existing human-written texts. Cultural relevance and diversity come from sourcing texts from different native domains and filtering out inappropriate content. The resulting dataset, MURI-IT, contains more than 2 million instruction-output pairs across 200 languages, and its effectiveness is validated through native-speaker evaluation and fine-tuning experiments with mT5 models.
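The idea of pairing reverse instructions with a translation pipeline can be illustrated with a minimal Python sketch. It is not the authors' implementation: the step order (translate the text to English, write an instruction for it, translate the instruction back) is an assumption about how such a pipeline might be arranged, and the names `reverse_instruction_pairs`, `to_english`, `from_english`, `write_instruction`, and `keep` are hypothetical placeholders for whatever translation system, instruction-writing model, and content filter one would plug in.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class InstructionPair:
    language: str     # language code of the source text
    instruction: str  # generated instruction, in the source language
    output: str       # the original human-written text, left unchanged


def reverse_instruction_pairs(
    documents: Iterable[str],
    language: str,
    to_english: Callable[[str], str],         # hypothetical: source language -> English
    from_english: Callable[[str], str],       # hypothetical: English -> source language
    write_instruction: Callable[[str], str],  # hypothetical: English text -> instruction
    keep: Callable[[str], bool] = lambda text: True,  # hypothetical content filter
) -> List[InstructionPair]:
    """Build instruction-output pairs whose outputs are existing texts."""
    pairs = []
    for doc in documents:
        # Drop texts the filter rejects (e.g. empty or inappropriate content).
        if not keep(doc):
            continue
        # Assumed step order: translate, write the instruction, translate back.
        english_doc = to_english(doc)
        english_instruction = write_instruction(english_doc)
        instruction = from_english(english_instruction)
        # The output is the untouched original, so it stays human-written.
        pairs.append(InstructionPair(language, instruction, doc))
    return pairs
```

Passing the translator, the instruction writer, and the filter as callables keeps the sketch independent of any particular model or API. The point it illustrates is that only the instruction is machine-generated, while every output remains text originally written in the target language.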

