X-Instruction: Aligning Language Model in Low-resource Languages with Self-curated Cross-lingual Instructions

Chong Li, Wen Yang, Jiajun Zhang, Jinliang Lu, Shaonan Wang, Chengqing Zong

2024-05-30

Abstract: Large language models respond well in high-resource languages like English but struggle in low-resource languages. This may arise from the lack of high-quality instruction-following data in these languages. Directly translating English samples into these languages is a possible solution but an unreliable one, leading to responses that contain translation errors and lack language-specific or cultural knowledge. To address this issue, we propose a novel method to construct cross-lingual instruction-following samples with the instruction in English and the response in a low-resource language. Specifically, the language model first learns to generate appropriate English instructions for natural web texts in other languages, which serve as the responses. The candidate cross-lingual instruction tuning samples are then further refined and diversified. We have employed this method to build a large-scale cross-lingual instruction tuning dataset covering 10 languages, namely X-Instruction. The instruction data built using our method incorporate more language-specific knowledge than data built with the naive translation method. Experimental results show that the response quality of the model tuned on X-Instruction greatly exceeds that of the model distilled from a powerful teacher model, reaching or even surpassing that of ChatGPT. In addition, we find that models tuned on cross-lingual instruction-following samples can follow instructions given in the output language without further tuning.

Computation and Language, Artificial Intelligence

What problem does this paper attempt to address?

This paper addresses the difficulty large language models have in generating high-quality responses in low-resource languages. Specifically, it identifies three problems:

1. **Lack of high-quality instruction-following data**: Because high-quality instruction-following data is scarce in low-resource languages, models perform worse in these languages than in high-resource languages such as English.
2. **Unreliability of direct translation**: Directly translating English samples into low-resource languages is unreliable. It easily introduces translation errors and ignores language-specific and cultural knowledge.
3. **Insufficient generation performance**: Models generate poorly in low-resource languages, especially when producing complex or domain-specific text.

To address these challenges, the paper proposes a new method for constructing cross-lingual instruction-following samples that pair English instructions with responses in low-resource languages. This method improves the model's response quality in low-resource languages and helps it better understand and generate text in these languages.
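To make the sample format concrete, here is a minimal Python sketch of a cross-lingual sample (English instruction, low-resource-language response) and of the two steps the abstract names: generating an instruction for each web text, then filtering the candidates. Everything in it is an illustrative assumption rather than the authors' implementation: the names `CrossLingualSample`, `build_candidates`, and `refine` are hypothetical, `toy_generator` stands in for the language model that writes instructions, `toy_score` stands in for whatever quality signal the refinement step uses, and the diversification step is not modelled. The Swahili strings are only example inputs.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class CrossLingualSample:
    """One tuning sample: English instruction, non-English response."""
    instruction: str    # written in English
    response: str       # natural web text in the low-resource language
    response_lang: str  # language code of the response, e.g. "sw"


def build_candidates(
    web_texts: Iterable[str],
    response_lang: str,
    generate_instruction: Callable[[str], str],
) -> List[CrossLingualSample]:
    """Treat each web text as a response and generate an instruction for it."""
    samples = []
    for text in web_texts:
        if not text.strip():
            continue  # skip empty documents
        samples.append(
            CrossLingualSample(
                instruction=generate_instruction(text),
                response=text,
                response_lang=response_lang,
            )
        )
    return samples


def refine(
    samples: Iterable[CrossLingualSample],
    score: Callable[[CrossLingualSample], float],
    threshold: float,
) -> List[CrossLingualSample]:
    """Keep only candidates whose score reaches the threshold."""
    return [s for s in samples if score(s) >= threshold]


# Hypothetical stand-ins; a real system would call a language model here.
def toy_generator(text: str) -> str:
    return "Write a short text about: " + text.split()[0]


def toy_score(sample: CrossLingualSample) -> float:
    return float(len(sample.response.split()))


texts = ["Habari ya asubuhi kutoka Nairobi", "Sawa", "   "]
candidates = build_candidates(texts, "sw", toy_generator)
kept = refine(candidates, toy_score, threshold=3.0)
```

Passing the generator and the scorer as callables keeps the sketch independent of any particular model, so the placeholders can be swapped for real components without changing the data structure.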

xCoT: Cross-lingual Instruction Tuning for Cross-lingual Chain-of-Thought Reasoning

InstructAlign: High-and-Low Resource Language Alignment via Continual Crosslingual Instruction Tuning

CrossIn: An Efficient Instruction Tuning Approach for Cross-Lingual Knowledge Alignment

From Language Modeling to Instruction Following: Understanding the Behavior Shift in LLMs after Instruction Tuning

Investigating Multilingual Instruction-Tuning: Do Polyglot Models Demand for Multilingual Instructions?

BayLing: Bridging Cross-lingual Alignment and Instruction Following through Interactive Translation for Large Language Models

Linguistically-Informed Multilingual Instruction Tuning: Is There an Optimal Set of Languages to Tune?

Is It Good Data for Multilingual Instruction Tuning or Just Bad Multilingual Evaluation for Large Language Models?

Instruction Tuning for Large Language Models: A Survey

Multilingual Instruction Tuning With Just a Pinch of Multilinguality

How Many Languages Make Good Multilingual Instruction Tuning? A Case Study on BLOOM

Demystifying Instruction Mixing for Fine-tuning Large Language Models

Eliciting the Translation Ability of Large Language Models via Multilingual Finetuning with Translation Instructions

Deep Exploration of Cross-Lingual Zero-Shot Generalization in Instruction Tuning

Zero-shot cross-lingual transfer in instruction tuning of large language models

Instruction Mining: Instruction Data Selection for Tuning Large Language Models

Non-instructional Fine-tuning: Enabling Instruction-Following Capabilities in Pre-trained Language Models without Instruction-Following Data

INTERS: Unlocking the Power of Large Language Models in Search with Instruction Tuning

Empowering Cross-lingual Abilities of Instruction-tuned Large Language Models by Translation-following demonstrations