X-Instruction: Aligning Language Model in Low-resource Languages with Self-curated Cross-lingual Instructions

Chong Li, Wen Yang, Jiajun Zhang, Jinliang Lu, Shaonan Wang, Chengqing Zong
2024-05-30
Abstract: Large language models respond well in high-resource languages like English but struggle in low-resource languages. This may stem from the lack of high-quality instruction-following data in these languages. Directly translating English samples into these languages is one option, but it is unreliable: it leads to responses that contain translation errors and lack language-specific or cultural knowledge. To address this issue, we propose a novel method to construct cross-lingual instruction-following samples with the instruction in English and the response in a low-resource language. Specifically, the language model first learns to generate appropriate English instructions for natural web texts in other languages, which serve as the responses. The candidate cross-lingual instruction tuning samples are then further refined and diversified. We have employed this method to build a large-scale cross-lingual instruction tuning dataset covering 10 languages, named X-Instruction. The instruction data built using our method incorporate more language-specific knowledge than data produced by naive translation. Experimental results show that the response quality of the model tuned on X-Instruction greatly exceeds that of the model distilled from a powerful teacher model, reaching or even surpassing that of ChatGPT. In addition, we find that models tuned on cross-lingual instruction-following samples can follow instructions given in the output language without further tuning.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the difficulty large language models have in generating high-quality responses in low-resource languages. Specifically, the problems include:

1. **Lack of high-quality instruction-following data**: Low-resource languages lack high-quality instruction-following data, so models perform worse in them than in high-resource languages such as English.
2. **Unreliability of direct translation**: Directly translating English samples into low-resource languages is unreliable. It easily introduces translation errors and ignores language-specific and cultural knowledge.
3. **Insufficient generation performance**: Generation quality in low-resource languages is poor, especially for complex or domain-specific texts.

To address these challenges, the paper proposes a new method to construct cross-lingual instruction-following samples, each pairing an English instruction with a response in a low-resource language. This improves the model's response quality in low-resource languages and helps it better understand and generate text in them.
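
The shape of such a sample, and the generate-then-refine flow the abstract describes, can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: `CrossLingualSample`, `build_candidates`, `refine`, the toy generator, the toy scorer, and the threshold are hypothetical and are not taken from the paper, whose instruction generator is a trained language model and whose refinement and diversification steps are not detailed in this summary.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class CrossLingualSample:
    instruction: str  # English instruction
    response: str     # natural web text in the low-resource language
    language: str     # language code of the response


def build_candidates(
    texts: Iterable[str],
    language: str,
    generate_instruction: Callable[[str], str],
) -> List[CrossLingualSample]:
    """Pair each non-empty web text with an English instruction produced for it."""
    samples = []
    for text in texts:
        text = text.strip()
        if not text:
            continue  # skip empty web text
        samples.append(CrossLingualSample(generate_instruction(text), text, language))
    return samples


def refine(
    samples: Iterable[CrossLingualSample],
    score: Callable[[CrossLingualSample], float],
    threshold: float,
) -> List[CrossLingualSample]:
    """Keep samples scoring at or above the threshold; drop repeated instructions."""
    seen = set()
    kept = []
    for sample in samples:
        if score(sample) < threshold:
            continue
        key = sample.instruction.lower()  # crude duplicate check (assumption)
        if key in seen:
            continue
        seen.add(key)
        kept.append(sample)
    return kept


# Hypothetical stand-ins for the learned components.
def toy_generator(text: str) -> str:
    return "Write a passage that begins with: " + text.split()[0]


def toy_score(sample: CrossLingualSample) -> float:
    return float(len(sample.response.split()))  # longer text scores higher


# Illustrative strings only; not data from the paper.
web_texts = [
    "Habari ya leo ni njema sana",
    "  ",
    "Habari fupi",
    "Karibu Tanzania nchi nzuri",
]
candidates = build_candidates(web_texts, "sw", toy_generator)
refined = refine(candidates, toy_score, threshold=3.0)
```

Passing the generator and scorer in as callables keeps the data flow visible while leaving the model-dependent parts open: in a real pipeline they would be replaced by a trained instruction generator and a learned quality filter.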