xCoT: Cross-lingual Instruction Tuning for Cross-lingual Chain-of-Thought Reasoning

Linzheng Chai, Jian Yang, Tao Sun, Hongcheng Guo, Jiaheng Liu, Bing Wang, Xiannian Liang, Jiaqi Bai, Tongliang Li, Qiyao Peng, Zhoujun Li
2024-01-13
Abstract: Chain-of-thought (CoT) has emerged as a powerful technique to elicit reasoning in large language models and improve a variety of downstream tasks. CoT mainly demonstrates excellent performance in English, but its usage in low-resource languages is constrained due to poor language generalization. To bridge the gap among different languages, we propose a cross-lingual instruction fine-tuning framework (xCoT) to transfer knowledge from high-resource languages to low-resource languages. Specifically, the multilingual instruction training data (xCoT-INSTRUCT) is created to encourage the semantic alignment of multiple languages. We introduce cross-lingual in-context few-shot learning (xICL) to accelerate multilingual agreement in instruction tuning, where some fragments of source languages in examples are randomly substituted by their counterpart translations of target languages. During multilingual instruction tuning, we adopt the random online CoT strategy to enhance the multilingual reasoning ability of the large language model by first translating the query to another language and then answering in English. To further facilitate the language transfer, we leverage the high-resource CoT to supervise the training of low-resource languages with cross-lingual distillation. Experimental results on previous benchmarks demonstrate the superior performance of xCoT in reducing the gap among different languages, highlighting its potential to reduce the cross-lingual gap.
Computation and Language, Artificial Intelligence
### What problem does this paper attempt to solve?

This paper addresses the performance gap between languages in cross-lingual reasoning. Chain-of-Thought (CoT) prompting works well in large language models (LLMs), especially on complex reasoning tasks, but it has mostly been applied to high-resource languages such as English and performs poorly in low-resource languages. To narrow this gap, the authors propose a cross-lingual instruction fine-tuning framework (xCoT) that transfers knowledge from high-resource languages to low-resource languages.

The main contributions of the paper are:

1. **Constructing multilingual instruction data**: English instruction data is translated into 10 other languages (such as German, French, and Spanish) to create a new multilingual instruction dataset (xCoT-INSTRUCT) for training cross-lingual chain-of-thought reasoning.
2. **Random-CoT strategy**: During fine-tuning, the query is first translated into a randomly chosen language and then answered in English, which strengthens the multilingual reasoning ability of the LLM.
3. **Cross-lingual distillation**: The high-quality reasoning paths of high-resource languages supervise the training of low-resource languages, further improving low-resource performance.
4. **Code-switched learning**: Fragments of different languages are mixed within the in-context examples, which encourages the model to understand and align the representations of different languages.

### Formula presentation

The paper uses the following formulas:

- **Probability model of cross-lingual CoT**:
  \[ P(a \mid q, c) = \prod_{j=1}^{n} P(a_j \mid a_{<j}; q, c, M) \]
  where \(q\) is the question, \(c\) is the corresponding example, \(a\) is the answer, and \(M\) is the language model.
- **Loss function of cross-lingual instruction fine-tuning**:
  \[ L_x = -\sum_{i=1}^{K} \mathbb{E}_{c_{L_i}, q_{L_i}, a_{L_j} \sim D_{L_i}} \left[ \log P(a_{L_j} \mid q_{L_i}, c_{L_i}; M) \right] \]
- **Loss function of cross-lingual distillation**:
  \[ L_d = -\frac{1}{n} \sum_{t=1}^{n} \left[ P_t^{\text{high}} \log P_t^{\text{low}} \right] \]
  where \(P_t^{\text{high}}\) and \(P_t^{\text{low}}\) are the distributions of the high-resource and low-resource languages on the \(t\)-th token, respectively.

According to the paper's experiments, these methods improve multilingual reasoning performance, especially in low-resource languages, and thereby narrow the performance gap between languages.
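
To make the CoT probability model and the distillation loss concrete, here is a minimal Python sketch that evaluates both on toy numbers. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy distributions, and the small `eps` stabilizer are hypothetical, and the distillation loss is read as a per-token cross-entropy in which the product of the high-resource distribution and the log of the low-resource distribution is summed over the vocabulary.

```python
import math
from typing import Sequence


def sequence_log_prob(token_probs: Sequence[float]) -> float:
    """Return log P(a | q, c) as a sum of per-token log-probabilities.

    token_probs[j] stands for P(a_j | a_<j; q, c, M), the probability the
    model assigned to the j-th answer token. The values are illustrative.
    """
    return sum(math.log(p) for p in token_probs)


def distillation_loss(
    p_high: Sequence[Sequence[float]],
    p_low: Sequence[Sequence[float]],
    eps: float = 1e-12,
) -> float:
    """Average cross-entropy between per-token distributions.

    Assumption: at each position t, P_high * log P_low is summed over the
    vocabulary, and the result is averaged over the n token positions.
    eps is a hypothetical guard against log(0), not part of the formula.
    """
    if not p_high or len(p_high) != len(p_low):
        raise ValueError("need the same, non-zero number of token positions")
    total = 0.0
    for high_t, low_t in zip(p_high, p_low):
        if len(high_t) != len(low_t):
            raise ValueError("distributions must share one vocabulary size")
        total += sum(h * math.log(l + eps) for h, l in zip(high_t, low_t))
    return -total / len(p_high)


# Toy distributions over a two-word vocabulary at two token positions.
TEACHER = [[0.9, 0.1], [0.2, 0.8]]            # high-resource language
STUDENT_FAR = [[0.5, 0.5], [0.5, 0.5]]        # uninformed low-resource model
STUDENT_NEAR = [[0.85, 0.15], [0.25, 0.75]]   # closer to the teacher

if __name__ == "__main__":
    print("loss (far): ", distillation_loss(TEACHER, STUDENT_FAR))
    print("loss (near):", distillation_loss(TEACHER, STUDENT_NEAR))
    print("P(a | q, c):", math.exp(sequence_log_prob([0.5, 0.25])))
```

The loss shrinks as the low-resource distribution moves toward the high-resource one, which is the behavior the distillation objective relies on. In practice these distributions would come from the model's softmax outputs rather than hand-written lists.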