MoE-CT: A Novel Approach For Large Language Models Training With Resistance To Catastrophic Forgetting

Tianhao Li, Shangjie Li, Binbin Xie, Deyi Xiong, Baosong Yang
2024-06-25
Abstract: The advent of large language models (LLMs) has predominantly catered to high-resource languages, leaving a disparity in performance for low-resource languages. Conventional Continual Training (CT) approaches to bridge this gap often undermine a model's original linguistic proficiency when expanding to multilingual contexts. Addressing this issue, we introduce a novel MoE-CT architecture, a paradigm that innovatively separates the base model's learning from the multilingual expansion process. Our design freezes the original LLM parameters, thus safeguarding its performance in high-resource languages, while an appended MoE module, trained on diverse language datasets, augments low-resource language proficiency. Our approach significantly outperforms conventional CT methods, as evidenced by our experiments, which show marked improvements in multilingual benchmarks without sacrificing the model's original language performance. Moreover, our MoE-CT framework demonstrates enhanced resistance to forgetting and superior transfer learning capabilities. By preserving the base model's integrity and focusing on strategic parameter expansion, our methodology advances multilingual language modeling and represents a significant step forward for low-resource language inclusion in LLMs, indicating a fruitful direction for future research in language technologies.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the degradation of a large language model's (LLM's) original-language capabilities when the model is extended to multilingual settings. Existing continual training (CT) methods improve performance in low-resource languages but often weaken the model's performance in the high-resource languages it was originally trained on, a phenomenon known as "catastrophic forgetting". These methods also typically need a large amount of original-language data to mitigate forgetting, which raises training costs and limits the gains in multilingual capability. To address these issues, the authors propose a new architecture, MoE-CT (Mixture of Experts for Continual Training). It freezes the parameters of the original LLM and adds an MoE module dedicated to handling multilingual data, so that low-resource language performance improves without sacrificing original-language capability. According to the authors, this improves performance on multilingual tasks and also significantly strengthens the model's resistance to forgetting and its transfer learning capabilities.
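The freeze-and-append idea can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a single linear layer in place of the LLM, linear experts with a softmax gate, and a residual sum of the frozen output and the MoE output; the text above does not say how the two are combined. All names (`FrozenBase`, `MoEModule`, `train_step`) are hypothetical.

```python
import math
import random


def matvec(w, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def softmax(zs):
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


class FrozenBase:
    """Stand-in for the original LLM; weights are stored immutably."""

    def __init__(self, weights):
        self.weights = tuple(tuple(row) for row in weights)

    def forward(self, x):
        return matvec(self.weights, x)


class MoEModule:
    """Appended trainable module: softmax gate over linear experts."""

    def __init__(self, dim, num_experts, rng):
        self.gate = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                     for _ in range(num_experts)]
        # Assumption: zero-initialised experts, so the combined model
        # starts out identical to the frozen base.
        self.experts = [[[0.0] * dim for _ in range(dim)]
                        for _ in range(num_experts)]

    def forward(self, x):
        probs = softmax(matvec(self.gate, x))
        outs = [matvec(e, x) for e in self.experts]
        mixed = [sum(p * o[i] for p, o in zip(probs, outs))
                 for i in range(len(x))]
        return mixed, probs, outs


def predict(base, moe, x):
    mixed, _, _ = moe.forward(x)
    # Assumed combination rule: frozen output plus MoE output.
    return [b + m for b, m in zip(base.forward(x), mixed)]


def train_step(base, moe, x, target, lr=0.1):
    """One gradient step on squared error; only MoE parameters change."""
    mixed, probs, outs = moe.forward(x)
    y = [b + m for b, m in zip(base.forward(x), mixed)]
    err = [yi - ti for yi, ti in zip(y, target)]
    # Gradient of the loss w.r.t. each gate logit: p_k * (a_k - sum_j p_j a_j),
    # where a_k is the dot product of the error with expert k's output.
    a = [sum(e * o for e, o in zip(err, out)) for out in outs]
    a_bar = sum(p * ak for p, ak in zip(probs, a))
    for k, p in enumerate(probs):
        dz = p * (a[k] - a_bar)
        for j, xj in enumerate(x):
            moe.gate[k][j] -= lr * dz * xj
        for i, ei in enumerate(err):
            for j, xj in enumerate(x):
                moe.experts[k][i][j] -= lr * ei * p * xj
    return 0.5 * sum(e * e for e in err)


rng = random.Random(0)
base = FrozenBase([[1.0, 0.0], [0.0, 1.0]])
moe = MoEModule(dim=2, num_experts=2, rng=rng)
x, target = [1.0, 2.0], [2.0, 1.0]
initial_output = predict(base, moe, x)
losses = [train_step(base, moe, x, target) for _ in range(50)]
print(losses[0], losses[-1])
```

Because the update loop touches only `moe.gate` and `moe.experts`, the base weights after training are exactly what they were before. That is the property the frozen-parameter design relies on to protect original-language performance.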