MoE-CT: A Novel Approach For Large Language Models Training With Resistance To Catastrophic Forgetting

Tianhao Li, Shangjie Li, Binbin Xie, Deyi Xiong, Baosong Yang

2024-06-25

Abstract: The advent of large language models (LLMs) has predominantly catered to high-resource languages, leaving a disparity in performance for low-resource languages. Conventional Continual Training (CT) approaches to bridge this gap often undermine a model's original linguistic proficiency when expanding to multilingual contexts. Addressing this issue, we introduce a novel MoE-CT architecture, a paradigm that innovatively separates the base model's learning from the multilingual expansion process. Our design freezes the original LLM parameters, thus safeguarding its performance in high-resource languages, while an appended MoE module, trained on diverse language datasets, augments low-resource language proficiency. Our approach significantly outperforms conventional CT methods, as evidenced by our experiments, which show marked improvements in multilingual benchmarks without sacrificing the model's original language performance. Moreover, our MoE-CT framework demonstrates enhanced resistance to forgetting and superior transfer learning capabilities. By preserving the base model's integrity and focusing on strategic parameter expansion, our methodology advances multilingual language modeling and represents a significant step forward for low-resource language inclusion in LLMs, indicating a fruitful direction for future research in language technologies.

Computation and Language, Artificial Intelligence

What problem does this paper attempt to address?

This paper addresses the degradation of a large language model's (LLM's) original language capabilities when the model is extended to multilingual settings. Existing continual training (CT) methods improve performance on low-resource languages but often weaken the model's performance on high-resource languages, a phenomenon known as "catastrophic forgetting". These methods also typically require a large amount of original-language data to mitigate forgetting, which increases training costs and limits the gains in multilingual capability. To address these issues, the authors propose a new architecture, MoE-CT (Mixture of Experts for Continual Training). It freezes the parameters of the original LLM and adds an MoE module dedicated to handling multilingual data, improving performance on low-resource languages without sacrificing the original language capabilities. According to the authors, this approach improves performance on multilingual tasks while also strengthening the model's resistance to forgetting and its transfer learning capabilities.
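The freeze-and-extend idea can be illustrated with a small NumPy sketch. It is a toy under stated assumptions, not the authors' implementation. The summary does not say how the MoE output is combined with the frozen model's output, so summing the two is an assumption. The names `FrozenFFN`, `MoEBranch`, and `train_step` are hypothetical, and the router is left untrained for brevity so that only the expert weights receive gradient updates.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N_EXPERTS, BATCH = 8, 4, 32


class FrozenFFN:
    """Stand-in for a pretrained block whose weights are never updated."""

    def __init__(self, dim):
        self.w = rng.normal(scale=0.1, size=(dim, dim))
        self.w.setflags(write=False)  # guard against accidental in-place updates

    def __call__(self, x):
        return np.tanh(x @ self.w)


class MoEBranch:
    """Added linear experts mixed by a softmax router (hypothetical layout)."""

    def __init__(self, dim, n_experts):
        self.experts = rng.normal(scale=0.1, size=(n_experts, dim, dim))
        self.router = rng.normal(scale=0.1, size=(dim, n_experts))

    def gates(self, x):
        logits = x @ self.router
        logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
        weights = np.exp(logits)
        return weights / weights.sum(axis=-1, keepdims=True)

    def __call__(self, x):
        expert_out = np.einsum("bd,edk->bek", x, self.experts)
        return np.einsum("be,bek->bk", self.gates(x), expert_out)


def train_step(base, moe, x, y, lr=0.05):
    """One gradient step on the expert weights only; the base is untouched."""
    err = base(x) + moe(x) - y  # assumption: outputs are combined by summation
    loss = 0.5 * np.mean(np.sum(err ** 2, axis=1))
    # Gradient of the loss w.r.t. each expert matrix, with the gates held fixed.
    grad_experts = np.einsum("bd,be,bk->edk", x, moe.gates(x), err) / len(x)
    moe.experts -= lr * grad_experts
    return loss


base = FrozenFFN(DIM)
moe = MoEBranch(DIM, N_EXPERTS)
base_before = base.w.copy()
experts_before = moe.experts.copy()

# Random data standing in for new-language training examples.
x = rng.normal(size=(BATCH, DIM))
y = rng.normal(size=(BATCH, DIM))
losses = [train_step(base, moe, x, y) for _ in range(50)]

print("base weights unchanged:", np.array_equal(base.w, base_before))
print("loss went down:", losses[-1] < losses[0])
```

Because the base weights are never written to, anything the frozen block computed before training is computed identically afterwards. The summary attributes the method's resistance to forgetting to this property; all new capacity lives in the added experts.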

MoE-LPR: Multilingual Extension of Large Language Models through Mixture-of-Experts with Language Priors Routing

Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model

LoRAMoE: Alleviating World Knowledge Forgetting in Large Language Models via MoE-Style Plugin

Augmenting Language Models with Long-Term Memory

Large Language Model Can Continue Evolving From Mistakes

CAMELoT: Towards Large Language Models with Training-Free Consolidated Associative Memory

Octavius: Mitigating Task Interference in MLLMs via LoRA-MoE

Breaking Language Barriers: Cross-Lingual Continual Pre-Training at Scale

AquilaMoE: Efficient Training for MoE Models with Scale-Up and Scale-Out Strategies

Large Language Models aren't all that you need

Investigating the Catastrophic Forgetting in Multimodal Large Language Models

CMT: A Memory Compression Method for Continual Knowledge Learning of Large Language Models

Unlocking Emergent Modularity in Large Language Models

Cross-model Control: Improving Multiple Large Language Models in One-time Training

MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models

Lego-MT: Learning Detachable Models for Massively Multilingual Machine Translation

Llama 3 Meets MoE: Efficient Upcycling

MMNMT: Modularizing Multilingual Neural Machine Translation with Flexibly Assembled MoE and Dense Blocks