MoE-LPR: Multilingual Extension of Large Language Models through Mixture-of-Experts with Language Priors Routing

Hao Zhou, Zhijun Wang, Shujian Huang, Xin Huang, Xue Han, Junlan Feng, Chao Deng, Weihua Luo, Jiajun Chen
2024-08-21
Abstract: Large Language Models (LLMs) are often English-centric due to the disproportionate distribution of languages in their pre-training data. Enhancing non-English language capabilities through post-pretraining often results in catastrophic forgetting of the ability of original languages. Previous methods either achieve good expansion with severe forgetting or slight forgetting with poor expansion, indicating the challenge of balancing language expansion while preventing forgetting. In this paper, we propose a method called MoE-LPR (Mixture-of-Experts with Language Priors Routing) to alleviate this problem. MoE-LPR employs a two-stage training approach to enhance the multilingual capability. First, the model is post-pretrained into a Mixture-of-Experts (MoE) architecture by upcycling, where all the original parameters are frozen and new experts are added. In this stage, we focus on improving the ability on expanded languages, without using any original language data. Then, the model reviews the knowledge of the original languages with replay data amounting to less than 1% of post-pretraining, where we incorporate language priors routing to better recover the abilities of the original languages. Evaluations on multiple benchmarks show that MoE-LPR outperforms other post-pretraining methods. Freezing original parameters preserves original language knowledge while adding new experts preserves the learning ability. Reviewing with LPR enables effective utilization of multilingual knowledge within the parameters. Additionally, the MoE architecture maintains the same inference overhead while increasing total model parameters. Extensive experiments demonstrate MoE-LPR's effectiveness in improving expanded languages and preserving original language proficiency with superior scalability. Code and scripts are freely available at https://github.com/zjwang21/MoE-LPR.git.
Computation and Language
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper addresses catastrophic forgetting in large language models (LLMs) during multilingual expansion. Current LLMs are often English-centric because English makes up a high proportion of their pre-training data. When non-English capabilities are enhanced through post-pretraining, the models tend to forget the capabilities of the original languages. Existing methods either expand to new languages well but severely forget the original languages, or prevent forgetting but expand poorly. Balancing language expansion against forgetting is therefore a significant challenge.

To tackle this challenge, the authors propose a method called MoE-LPR (Mixture-of-Experts with Language Priors Routing). MoE-LPR adopts a two-stage training strategy. First, the model is converted to a Mixture-of-Experts (MoE) architecture by upcycling: new expert modules are added while the original parameters are frozen. This stage focuses on improving the capabilities of the expanded languages and uses no data from the original languages. In the second stage, the model reviews the knowledge of the original languages with a small amount of replay data (less than 1% of the post-pretraining data volume) to better restore the original language capabilities.

Experimental results show that MoE-LPR outperforms other post-pretraining methods on multiple benchmarks, improving the performance of expanded languages while retaining the capabilities of the original languages.

### Main Contributions

1. **Two-Stage Training Strategy**: MoE-LPR adopts a two-stage training strategy, with a particular focus on balancing the capabilities of newly expanded languages and the original languages.
2. **Language Priors Routing Mechanism**: MoE-LPR introduces the LPR mechanism, which alleviates catastrophic forgetting of the original languages using a small amount of replay data (less than 1% of the post-pretraining data volume). The LPR mechanism also generalizes well to untrained languages.
3. **Scalability**: MoE-LPR is designed so that the number of model parameters can be increased without increasing inference overhead or the risk of catastrophic forgetting, making it a cost-effective and stable solution for multilingual NLP tasks.

### Method Overview

1. **Post-Pretraining Stage**:
   - Upcycle the dense model to an MoE architecture, training the newly added expert modules with a large amount of monolingual data while freezing the original parameters.
   - Use a load balancing loss to unleash the model's learning potential and maintain training stability.
2. **Review Stage**:
   - Train the router with a small amount of monolingual data to better utilize the expert modules.
   - Design LPR training to restore the model's capabilities in the original languages using a small amount of replay data.

### Experimental Results

Experimental results show that MoE-LPR performs well in both expanded and original languages. Compared to baseline methods, MoE-LPR improves the performance of expanded languages while retaining the capabilities of the original languages. In particular, with replay data amounting to less than 1% of the post-pretraining data, MoE-LPR can recover up to approximately 96.6% of the original language performance.

### Conclusion

MoE-LPR addresses catastrophic forgetting during multilingual expansion through a two-stage training strategy and the LPR mechanism, providing new insights for developing more robust, multilingual, general-purpose LLMs. Its results on multiple benchmarks demonstrate its effectiveness and scalability in multilingual NLP tasks.
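
The routing ideas in the Method Overview can be illustrated with a small, dependency-free Python sketch. It is a toy under stated assumptions, not the authors' implementation: experts are plain callables on scalars, expert 0 stands in for the frozen original feed-forward block, routing is top-k with renormalized weights, the load balancing term follows the common Switch-Transformer-style form, and `lpr_loss` encodes one plausible reading of "language priors" (original-language tokens should be routed to the original expert). All function and parameter names here are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(probs, k=2):
    """Return (expert_index, weight) pairs for the k most probable experts.

    Weights are renormalized to sum to 1 (an assumption of this sketch).
    """
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def moe_forward(x, experts, router_logits, k=2):
    """Combine the outputs of the k selected experts for one token.

    experts[0] stands in for the frozen original block; the remaining
    entries stand in for the newly added, trainable experts.
    """
    probs = softmax(router_logits)
    return sum(w * experts[i](x) for i, w in top_k_route(probs, k))


def load_balancing_loss(all_probs, k=2):
    """Switch-Transformer-style auxiliary loss: n_experts * sum_e f_e * p_e.

    f_e is the fraction of routing slots assigned to expert e and p_e is
    the mean router probability of expert e over the batch.
    """
    n_tokens = len(all_probs)
    n_experts = len(all_probs[0])
    frac = [0.0] * n_experts
    mean_prob = [0.0] * n_experts
    for probs in all_probs:
        for i, _ in top_k_route(probs, k):
            frac[i] += 1.0 / (n_tokens * k)
        for e, p in enumerate(probs):
            mean_prob[e] += p / n_tokens
    return n_experts * sum(f * p for f, p in zip(frac, mean_prob))


def lpr_loss(all_probs, is_original, original_expert=0):
    """Assumed LPR-style loss for the review stage.

    Averages -log(router probability of the original expert) over tokens
    flagged as original-language; other tokens contribute nothing.
    """
    picked = [p[original_expert] for p, flag in zip(all_probs, is_original) if flag]
    if not picked:
        return 0.0
    return -sum(math.log(p) for p in picked) / len(picked)
```

In this sketch, the post-pretraining stage would minimize the language-modeling loss plus `load_balancing_loss` while updating only the new experts and the router. The review stage would update only the router, adding `lpr_loss` on the replay tokens. Keeping `k` fixed as experts are added is what holds the per-token compute constant in this toy setting.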