PERFT: Parameter-Efficient Routed Fine-Tuning for Mixture-of-Expert Model

Yilun Liu, Yunpu Ma, Shuo Chen, Zifeng Ding, Bailan He, Zhen Han, Volker Tresp
2024-11-13
Abstract: The Mixture-of-Experts (MoE) paradigm has emerged as a powerful approach for scaling transformers with improved resource utilization. However, efficiently fine-tuning MoE models remains largely underexplored. Inspired by recent works on Parameter-Efficient Fine-Tuning (PEFT), we present a unified framework for integrating PEFT modules directly into the MoE mechanism. Aligning with the core principles and architecture of MoE, our framework encompasses a set of design dimensions including various functional and composition strategies. By combining design choices within our framework, we introduce Parameter-Efficient Routed Fine-Tuning (PERFT) as a flexible and scalable family of PEFT strategies tailored for MoE models. Extensive experiments on adapting OLMoE-1B-7B and Mixtral-8$\times$7B for commonsense and arithmetic reasoning tasks demonstrate the effectiveness, scalability, and intriguing dynamics of PERFT. Additionally, we provide empirical findings for each specific design choice to facilitate better application of MoE and PEFT.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The paper asks how to fine-tune Mixture-of-Experts (MoE) models efficiently, so that they adapt well to downstream tasks without updating all of their parameters. It focuses on three aspects:

1. **High cost of fine-tuning MoE models**:
   - Sparse activation lets an MoE model keep a large capacity at a much lower computational cost, but the large number of expert parameters makes full fine-tuning very expensive.
   - Existing fine-tuning methods are usually designed for dense models and do not account for the routing mechanism among sparsely activated experts, so they cannot be applied directly to MoE models.
2. **PEFT techniques designed specifically for MoE**:
   - To address these challenges, the paper proposes Parameter-Efficient Routed Fine-Tuning (PERFT), a parameter-efficient fine-tuning framework designed specifically for MoE models.
   - PERFT aims to provide a flexible and scalable fine-tuning strategy by introducing independent or embedded PEFT modules that follow the core principles and architecture of MoE.
3. **Effectiveness of different design choices**:
   - Through extensive experiments, the paper evaluates several PERFT variants (PERFT-R, PERFT-E, PERFT-D, and PERFT-S) on commonsense reasoning and arithmetic reasoning tasks.
   - The results show that PERFT can significantly improve fine-tuning efficiency while remaining competitive in task performance, especially when only a small proportion of parameters is activated.
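To make the cost argument concrete, the sketch below counts weights for fully fine-tuning every expert versus training a handful of routed bottleneck adapters per layer. All sizes are hypothetical values chosen for illustration, not figures from the paper, and the three-matrix gated FFN expert is an assumed layout.

```python
def ffn_expert_params(d_model: int, d_ff: int) -> int:
    """Weights in one gated FFN expert: three d_model x d_ff matrices, no biases (assumed layout)."""
    return 3 * d_model * d_ff


def adapter_params(d_model: int, bottleneck: int) -> int:
    """Weights in one bottleneck adapter: a down-projection plus an up-projection, no biases."""
    return 2 * d_model * bottleneck


# Hypothetical sizes, for illustration only.
d_model, d_ff, n_experts, n_layers = 2048, 1024, 64, 16
bottleneck, n_adapters = 8, 4

# Full fine-tuning touches every expert in every layer.
full = n_layers * n_experts * ffn_expert_params(d_model, d_ff)

# Routed adapters add, per layer, the adapters themselves plus a small router matrix.
router = d_model * n_adapters
peft = n_layers * (n_adapters * adapter_params(d_model, bottleneck) + router)

ratio = peft / full
print(f"expert weights: {full:,}  adapter weights: {peft:,}  ratio: {ratio:.4%}")
```

Under these assumed sizes the trainable adapter weights are a small fraction of a percent of the expert weights, which is the gap that motivates PEFT for MoE models.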
### Formula Summary

The key formulas in the paper are:

- **Forward propagation in the MoE mechanism**:
  \[ \text{MoE}(h_t)=\sum_{i=1}^{N}G(h_t)_i E_i(h_t) \]
  where \(G(h_t)=\text{TopK}(\text{Softmax}(h_t W_g), K)\) is the sparse gating function that assigns each token \(h_t\) to the \(K\) highest-scoring of the \(N\) experts \(E_i\).
- **PEFT expert update in PERFT**:
  \[ \Delta(h)=\text{UpProj}(\text{Act}(\text{DownProj}(h))) \]
  where \(\text{DownProj}\) and \(\text{UpProj}\) are the dimension-reducing and dimension-increasing projections respectively, and \(\text{Act}\) is a non-linear activation function.
- **Routing mechanism in PERFT-R**:
  \[ \Delta(h_t)=\sum_{i=1}^{M}\tilde{G}(h_t)_i\Delta_i(h_t) \]
  where \(\tilde{G}(h_t)\) is the gating function over the \(M\) PEFT experts \(\Delta_i\).

### Conclusion

By introducing the PERFT framework, the paper addresses the excessive parameter count and high computational cost of fine-tuning MoE models, and demonstrates the effectiveness and flexibility of PERFT on different tasks. This provides new perspectives and methods for future research, especially on the efficient adaptation of large-scale models.
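The formulas in the Formula Summary can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names, the ReLU activation, the absence of biases, the dense evaluation of every adapter, and all tensor sizes are assumptions made here. Gate weights are used as they come out of the top-K selection, without renormalization, to match the formula as written.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def top_k_gate(h, W_g, k):
    """G(h) = TopK(Softmax(h W_g), K): keep the k largest scores per token, zero the rest."""
    scores = softmax(h @ W_g)                                # (tokens, n_experts)
    n_drop = scores.shape[-1] - k
    drop = np.argsort(scores, axis=-1)[:, :n_drop]           # indices of the lowest scores
    gates = scores.copy()
    np.put_along_axis(gates, drop, 0.0, axis=-1)
    return gates


def adapter(h, W_down, W_up):
    """Delta(h) = UpProj(Act(DownProj(h))), with ReLU as the assumed activation."""
    return np.maximum(h @ W_down, 0.0) @ W_up


def routed_adapters(h, W_g_tilde, adapters, k):
    """Delta(h_t) = sum_i G~(h_t)_i * Delta_i(h_t) over M PEFT experts.

    Every adapter is evaluated densely for clarity; unselected ones get gate 0.
    """
    gates = top_k_gate(h, W_g_tilde, k)                      # (tokens, M)
    out = np.zeros_like(h)
    for i, (W_down, W_up) in enumerate(adapters):
        out += gates[:, [i]] * adapter(h, W_down, W_up)
    return out


# Hypothetical sizes: 5 tokens, hidden size 16, M = 4 adapters of bottleneck 2, top-2 routing.
rng = np.random.default_rng(0)
h = rng.normal(size=(5, 16))
W_g_tilde = rng.normal(size=(16, 4))
adapters = [(rng.normal(size=(16, 2)), rng.normal(size=(2, 16))) for _ in range(4)]

gates = top_k_gate(h, W_g_tilde, k=2)
delta = routed_adapters(h, W_g_tilde, adapters, k=2)
print(gates.shape, delta.shape)
```

The same `top_k_gate` function covers the original MoE router \(G\) and the adapter router \(\tilde{G}\); only the weight matrix and the number of experts differ. A practical implementation would dispatch each token only to its selected experts rather than evaluating all of them.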