Abstract: The Mixture-of-Experts (MoE) paradigm has emerged as a powerful approach for scaling transformers with improved resource utilization. However, efficiently fine-tuning MoE models remains largely underexplored. Inspired by recent works on Parameter-Efficient Fine-Tuning (PEFT), we present a unified framework for integrating PEFT modules directly into the MoE mechanism. Aligning with the core principles and architecture of MoE, our framework encompasses a set of design dimensions including various functional and composition strategies. By combining design choices within our framework, we introduce Parameter-Efficient Routed Fine-Tuning (PERFT) as a flexible and scalable family of PEFT strategies tailored for MoE models. Extensive experiments on adapting OLMoE-1B-7B and Mixtral-8$\times$7B for commonsense and arithmetic reasoning tasks demonstrate the effectiveness, scalability, and intriguing dynamics of PERFT. Additionally, we provide empirical findings for each specific design choice to facilitate better application of MoE and PEFT.

What problem does this paper attempt to address?

The problem that this paper attempts to solve is how to efficiently fine-tune Mixture-of-Experts (MoE) models to achieve effective downstream task adaptation without fully fine-tuning all parameters. Specifically, the paper mainly focuses on the following aspects:

1. **High cost of fine-tuning the MoE model**:
   - Although the MoE model significantly reduces the computational cost and maintains the model capacity through the sparse activation mechanism, the large number of expert parameters makes full-scale fine-tuning very expensive.
   - Existing fine-tuning methods cannot be directly applied to the MoE model because these methods are usually designed for dense models and do not consider the routing mechanism between sparsely activated experts in the MoE model.
2. **Explore specially designed Parameter-Efficient Fine-Tuning (PEFT) techniques**:
   - To address the above challenges, the paper proposes Parameter-Efficient Routed Fine-Tuning (PERFT), a parameter-efficient fine-tuning framework specifically designed for the MoE model.
   - PERFT aims to achieve a flexible and scalable fine-tuning strategy by introducing independent or embedded PEFT modules and combining the core principles and architectures of the MoE model.
3. **Evaluate the effectiveness of different design choices**:
   - The paper verifies the performance of different variants of PERFT (such as PERFT-R, PERFT-E, PERFT-D, and PERFT-S) on commonsense reasoning and arithmetic reasoning tasks through extensive experiments.
   - The experimental results show that PERFT can significantly improve the fine-tuning efficiency while remaining competitive, especially in cases with a low proportion of parameter activation.
### Formula Summary

The key formulas involved in the paper are as follows:

- **Forward propagation in the MoE mechanism**:
  \[ \text{MoE}(h_t)=\sum_{i=1}^{N}G(h_t)_i E_i(h_t) \]
  where $G(h_t)=\text{TopK}(\text{Softmax}(h_t W_g), K)$ is the sparse gating function that assigns each token $h_t$ to the $K$ experts with the highest gate scores, and $E_i$ is the $i$-th of the $N$ experts.
- **PEFT expert update in PERFT**:
  \[ \Delta(h)=\text{UpProj}(\text{Act}(\text{DownProj}(h))) \]
  where $\text{DownProj}$ and $\text{UpProj}$ are the dimension-reduction and dimension-increase projections respectively, and $\text{Act}$ is a non-linear activation function.
- **Routing mechanism in PERFT-R**:
  \[ \Delta(h_t)=\sum_{i=1}^{M}\tilde{G}(h_t)_i\Delta_i(h_t) \]
  where $\tilde{G}(h_t)$ is the gating function over the $M$ PEFT experts $\Delta_i$.

### Conclusion

By introducing the PERFT framework, the paper addresses the problems of excessive trainable parameters and high computational cost in fine-tuning MoE models, and demonstrates the effectiveness and flexibility of PERFT on different tasks. This provides new perspectives and methods for future research, especially in the efficient adaptation of large-scale models.
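To make the three formulas in the summary above concrete, here is a minimal pure-Python sketch for a single token. It is an illustration under stated assumptions, not the paper's implementation: the function names, the toy dimensions, the ReLU activation, the plain two-layer stand-in for the frozen experts, and the choice to add the routed update $\Delta(h_t)$ to the MoE output are all assumptions made for this example.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(h, W):
    """Row vector h (length d_in) times matrix W (d_in x d_out)."""
    return [sum(h[i] * W[i][j] for i in range(len(h))) for j in range(len(W[0]))]

def top_k_gate(h, W_g, k):
    """G(h) = TopK(Softmax(h W_g), K): keep the k largest probabilities, zero the rest.
    The kept weights are not renormalized, following the formula as written."""
    probs = softmax(matvec(h, W_g))
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [p if i in keep else 0.0 for i, p in enumerate(probs)]

def adapter(h, W_down, W_up):
    """Delta(h) = UpProj(Act(DownProj(h))); ReLU is an assumed choice of Act."""
    hidden = [max(0.0, x) for x in matvec(h, W_down)]
    return matvec(hidden, W_up)

def routed_sum(h, gate, experts):
    """sum_i gate_i * expert_i(h), skipping experts whose gate weight is zero."""
    out = [0.0] * len(h)
    for g, expert in zip(gate, experts):
        if g == 0.0:
            continue  # sparse activation: unselected experts are never evaluated
        out = [o + g * y for o, y in zip(out, expert(h))]
    return out

def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def make_ffn(d, hidden, rng):
    """Hypothetical helper: a d -> hidden -> d block with random weights."""
    W_down, W_up = rand_matrix(d, hidden, rng), rand_matrix(hidden, d, rng)
    return lambda h: adapter(h, W_down, W_up)

def perft_r_layer(h, W_g, experts, K, W_g_peft, peft_experts, K_peft):
    """Frozen MoE output plus a separately routed sum of PEFT experts.
    Adding the two terms is an assumption of this sketch."""
    moe_out = routed_sum(h, top_k_gate(h, W_g, K), experts)
    delta = routed_sum(h, top_k_gate(h, W_g_peft, K_peft), peft_experts)
    return [a + b for a, b in zip(moe_out, delta)]

# Toy sizes (assumed): hidden size d, N frozen experts with top-K routing,
# M PEFT experts of bottleneck width r with their own top-K_peft router.
rng = random.Random(0)
d, N, K, M, K_peft, r = 8, 4, 2, 2, 1, 2
W_g = rand_matrix(d, N, rng)
experts = [make_ffn(d, 16, rng) for _ in range(N)]
W_g_peft = rand_matrix(d, M, rng)
peft_experts = [make_ffn(d, r, rng) for _ in range(M)]
h = [rng.uniform(-1.0, 1.0) for _ in range(d)]
out = perft_r_layer(h, W_g, experts, K, W_g_peft, peft_experts, K_peft)
print(len(out))
```

In a fine-tuning setting only `W_g_peft` and the PEFT experts would be trained while `W_g` and `experts` stay frozen, which is why the bottleneck width `r` and the number of PEFT experts `M` control the trainable parameter count.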

PERFT: Parameter-Efficient Routed Fine-Tuning for Mixture-of-Expert Model
