Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models

Zihan Wang, Deli Chen, Damai Dai, Runxin Xu, Zhuoshu Li, Y. Wu
2024-07-05
Abstract: Parameter-efficient fine-tuning (PEFT) is crucial for customizing Large Language Models (LLMs) with constrained resources. Although various PEFT methods exist for dense-architecture LLMs, PEFT for sparse-architecture LLMs remains underexplored. In this work, we study PEFT for LLMs with the Mixture-of-Experts (MoE) architecture, and our contributions are threefold: (1) We investigate the dispersion degree of the activated experts in customized tasks and find that the routing distribution for a specific task tends to be highly concentrated, while the distribution of activated experts varies significantly across different tasks. (2) We propose Expert-Specialized Fine-Tuning, or ESFT, which tunes the experts most relevant to downstream tasks while freezing the other experts and modules; experimental results demonstrate that our method not only improves tuning efficiency but also matches or even surpasses the performance of full-parameter fine-tuning. (3) We further analyze the impact of the MoE architecture on expert-specialized fine-tuning. We find that MoE models with finer-grained experts are more advantageous in selecting the combination of experts most relevant to downstream tasks, thereby enhancing both training efficiency and effectiveness. Our code is available at https://github.com/deepseek-ai/ESFT.
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper studies Parameter-Efficient Fine-Tuning (PEFT) for large language models (LLMs) with a sparse Mixture-of-Experts (MoE) architecture. The authors find that the experts activated for any single task are highly concentrated, while the set of activated experts differs significantly from one task to another. Based on this observation, the paper proposes Expert-Specialized Fine-Tuning (ESFT), which fine-tunes only the experts most relevant to the downstream task while freezing all other experts and modules. Experimental results show that this method not only improves fine-tuning efficiency but also achieves performance comparable to, or even better than, full-parameter fine-tuning, while saving up to 90% of storage space and 30% of training time. The paper also analyzes how the MoE architecture affects expert-specialized fine-tuning and finds that models with finer-grained experts are better at selecting the expert combination most relevant to a task, which improves both training efficiency and effectiveness. In the experiments, training only 5-15% of the experts is enough to achieve good performance across different tasks. The authors also compare ESFT with other PEFT methods (such as LoRA) under different computational constraints and show that ESFT uses training resources more effectively. In summary, the paper addresses how to customize sparse-architecture LLMs effectively and efficiently through fine-tuning in resource-constrained scenarios.
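
The selection step described above can be illustrated with a small sketch. The summary does not say how expert relevance is scored, so this example makes two assumptions: relevance is the share of a task's routing decisions that go to each expert, and experts are added in descending order of relevance until a cumulative threshold is reached. The function names (`expert_token_ratios`, `select_experts`, `trainable_mask`) and the toy routing data are hypothetical and are not taken from the authors' code.

```python
from collections import Counter


def expert_token_ratios(routed_experts, num_experts):
    """Share of routing decisions received by each expert.

    routed_experts: one list of selected expert ids per token,
    collected on sample data from the target task (assumed input).
    """
    counts = Counter(e for token in routed_experts for e in token)
    total = sum(counts.values())
    return [counts.get(e, 0) / total for e in range(num_experts)]


def select_experts(ratios, threshold):
    """Take experts in descending order of relevance until their
    cumulative share reaches `threshold` (an assumed hyperparameter)."""
    ranked = sorted(range(len(ratios)), key=lambda e: ratios[e], reverse=True)
    chosen, cumulative = [], 0.0
    for e in ranked:
        if cumulative >= threshold:
            break
        chosen.append(e)
        cumulative += ratios[e]
    return sorted(chosen)


def trainable_mask(num_experts, chosen):
    """True for experts that would be tuned, False for frozen ones."""
    chosen = set(chosen)
    return [e in chosen for e in range(num_experts)]


# Toy routing trace: 6 tokens, top-2 routing over 8 experts (made-up data).
routing = [[0, 3], [0, 3], [0, 5], [3, 0], [0, 1], [3, 0]]
ratios = expert_token_ratios(routing, num_experts=8)
chosen = select_experts(ratios, threshold=0.8)
mask = trainable_mask(8, chosen)
print(chosen, mask)
```

In this toy trace the routing is concentrated on two of the eight experts, so only those two would be marked trainable. In a real training loop the mask would decide which expert parameters receive gradients, with all other modules kept frozen.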