Retraining-Free Merging of Sparse Mixture-of-Experts via Hierarchical Clustering

I-Chun Chen, Hsu-Shen Liu, Wei-Fang Sun, Chen-Hao Chao, Yen-Chang Hsu, Chun-Yi Lee
2024-10-11
Abstract: Sparse Mixture-of-Experts (SMoE) models represent a significant breakthrough in large language model development. These models enable performance improvements without a proportional increase in inference costs. By selectively activating a small set of parameters during task execution, SMoEs enhance model capacity. However, their deployment remains challenging due to the substantial memory footprint required to accommodate the growing number of experts. This constraint renders them less feasible in environments with limited hardware resources. To address this challenge, we propose Hierarchical Clustering for Sparsely activated Mixture of Experts (HC-SMoE), a task-agnostic expert merging framework that reduces SMoE model parameters without retraining. Unlike previous methods, HC-SMoE employs hierarchical clustering based on expert outputs. This approach ensures that the merging process remains unaffected by routing decisions. The output-based clustering strategy captures functional similarities between experts, offering an adaptable solution for models with numerous experts. We validate our approach through extensive experiments on eight zero-shot language tasks and demonstrate its effectiveness in large-scale SMoE models such as Qwen and Mixtral. Our comprehensive results demonstrate that HC-SMoE consistently achieves strong performance, which highlights its potential for real-world deployment.
Machine Learning
What problem does this paper attempt to address?
The paper addresses how to reduce the number of experts in Sparse Mixture-of-Experts (SMoE) models without retraining, in order to lower the memory footprint and computational cost while maintaining model performance. Specifically, it proposes a hierarchical clustering-based method, Hierarchical Clustering for Sparsely Activated Mixture of Experts (HC-SMoE), that reduces model parameters by clustering and merging experts. This makes deployment more feasible, especially in resource-constrained environments.

### Main Issues:

1. **High Memory Footprint**: SMoE models must store a large number of expert parameters, leading to very high memory usage, which limits their use in resource-constrained environments.
2. **High Computational Cost**: Although sparse activation reduces computation during inference, the overall model size is still large, resulting in high computational costs.
3. **Optimization Without Retraining**: Existing methods for reducing the number of experts usually require retraining the model, which is impractical for large-scale models.

### Solution:

- **Hierarchical Clustering**: The paper clusters experts by the similarity of their outputs, so that each merged expert retains the functionality of the original experts it replaces.
- **No Retraining**: The method does not require retraining the model, which saves a significant amount of computational resources.
- **Task Agnostic**: HC-SMoE is task-agnostic and applicable to various tasks and models.

### Experimental Validation:

- **Datasets**: The experiments used the C4 dataset for calibration and evaluated model performance on multiple zero-shot language tasks.
- **Benchmark Comparison**: The method was compared with existing expert pruning and merging methods, including O-prune, S-prune, F-prune, and M-SMoE.
- **Results**: Experimental results show that HC-SMoE can maintain or even improve model performance while reducing the number of experts, performing particularly well on large-scale models such as Qwen and Mixtral.

### Contributions:

1. Proposed the first retraining-free, task-agnostic, and scalable SMoE merging strategy.
2. Validated the effectiveness of using expert outputs as a similarity measure for clustering.
3. Emphasized the importance of clustering quality for merging effectiveness and demonstrated the advantages of the hierarchical clustering method.
4. Showed experimentally that HC-SMoE performs strongly across multiple benchmarks, making it suitable for large-scale SMoE models.
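
The output-based clustering and merging idea summarized above can be illustrated with a small, self-contained sketch. It is not the authors' implementation. It assumes toy linear experts stored as plain weight matrices, uses each expert's mean output over a handful of calibration inputs as its feature vector, clusters with average linkage under Euclidean distance, and merges each cluster by uniform weight averaging. The linkage, distance, and merging rule are illustrative assumptions, and every function name (`mean_output`, `cluster_experts`, `merge_cluster`, `make_toy_experts`) is hypothetical.

```python
import math
import random


def expert_output(weights, x):
    """Apply a toy linear expert (a list of weight rows) to one input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def mean_output(weights, calib_inputs):
    """Average an expert's output over the calibration inputs."""
    outs = [expert_output(weights, x) for x in calib_inputs]
    return [sum(col) / len(outs) for col in zip(*outs)]


def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def cluster_experts(features, n_clusters):
    """Agglomerative clustering with average linkage (an assumed choice)."""
    clusters = [[i] for i in range(len(features))]

    def linkage(c1, c2):
        total = sum(euclidean(features[i], features[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge it.
        pairs = ((p, q) for p in range(len(clusters))
                 for q in range(p + 1, len(clusters)))
        a, b = min(pairs, key=lambda pq: linkage(clusters[pq[0]], clusters[pq[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return [sorted(c) for c in clusters]


def merge_cluster(experts, members):
    """Merge one cluster by uniformly averaging weights (an assumed rule)."""
    n = len(members)
    rows, cols = len(experts[0]), len(experts[0][0])
    return [[sum(experts[m][r][c] for m in members) / n for c in range(cols)]
            for r in range(rows)]


def make_toy_experts(seed=0):
    """Four toy experts: two noisy copies of each of two base matrices."""
    rng = random.Random(seed)
    base_a = [[1.0, 0.0], [0.0, 1.0]]
    base_b = [[-1.0, 2.0], [3.0, -1.0]]
    return [[[w + rng.uniform(-0.01, 0.01) for w in row] for row in base]
            for base in (base_a, base_a, base_b, base_b)]


calib_inputs = [[1.0, 2.0], [0.5, -1.0], [2.0, 0.5]]  # stand-in calibration data
experts = make_toy_experts()
features = [mean_output(w, calib_inputs) for w in experts]
clusters = cluster_experts(features, n_clusters=2)
merged = [merge_cluster(experts, c) for c in clusters]
print(clusters)  # [[0, 1], [2, 3]]
```

Because the features are built only from expert outputs, the router plays no part in the grouping, which mirrors the routing-independence the abstract describes. In a real SMoE layer the experts are feed-forward blocks and the calibration inputs would be hidden states from a calibration set such as C4, so this sketch only conveys the overall shape of the procedure.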