Retraining-Free Merging of Sparse MoE via Hierarchical Clustering

I-Chun Chen, Hsu-Shen Liu, Wei-Fang Sun, Chen-Hao Chao, Yen-Chang Hsu, Chun-Yi Lee
2025-02-01
Abstract: Sparse Mixture-of-Experts (SMoE) models represent a significant advancement in large language model (LLM) development through their efficient parameter utilization. These models achieve substantial performance improvements at reduced inference costs. However, the deployment of SMoE models faces constraints from extensive memory requirements of expert components in resource-limited environments. To address these limitations, this paper introduces Hierarchical Clustering for Sparsely activated Mixture of Experts (HC-SMoE), a task-agnostic expert merging framework for parameter reduction without retraining. HC-SMoE introduces a novel hierarchical clustering approach based on expert outputs to ensure merging robustness independent of routing decisions. The proposed output-based clustering method enables effective capture of functional relationships between experts for large-scale architectures. We provide theoretical analysis and comprehensive evaluations across multiple zero-shot language tasks to demonstrate HC-SMoE's effectiveness in state-of-the-art models including Qwen and Mixtral. The experimental results validate HC-SMoE's superior performance and practical applicability for real-world deployments.
Machine Learning
What problem does this paper attempt to address?
The paper addresses the excessive memory requirements of Sparse Mixture-of-Experts (SMoE) models when they are deployed in resource-constrained environments. Although SMoE models achieve significant performance improvements at low inference cost through efficient use of parameters, the memory footprint of their expert components limits their practical use. To address this, the authors propose Hierarchical Clustering for Sparsely activated Mixture of Experts (HC-SMoE). HC-SMoE merges experts through hierarchical clustering, reducing the number of model parameters without retraining. According to the paper, the method reduces memory usage while maintaining model performance, and it applies across tasks without task-specific adjustments.

### Main Contributions

1. **Propose HC-SMoE**: A general-purpose, scalable expert merging strategy that does not require retraining.
2. **Use expert output as a similarity measure**: Compared with routing scores or weights, expert outputs capture functional relationships between experts more effectively.
3. **Show the importance of clustering quality, theoretically and experimentally**: The paper supports the proposed hierarchical clustering method with theoretical analysis and verifies the effectiveness of the resulting expert grouping experimentally.
4. **Extensive experimental verification**: Across multiple benchmarks and SMoE models of different scales, HC-SMoE consistently outperforms the baseline methods.

### Method Overview

The core idea of HC-SMoE is to group experts by their outputs through hierarchical clustering and then merge the experts in each group to reduce the number of parameters. The steps are as follows:

1. **Calculate the similarity of expert outputs**: Use each expert's average output on a calibration dataset as the basis for the similarity measure.
2. **Hierarchical clustering**: Cluster the experts hierarchically using a distance measure over these outputs, merging the most similar experts step by step.
3. **Expert merging**: Within each cluster, merge the experts into a single new expert through methods such as weighted averaging.

### Experimental Results

The experimental results show that HC-SMoE performs well across model sizes and tasks. When the number of experts is reduced, the performance degradation is relatively small, and the merged model even outperforms the original model on some tasks. For example, on the Qwen and Mixtral models, HC-SMoE maintains high performance and outperforms the other baseline methods even when the number of experts is reduced by 50%.

In conclusion, the HC-SMoE framework addresses the memory limitations of SMoE models in practical deployment and offers a new approach to optimizing large-scale language models.
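
The three steps in the method overview (average expert outputs, hierarchical clustering, merging) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation, and several choices here are assumptions made for illustration only: experts are toy linear maps `(W, b)` rather than feed-forward blocks, the distance is Euclidean, the linkage is average linkage, and the merge weights default to uniform. The function names `mean_expert_outputs`, `hierarchical_clusters`, and `merge_experts` are hypothetical.

```python
import numpy as np


def mean_expert_outputs(experts, calib_x):
    """Step 1: average output of each expert over the calibration inputs.

    Assumption: each expert is a toy linear map y = x @ W + b.
    """
    return np.stack([(calib_x @ W + b).mean(axis=0) for W, b in experts])


def hierarchical_clusters(features, n_clusters):
    """Step 2: bottom-up clustering until n_clusters groups remain.

    Assumption: Euclidean distance with average linkage.
    """
    clusters = [[i] for i in range(len(features))]
    # Pairwise distances between the per-expert mean-output vectors.
    dist = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    while len(clusters) > n_clusters:
        best = (np.inf, 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average linkage: mean distance over all cross-cluster pairs.
                d = dist[np.ix_(clusters[a], clusters[b])].mean()
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    return [sorted(c) for c in clusters]


def merge_experts(experts, clusters, weights=None):
    """Step 3: replace each cluster by a weighted average of its parameters.

    Assumption: uniform weights unless per-expert weights are supplied.
    """
    weights = np.ones(len(experts)) if weights is None else np.asarray(weights, dtype=float)
    merged = []
    for members in clusters:
        w = weights[members] / weights[members].sum()
        W = sum(wi * experts[i][0] for wi, i in zip(w, members))
        b = sum(wi * experts[i][1] for wi, i in zip(w, members))
        merged.append((W, b))
    return merged


# Toy usage: four experts, where 0/2 and 1/3 behave almost identically.
rng = np.random.default_rng(0)
base_a, base_b = rng.normal(size=(8, 4)), rng.normal(size=(8, 4))
experts = [
    (base_a, np.full(4, 5.0)),
    (base_b, np.full(4, -5.0)),
    (base_a + 0.01 * rng.normal(size=(8, 4)), np.full(4, 5.0)),
    (base_b + 0.01 * rng.normal(size=(8, 4)), np.full(4, -5.0)),
]
calib_x = rng.normal(size=(64, 8))  # stand-in for calibration activations

features = mean_expert_outputs(experts, calib_x)
clusters = hierarchical_clusters(features, n_clusters=2)
merged = merge_experts(experts, clusters)
print(clusters, len(merged))
```

Halving the expert count here mirrors the 50% reduction setting mentioned in the results. For larger expert counts, a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` would likely be a better fit than the quadratic search loop used in this sketch.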