Abstract: Sparse Mixture-of-Experts (SMoE) models represent a significant advancement in large language model (LLM) development through their efficient parameter utilization. These models achieve substantial performance improvements at reduced inference costs. However, the deployment of SMoE models faces constraints from extensive memory requirements of expert components in resource-limited environments. To address these limitations, this paper introduces Hierarchical Clustering for Sparsely activated Mixture of Experts (HC-SMoE), a task-agnostic expert merging framework for parameter reduction without retraining. HC-SMoE introduces a novel hierarchical clustering approach based on expert outputs to ensure merging robustness independent of routing decisions. The proposed output-based clustering method enables effective capture of functional relationships between experts for large-scale architectures. We provide theoretical analysis and comprehensive evaluations across multiple zero-shot language tasks to demonstrate HC-SMoE's effectiveness in state-of-the-art models including Qwen and Mixtral. The experimental results validate HC-SMoE's superior performance and practical applicability for real-world deployments.

What problem does this paper attempt to address?

This paper addresses the large memory footprint of Sparse Mixture-of-Experts (SMoE) models when they are deployed in resource-constrained environments. Although SMoE models achieve significant performance improvements at low inference cost through efficient use of parameters, the memory required to hold their expert components limits their use in practice. To address this, the authors propose Hierarchical Clustering for Sparsely activated Mixture of Experts (HC-SMoE). HC-SMoE merges experts through hierarchical clustering, reducing the number of model parameters without retraining. The method reduces memory usage while maintaining model performance, and it applies across tasks without task-specific adjustments.

### Main Contributions

1. **HC-SMoE**: A general-purpose, scalable expert merging strategy that does not require retraining.
2. **Expert outputs as the similarity measure**: Compared with routing scores or weights, expert outputs capture functional relationships between experts more effectively.
3. **Theoretical and experimental support for clustering quality**: The paper analyzes the proposed hierarchical clustering method theoretically and verifies experimentally that it groups experts effectively.
4. **Extensive experimental verification**: Across multiple benchmarks and SMoE models of different scales, HC-SMoE shows consistently superior performance.

### Method Overview

The core idea of HC-SMoE is to group experts by their outputs using hierarchical clustering and then merge the experts in each group to reduce the parameter count. The steps are as follows:

1. **Calculate the similarity of expert outputs**: Use each expert's average output on a calibration dataset as the basis for the similarity measure.
2. **Hierarchical clustering**: Cluster the experts hierarchically using a distance measure over those outputs, merging the most similar experts step by step.
3. **Expert merging**: Within each cluster, merge the experts into a single new expert, for example by weighted averaging.

### Experimental Results

The experiments show that HC-SMoE performs well across model sizes and tasks. When the number of experts is reduced, the performance degradation is relatively small, and on some tasks the merged model even outperforms the original. For example, on the Qwen and Mixtral models, HC-SMoE maintains high performance and outperforms the other baseline methods even when the number of experts is reduced by 50%.

In conclusion, the HC-SMoE framework addresses the memory limitations of SMoE models in practical deployment and offers a new approach to optimizing large-scale language models.
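
The three steps listed under Method Overview can be illustrated with a small, dependency-free sketch. This is not the authors' implementation. It assumes Euclidean distance between mean expert outputs, average linkage, and flat weight vectors, and the names `cluster_experts` and `merge_experts` are hypothetical, chosen only for this example.

```python
from itertools import combinations
from math import dist


def cluster_experts(mean_outputs, num_clusters):
    """Agglomerative clustering of experts (assumed: average linkage, Euclidean distance).

    mean_outputs: one vector per expert, e.g. its average output on calibration data.
    Returns a sorted list of clusters, each a sorted list of expert indices.
    """
    clusters = [[i] for i in range(len(mean_outputs))]

    def linkage(a, b):
        # Average pairwise distance between the members of clusters a and b.
        total = sum(dist(mean_outputs[i], mean_outputs[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > num_clusters:
        # Find the two closest clusters and fuse them.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        fused = sorted(clusters[a] + clusters[b])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [fused]
    return sorted(clusters)


def merge_experts(expert_weights, clusters, usage=None):
    """Merge each cluster into one expert by (optionally weighted) averaging.

    expert_weights: one flat weight vector per expert (a simplification).
    usage: optional per-expert averaging weights; uniform if omitted.
    """
    merged = []
    for members in clusters:
        w = [1.0] * len(members) if usage is None else [usage[i] for i in members]
        total = sum(w)
        dim = len(expert_weights[members[0]])
        merged.append(
            [sum(wi * expert_weights[i][d] for wi, i in zip(w, members)) / total
             for d in range(dim)]
        )
    return merged


# Toy data: experts 0 and 1 behave alike, as do experts 2 and 3.
mean_outputs = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.2]]
expert_weights = [[1.0, 2.0], [3.0, 4.0], [10.0, 10.0], [20.0, 30.0]]

clusters = cluster_experts(mean_outputs, num_clusters=2)
merged_weights = merge_experts(expert_weights, clusters)
print(clusters)        # [[0, 1], [2, 3]]
print(merged_weights)  # [[2.0, 3.0], [15.0, 20.0]]
```

Here four experts are reduced to two, matching the 50% reduction mentioned in the results. In a real SMoE layer the weights would be per-expert parameter tensors rather than flat lists, and the linkage and distance choices would need to follow the paper.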

Retraining-Free Merging of Sparse MoE via Hierarchical Clustering
