SiDA-MoE: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models

Zhixu Du, Shiyu Li, Yuhao Wu, Xiangyu Jiang, Jingwei Sun, Qilin Zheng, Yongkai Wu, Ang Li, Hai "Helen" Li, Yiran Chen
2024-05-18
Abstract: Mixture-of-Experts (MoE) has emerged as a favorable architecture in the era of large models due to its inherent advantage, i.e., enlarging model capacity without incurring notable computational overhead. Yet, the realization of such benefits often results in inefficient GPU memory utilization, as large portions of the model parameters remain dormant during inference. Moreover, the memory demands of large models consistently outpace the memory capacity of contemporary GPUs. Addressing this, we introduce SiDA-MoE ($\textbf{S}$parsity-$\textbf{i}$nspired $\textbf{D}$ata-$\textbf{A}$ware), an efficient inference approach tailored for large MoE models. SiDA-MoE judiciously exploits both the system's main memory, which is now abundant and readily scalable, and GPU memory by capitalizing on the inherent sparsity of expert activation in MoE models. By adopting a data-aware perspective, SiDA-MoE achieves enhanced model efficiency with a negligible performance drop. Specifically, SiDA-MoE attains a remarkable speedup in MoE inference with up to a $3.93\times$ throughput increase, up to $72\%$ latency reduction, and up to $80\%$ GPU memory saving with as little as a $1\%$ performance drop. This work paves the way for scalable and efficient deployment of large MoE models, even with constrained resources. Code is available at:
Machine Learning; Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
### Problems the paper attempts to solve

The paper "SiDA: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models" aims to solve two main problems that large Mixture-of-Experts (MoE) models face during inference:

1. **Inefficient GPU memory utilization**:
   - MoE models add parameters to increase model capacity, but most of those parameters are inactive during inference, so much of the GPU memory they occupy does no useful work. For example, in Switch Transformers, the ineffectively occupied GPU memory ranges from 24GB to 50GB even for shorter sentences.
   - This inefficient memory utilization limits the deployment and scaling of large MoE models on resource-constrained systems.
2. **High expert selection overhead**:
   - During forward propagation, selecting the appropriate experts takes a large share of the time. For larger models such as Switch-base-256, expert selection accounts for nearly 75% of the total inference time and becomes the bottleneck for inference latency.
   - The overhead grows as the model scales up, which makes the problem more pressing for larger models.

### Solutions

To address these problems, the paper proposes an efficient inference system named SiDA (Sparsity-Inspired Data-Aware). Its main features and contributions are as follows:

1. **Data-aware hash function**:
   - SiDA uses a hash function trained offline to predict which experts each batch will activate, along with their corresponding scaling factors. Because the hash function yields the activation pattern of each sample in advance, experts can be loaded and unloaded dynamically without interrupting inference.
   - The hash function is designed around sparse cross-embedding dependencies: only a limited number of embeddings in a sequence jointly affect which experts are activated. This lets the hash function predict expert activation accurately while remaining lightweight.
2. **Efficient dynamic loading and unloading mechanism**:
   - SiDA runs two threads in parallel: an inference thread and a hash construction thread. The hash construction thread builds the hash table for each batch, while the inference thread dynamically manages the experts in the MoE layers according to that table.
   - In this way, SiDA can offload inactive experts to RAM during inference, significantly reducing GPU memory occupation while maintaining a high inference speed.
3. **Significant performance improvement**:
   - Experimental results show that SiDA achieves a throughput increase of up to 3.93 times, a latency reduction of up to 72%, and GPU memory savings of up to 80%, while keeping the model performance degradation within 1%.

### Conclusion

By introducing the data-aware hash function and the efficient dynamic loading and unloading mechanism, SiDA addresses both the inefficient memory utilization and the high expert selection overhead that large MoE models face during inference. This provides a new approach to the efficient deployment and scaling of large MoE models on resource-constrained systems.
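
The two-thread mechanism described under "Solutions" can be illustrated with a small sketch. This is not the authors' implementation: it is a toy model using only the Python standard library, in which plain dictionaries stand in for GPU memory and RAM, experts are simple functions, and `predict_experts` is a hypothetical placeholder for the offline-trained hash function (here it just takes the token id modulo the number of experts). All names are assumptions made for illustration, and the scaling factors mentioned above are omitted.

```python
import queue
import threading

NUM_EXPERTS = 8  # assumed toy value, not taken from the paper


def predict_experts(batch):
    """Hypothetical stand-in for the offline-trained hash function.

    Maps each token id in the batch to one expert id.
    """
    return {tok: tok % NUM_EXPERTS for tok in batch}


def hash_builder(batches, tables):
    """Hash construction thread: emit one (batch, hash table) pair per batch."""
    for batch in batches:
        tables.put((batch, predict_experts(batch)))
    tables.put(None)  # sentinel: no more batches


def inference(tables, gpu, ram, log):
    """Inference thread: keep only the predicted experts in the 'GPU' dict."""
    while (item := tables.get()) is not None:
        batch, table = item
        needed = set(table.values())
        for e in list(gpu):  # offload experts this batch will not use
            if e not in needed:
                ram[e] = gpu.pop(e)
        for e in needed:  # load predicted experts that are still in 'RAM'
            if e not in gpu:
                gpu[e] = ram.pop(e)
        # Route each token straight to its predicted expert (no router call).
        outputs = [gpu[table[tok]](tok) for tok in batch]
        log.append((sorted(gpu), outputs))


def make_expert(i):
    """Toy expert: a function whose output reveals which expert ran."""
    return lambda x: x * 10 + i


def run(batches):
    ram = {i: make_expert(i) for i in range(NUM_EXPERTS)}  # all experts start in 'RAM'
    gpu, log = {}, []
    tables = queue.Queue(maxsize=2)  # lets the builder run ahead of inference
    builder = threading.Thread(target=hash_builder, args=(batches, tables))
    worker = threading.Thread(target=inference, args=(tables, gpu, ram, log))
    builder.start()
    worker.start()
    builder.join()
    worker.join()
    return log
```

Calling `run([[1, 9, 2], [3, 3]])` returns one log entry per batch, each listing the experts resident on the simulated GPU and the per-token outputs. The bounded queue is one possible way to let hash construction run ahead of inference by a batch or two; a real system would move tensors between devices rather than dictionary entries.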