Abstract: Mixture-of-Experts (MoE) has emerged as a favorable architecture in the era of large models due to its inherent advantage, i.e., enlarging model capacity without incurring notable computational overhead. Yet realizing this benefit often results in ineffective GPU memory utilization, as large portions of the model parameters remain dormant during inference. Moreover, the memory demands of large models consistently outpace the memory capacity of contemporary GPUs. To address this, we introduce SiDA-MoE ($\textbf{S}$parsity-$\textbf{i}$nspired $\textbf{D}$ata-$\textbf{A}$ware), an efficient inference approach tailored for large MoE models. SiDA-MoE judiciously exploits both the system's main memory, which is now abundant and readily scalable, and GPU memory by capitalizing on the inherent sparsity of expert activation in MoE models. By adopting a data-aware perspective, SiDA-MoE achieves enhanced model efficiency with a negligible performance drop. Specifically, SiDA-MoE attains a remarkable speedup in MoE inference, with up to a $3.93\times$ increase in throughput, up to a $72\%$ reduction in latency, and up to $80\%$ GPU memory savings, with a performance drop as low as $1\%$. This work paves the way for scalable and efficient deployment of large MoE models, even with constrained resources. Code is available at:

What problem does this paper attempt to address?

### Problems the paper attempts to solve

The paper "SiDA: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models" aims to solve two main problems that large Mixture-of-Experts (MoE) models face during inference:

1. **Inefficient GPU memory utilization**:
   - MoE models add parameters to increase model capacity, but most of these parameters are inactive during inference, so much of the occupied GPU memory does no useful work. For example, in Switch Transformers, even for shorter sentences, the ineffective GPU memory occupation is as high as 24GB to 50GB.
   - This inefficient memory utilization limits the deployment and scaling of large MoE models in resource-constrained systems.
2. **High expert selection overhead**:
   - During forward propagation, selecting the appropriate experts consumes a great deal of time. For larger models such as Switch-base-256, expert selection accounts for nearly 75% of the total inference time, making it a bottleneck for inference latency.
   - This overhead grows with model scale, which makes the problem more pressing for larger models.

### Solutions

To address the above problems, the paper proposes an efficient inference system named SiDA (Sparsity-Inspired Data-Aware). The main features and contributions of SiDA are as follows:

1. **Data-aware hash function**:
   - SiDA uses an offline-trained hash function to predict the experts activated in each batch and their corresponding scaling factors. Because the hash function yields the activation pattern of each sample in advance, experts can be loaded and unloaded dynamically without interrupting inference.
   - The design of the hash function takes into account sparse cross-embedding dependencies, that is, a limited number of embeddings in the sequence jointly affect the activation of experts. This enables the hash function to predict the activation state of experts accurately while remaining lightweight.
2. **Efficient dynamic loading and unloading mechanism**:
   - SiDA runs two threads in parallel: an inference thread and a hash construction thread. The hash construction thread builds the hash table for each batch, while the inference thread dynamically manages the experts in the MoE layer according to the hash table.
   - In this way, SiDA can offload inactive experts to RAM during inference, significantly reducing GPU memory occupation while maintaining a high inference speed.
3. **Significant performance improvement**:
   - Experimental results show that SiDA can achieve a throughput increase of up to 3.93 times, a latency reduction of up to 72%, and GPU memory savings of up to 80%, with a performance drop as low as 1%.

### Conclusion

By introducing the data-aware hash function and the efficient dynamic loading and unloading mechanism, SiDA addresses the inefficient memory utilization and high expert selection overhead that large MoE models face during inference. This provides a new solution for the efficient deployment and scaling of large MoE models in resource-constrained systems.
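The two-thread mechanism described under Solutions can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the paper's implementation: `predict_experts` is a hypothetical stand-in that uses a checksum where SiDA uses an offline-trained predictor, `NUM_EXPERTS` is a toy value, GPU residency is modeled as a Python set rather than real device memory, and the predicted scaling factors are omitted.

```python
import queue
import threading
import zlib

NUM_EXPERTS = 8  # toy value chosen for illustration, not from the paper


def predict_experts(batch):
    """Hypothetical stand-in for the offline-trained hash function:
    maps each token to one expert id via a deterministic checksum."""
    return [zlib.crc32(tok.encode()) % NUM_EXPERTS for tok in batch]


def hash_builder(batches, tables):
    """Hash construction thread: emit one expert table per batch."""
    for batch in batches:
        tables.put((batch, predict_experts(batch)))
    tables.put(None)  # sentinel: no more batches


def inference(tables, log):
    """Inference thread: keep only the predicted experts 'on GPU'."""
    on_gpu = set()  # expert ids resident in (simulated) GPU memory
    while True:
        item = tables.get()
        if item is None:
            break
        batch, table = item
        needed = set(table)
        offloaded = on_gpu - needed  # inactive experts go back to RAM
        loaded = needed - on_gpu     # missing experts are fetched from RAM
        on_gpu = needed
        # A real system would run the MoE layers here using `table`.
        log.append({"batch": batch, "table": table, "loaded": loaded,
                    "offloaded": offloaded, "resident": set(on_gpu)})


def serve(batches):
    """Run both threads and return the per-batch residency log."""
    tables = queue.Queue(maxsize=2)  # small buffer lets the builder run ahead
    log = []
    builder = threading.Thread(target=hash_builder, args=(batches, tables))
    worker = threading.Thread(target=inference, args=(tables, log))
    builder.start()
    worker.start()
    builder.join()
    worker.join()
    return log


if __name__ == "__main__":
    for step in serve([["the", "cat"], ["sat", "down", "now"]]):
        print(step["batch"], "-> resident experts:", sorted(step["resident"]))
```

The bounded queue is one possible design choice: it lets the hash construction thread work a batch or two ahead of inference without building tables for the whole input up front. In this toy model, at most one expert per token is resident at any time, which reflects the sparsity the summary describes.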

SiDA-MoE: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models

DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale

MoESys: A Distributed and Efficient Mixture-of-Experts Training and Inference System for Internet Services

MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs

MoNDE: Mixture of Near-Data Experts for Large-Scale Sparse Models

MoE-Infinity: Offloading-Efficient MoE Model Serving

SEER-MoE: Sparse Expert Efficiency through Regularization for Mixture-of-Experts

SwapMoE: Efficient Memory-Constrained Serving of Large Sparse MoE Models Via Dynamic Expert Pruning and Swapping

Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference

Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference

AdapMoE: Adaptive Sensitivity-based Expert Gating and Management for Efficient MoE Inference

ProMoE: Fast MoE-based LLM Serving using Proactive Caching

Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models

ExpertFlow: Optimized Expert Activation and Token Allocation for Efficient Mixture-of-Experts Inference

ScheMoE: An Extensible Mixture-of-Experts Distributed Training System with Tasks Scheduling

Shortcut-connected Expert Parallelism for Accelerating Mixture-of-Experts

Staleness-Centric Optimizations for Efficient Diffusion MoE Inference

HetuMoE: An Efficient Trillion-scale Mixture-of-Expert Distributed Training System

A Survey on Inference Optimization Techniques for Mixture of Experts Models

HEXA-MoE: Efficient and Heterogeneous-aware MoE Acceleration with ZERO Computation Redundancy