Abstract: Large language models (LLMs) based on transformers have made significant strides in recent years, a success driven by scaling up model size. Despite their high algorithmic performance, the computational and memory requirements of LLMs present unprecedented challenges. To tackle the high compute requirements of LLMs, the Mixture-of-Experts (MoE) architecture was introduced, which can scale its model size without proportionally scaling up its computational requirements. Unfortunately, MoE's high memory demands and dynamic activation of sparse experts restrict its applicability to real-world problems. Previous solutions that offload MoE's memory-hungry expert parameters to CPU memory fall short because the latency of migrating activated experts from CPU to GPU incurs high performance overhead. Our proposed Pre-gated MoE system effectively tackles the compute and memory challenges of conventional MoE architectures using our algorithm-system co-design. Pre-gated MoE employs our novel pre-gating function, which alleviates the dynamic nature of sparse expert activation, allowing our proposed system to address the large memory footprint of MoEs while also achieving high performance. We demonstrate that Pre-gated MoE improves performance and reduces GPU memory consumption while maintaining the same level of model quality. These features allow our Pre-gated MoE system to cost-effectively deploy large-scale LLMs using just a single GPU with high performance.

What problem does this paper attempt to address?

The paper addresses the challenges that the Mixture-of-Experts (MoE) architecture faces in large language models (LLMs) and proposes a solution. Specifically, it tackles two core issues:

1. **High Memory Demand and Dynamic Sparse Activation**: Although traditional MoE architectures reduce computational cost by sparsely activating only a portion of the experts, their memory demand remains very high. In addition, because expert activation is determined dynamically from the input data, deployment costs rise and GPU resource utilization stays low.
2. **Performance Overhead from Expert Parameter Migration**: Previous methods offload expert parameters from GPU memory to CPU memory or solid-state drives (SSDs) to reduce the number of GPUs required and improve GPU memory utilization. However, this approach must migrate the activated experts from the CPU to the GPU at runtime, which introduces significant latency and degrades quality of service (QoS).

To address these issues, the paper proposes the **Pre-gated Mixture-of-Experts System (Pre-gated MoE)**, an algorithm-system co-design that reduces GPU memory consumption while maintaining high performance. The key contributions of Pre-gated MoE include:

- **Pre-gate Function**: In traditional MoE, the gating function selects the experts to be activated in the current MoE block. In Pre-gated MoE, the pre-gate function is trained to pre-select the experts to be activated in the next MoE block. This design eliminates the sequential dependency between expert selection and expert execution, so the system can migrate the experts needed by the next MoE block from the CPU to the GPU while the current MoE block is executing.
- **System-level Design**: Pre-gated MoE stores expert parameters in CPU memory and uses the pre-gate function to migrate the experts for the next MoE block to the GPU in advance, while the current MoE block executes, thereby reducing the performance impact of expert migration.

In summary, Pre-gated MoE aims to overcome the limitations of existing MoE architectures, providing a more efficient and cost-effective way to deploy large-scale language models.
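
The overlap between expert selection and expert execution described above can be illustrated with a small toy sketch. This is not the paper's implementation: the names `cpu_experts`, `fetch_to_gpu`, `gate`, and `run_pregated` are hypothetical, each "expert" is a single scalar weight, routing is top-1, and a background thread stands in for an asynchronous CPU-to-GPU copy. It only shows the scheduling idea: while block `i` executes, the expert that block `i + 1` will need is already being fetched.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy setup: 3 MoE blocks, 4 experts per block, top-1 routing.
NUM_BLOCKS, NUM_EXPERTS = 3, 4

# "CPU memory": all experts of all blocks live here.
# Each toy expert is a scalar weight that multiplies its input.
cpu_experts = {
    (b, e): float(e + 1) for b in range(NUM_BLOCKS) for e in range(NUM_EXPERTS)
}


def fetch_to_gpu(block, expert):
    """Stand-in for a CPU-to-GPU copy; returns the expert's weight."""
    return cpu_experts[(block, expert)]


def gate(block, x):
    """Toy deterministic router (an assumption, not a trained gate)."""
    return int(abs(x) * 7 + block) % NUM_EXPERTS


def run_pregated(x):
    """Run all blocks, prefetching the next block's expert during execution."""
    trace = []
    with ThreadPoolExecutor(max_workers=1) as copier:
        # The first block has no predecessor, so its expert is fetched on demand.
        expert = gate(0, x)
        weight = fetch_to_gpu(0, expert)
        for block in range(NUM_BLOCKS):
            pending = None
            if block + 1 < NUM_BLOCKS:
                # Pre-gate: pick the NEXT block's expert from the CURRENT
                # block's input, then start the copy in the background.
                next_expert = gate(block + 1, x)
                pending = copier.submit(fetch_to_gpu, block + 1, next_expert)
            # Execute the current block's expert while the copy is in flight.
            trace.append((block, expert))
            x = x * weight
            if pending is not None:
                # The prefetched expert becomes the active one for the next block.
                expert, weight = next_expert, pending.result()
    return x, trace
```

Calling `run_pregated(1.0)` returns the final activation together with a trace of which expert ran in each block. In a conventional offloading design, the call to `gate(block + 1, ...)` could only happen after block `block` had finished, so the copy would sit on the critical path instead of overlapping with execution.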

Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference

MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs

MPMoE: Memory Efficient MoE for Pre-Trained Models With Adaptive Pipeline Parallelism

MPipeMoE: Memory Efficient MoE for Pre-trained Models with Adaptive Pipeline Parallelism

AdapMoE: Adaptive Sensitivity-based Expert Gating and Management for Efficient MoE Inference

ProMoE: Fast MoE-based LLM Serving using Proactive Caching

ExpertFlow: Optimized Expert Activation and Token Allocation for Efficient Mixture-of-Experts Inference

MoNDE: Mixture of Near-Data Experts for Large-Scale Sparse Models

Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference

Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference

SwapMoE: Efficient Memory-Constrained Serving of Large Sparse MoE Models Via Dynamic Expert Pruning and Swapping

EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models

EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference

SwapMoE: Serving Off-the-shelf MoE-based Large Language Models with Tunable Memory Budget

FASTERMOE: Modeling and Optimizing Training of Large-Scale Dynamic Pre-Trained Models

MC-MoE: Mixture Compressor for Mixture-of-Experts LLMs Gains More

FasterMoE

HetuMoE: An Efficient Trillion-scale Mixture-of-Expert Distributed Training System

Adaptive Gating in Mixture-of-Experts based Language Models

HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference

SiDA-MoE: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models