Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference

Ranggi Hwang, Jianyu Wei, Shijie Cao, Changho Hwang, Xiaohu Tang, Ting Cao, Mao Yang
2024-04-27
Abstract: Large language models (LLMs) based on transformers have made significant strides in recent years, the success of which is driven by scaling up their model size. Despite their high algorithmic performance, the computational and memory requirements of LLMs present unprecedented challenges. To tackle the high compute requirements of LLMs, the Mixture-of-Experts (MoE) architecture was introduced which is able to scale its model size without proportionally scaling up its computational requirements. Unfortunately, MoE's high memory demands and dynamic activation of sparse experts restrict its applicability to real-world problems. Previous solutions that offload MoE's memory-hungry expert parameters to CPU memory fall short because the latency to migrate activated experts from CPU to GPU incurs high performance overhead. Our proposed Pre-gated MoE system effectively tackles the compute and memory challenges of conventional MoE architectures using our algorithm-system co-design. Pre-gated MoE employs our novel pre-gating function which alleviates the dynamic nature of sparse expert activation, allowing our proposed system to address the large memory footprint of MoEs while also achieving high performance. We demonstrate that Pre-gated MoE is able to improve performance, reduce GPU memory consumption, while also maintaining the same level of model quality. These features allow our Pre-gated MoE system to cost-effectively deploy large-scale LLMs using just a single GPU with high performance.
Machine Learning, Artificial Intelligence, Hardware Architecture
What problem does this paper attempt to address?
The paper addresses the challenges that the Mixture-of-Experts (MoE) architecture faces when serving large language models (LLMs), and proposes a solution. It tackles two core issues:

1. **High Memory Demand and Dynamic Sparse Activation**: Although traditional MoE architectures reduce computational cost by activating only a few experts per input, their memory demand remains very high. Because the experts to activate are chosen dynamically from the input data, every expert must be available in case it is selected, which raises deployment cost and leaves GPU resources underutilized.
2. **Performance Overhead from Expert Parameter Migration**: Previous methods offload expert parameters from GPU memory to CPU memory or solid-state drives (SSDs) to reduce the number of GPUs required and improve GPU memory utilization. However, the activated experts must then be migrated from the CPU to the GPU at runtime, which introduces significant latency and degrades quality of service (QoS).

To address these issues, the paper proposes the **Pre-gated Mixture-of-Experts System (Pre-gated MoE)**, an algorithm-system co-design that reduces GPU memory consumption while maintaining high performance. The key contributions of Pre-gated MoE include:

- **Pre-gate Function**: In traditional MoE, the gating function selects the experts to activate in the current MoE block. In Pre-gated MoE, the pre-gate function is trained to select the experts to activate in the next MoE block. This removes the sequential dependency between expert selection and expert execution within a block, so the system can migrate the next block's experts from the CPU to the GPU while the current block executes.
- **System-level Design**: Pre-gated MoE stores expert parameters in CPU memory and uses the pre-gate function's output to migrate the next MoE block's experts to the GPU ahead of time, overlapping that transfer with execution of the current block and thereby reducing the performance impact of expert migration.

In summary, Pre-gated MoE aims to overcome the limitations of existing MoE architectures and to provide a more efficient, cost-effective way to deploy large-scale language models.
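
The overlap described above can be illustrated with a rough latency model. This is a minimal sketch under stated assumptions, not the paper's implementation: the `BlockCost` fields, the two functions, and all timing numbers are hypothetical. It assumes a transfer fully overlaps with compute when both run concurrently, and that the first block's migration cannot be hidden because no earlier block exists to prefetch behind.

```python
from dataclasses import dataclass


@dataclass
class BlockCost:
    """Hypothetical per-MoE-block costs in milliseconds (illustrative only)."""
    gate_ms: float     # evaluating the gate (or pre-gate) function
    migrate_ms: float  # copying the selected experts from CPU to GPU
    execute_ms: float  # running the selected experts on the GPU


def conventional_latency(blocks):
    """On-demand offloading: each block selects, migrates, then executes in sequence."""
    return sum(b.gate_ms + b.migrate_ms + b.execute_ms for b in blocks)


def pregated_latency(blocks):
    """Pre-gating: block i selects the experts of block i+1, so the migration
    for block i+1 runs concurrently with the execution of block i."""
    if not blocks:
        return 0.0
    # Assumption: the first block's selection and migration stay on the critical path.
    total = blocks[0].gate_ms + blocks[0].migrate_ms
    for i, block in enumerate(blocks):
        if i + 1 < len(blocks):
            nxt = blocks[i + 1]
            total += nxt.gate_ms  # pre-gate for the next block
            # Execution and the next block's transfer overlap; the slower one dominates.
            total += max(block.execute_ms, nxt.migrate_ms)
        else:
            total += block.execute_ms  # last block has nothing left to prefetch
    return total


# Four identical blocks with made-up costs.
blocks = [BlockCost(gate_ms=0.5, migrate_ms=2.0, execute_ms=3.0) for _ in range(4)]
print("conventional:", conventional_latency(blocks))
print("pre-gated:   ", pregated_latency(blocks))
```

In this model the benefit depends on the ratio of migration time to execution time: when migration is shorter than execution it is hidden entirely after the first block, and when it is longer, only the execution time is hidden behind it.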