ProMoE: Fast MoE-based LLM Serving using Proactive Caching

Xiaoniu Song, Zihang Zhong, Rong Chen
2024-10-29
Abstract: The promising applications of large language models are often constrained by the limited GPU memory capacity available on edge devices. Mixture-of-Experts (MoE) models help mitigate this issue by activating only a subset of the model's parameters during computation, allowing the unused parameters to be offloaded to host memory and reducing overall GPU memory demand. However, existing cache-based offloading solutions handle cache misses reactively and significantly impact system performance. In this paper, we propose ProMoE, a novel proactive caching system that leverages intermediate model results to predict subsequent parameter usage. By proactively fetching experts in advance, ProMoE removes the loading time from the critical path and diminishes the performance overhead of offloading. Our evaluations demonstrate that ProMoE achieves an average speedup of 2.13x and 2.84x in the prefill and decode stages respectively, compared to existing offloading solutions.
Distributed, Parallel, and Cluster Computing; Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the performance cost of running large language models (LLMs) on edge devices with limited GPU memory. Mixture-of-Experts (MoE) models ease the memory pressure by activating only a subset of their parameters, but existing cache-based offloading solutions handle cache misses reactively, which significantly degrades system performance.

### Paper Background

LLMs have broad application prospects, but these applications are often constrained by the limited GPU memory on edge devices. MoE models reduce overall GPU memory demand by activating only part of the model's parameters during computation, so the unused parameters can be offloaded to host memory. However, existing cache-based offloading solutions respond to cache misses only after they occur, which significantly degrades system performance.

### Research Objectives

To address these problems, the paper proposes ProMoE, a new proactive caching system. ProMoE uses intermediate model results to predict subsequent parameter usage and fetches expert parameters in advance, removing the loading time from the critical path and reducing the performance overhead of offloading.

### Main Contributions

1. **A new prediction quality metric, GOODPRED**: It jointly considers prediction accuracy and lead time.
2. **A learning-based predictor**: It uses a sliding-window method and historical information to make accurate predictions.
3. **A sophisticated prefetching mechanism**: It coordinates prefetching with inference and maximizes the overlap between the two, reducing inference latency and improving utilization.
4. **An integrated implementation**: ProMoE is integrated into mainstream LLM frameworks, and its effectiveness and efficiency relative to existing solutions are demonstrated.

### Core Idea of the Solution

Through proactive caching, ProMoE predicts and prefetches the required expert parameters ahead of time, so that data transfer is no longer on the critical path of inference. This reduces latency and increases GPU utilization. ProMoE has two main components: a predictor and a prefetcher. The predictor periodically predicts which experts will be selected, and the prefetcher preloads those experts into the GPU cache according to the predictions.

### Conclusion

Through proactive caching, ProMoE addresses the performance bottleneck of running MoE models on edge devices and significantly improves inference speed and GPU utilization. The experimental results show that ProMoE achieves average speedups of 2.13x and 2.84x in the prefill and decode stages, respectively, compared to existing offloading solutions.
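
The difference between reactive and proactive caching described above can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not ProMoE's implementation: `ExpertCache`, `count_stalls`, and `next_in_cycle` are hypothetical names, the cache uses plain LRU eviction, the predictor is a hand-written stand-in for the paper's learned predictor, and every prefetch is assumed to overlap fully with the current step's computation.

```python
from collections import OrderedDict


class ExpertCache:
    """LRU set of expert IDs resident in (simulated) GPU memory."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = OrderedDict()

    def __contains__(self, expert):
        return expert in self.slots

    def load(self, expert):
        """Make `expert` resident, evicting the least recently used one if full."""
        if expert in self.slots:
            self.slots.move_to_end(expert)
            return
        if len(self.slots) >= self.capacity:
            self.slots.popitem(last=False)
        self.slots[expert] = True


def count_stalls(trace, capacity, predictor=None):
    """Count loads that land on the critical path for a trace of expert IDs.

    With predictor=None the cache is purely reactive. Otherwise the predicted
    next expert is prefetched after each step; the sketch assumes this transfer
    is hidden behind the current step's compute.
    """
    cache = ExpertCache(capacity)
    stalls = 0
    for step, expert in enumerate(trace):
        if expert not in cache:
            stalls += 1  # cache miss: inference must wait for the load
        cache.load(expert)
        if predictor is not None and step + 1 < len(trace):
            cache.load(predictor(trace[: step + 1]))  # proactive prefetch
    return stalls


def next_in_cycle(history, num_experts=4):
    """Toy predictor (assumption): experts are visited in cyclic order."""
    return (history[-1] + 1) % num_experts


trace = [0, 1, 2, 3] * 2
reactive_stalls = count_stalls(trace, capacity=2)
proactive_stalls = count_stalls(trace, capacity=2, predictor=next_in_cycle)
print(reactive_stalls, proactive_stalls)  # prints: 8 1
```

In this toy trace the working set (four experts) exceeds the cache (two slots), so the reactive cache misses on every access, while the prefetching variant stalls only on the first one. A real predictor is imperfect and real transfers take time, which is why a metric that weighs both accuracy and lead time matters.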