Abstract: The promising applications of large language models are often constrained by the limited GPU memory capacity available on edge devices. Mixture-of-Experts (MoE) models help mitigate this issue by activating only a subset of the model's parameters during computation, allowing the unused parameters to be offloaded to host memory and reducing overall GPU memory demand. However, existing cache-based offloading solutions handle cache misses reactively and significantly impact system performance. In this paper, we propose ProMoE, a novel proactive caching system that leverages intermediate model results to predict subsequent parameter usage. By proactively fetching experts in advance, ProMoE removes the loading time from the critical path and diminishes the performance overhead of offloading. Our evaluations demonstrate that ProMoE achieves an average speedup of 2.13x and 2.84x in the prefill and decode stages, respectively, compared to existing offloading solutions.

What problem does this paper attempt to address?

This paper addresses the performance penalty of running large language models (LLMs) on edge devices with limited GPU memory. Mixture-of-Experts (MoE) models ease the memory pressure by activating only a subset of their parameters, but existing cache-based offloading solutions handle cache misses reactively, which significantly degrades system performance.

### Paper Background

LLMs have broad application prospects, but those applications are often constrained by the limited GPU memory available on edge devices. An MoE model activates only part of its parameters during computation, so the unused parameters can be offloaded to host memory, which reduces overall GPU memory demand. However, existing cache-based offloading solutions respond to a cache miss only after it occurs, so the expert must be loaded while inference waits.

### Research Objectives

To address this, the paper proposes ProMoE, a proactive caching system. ProMoE uses intermediate model results to predict which parameters will be needed next and fetches the corresponding experts in advance, removing the loading time from the critical path and reducing the performance overhead of offloading.

### Main Contributions

1. **A new prediction quality metric, GOODPRED**: It accounts for both prediction accuracy and lead time.
2. **A learning-based predictor**: It uses a sliding-window method and historical information to make accurate predictions.
3. **A sophisticated prefetching mechanism**: It coordinates prefetching with inference to maximize their overlap, which reduces inference latency and improves GPU utilization.
4. **An integrated implementation**: ProMoE is integrated into mainstream LLM frameworks, and its effectiveness and efficiency relative to existing solutions are demonstrated.

### Core Idea of the Solution

ProMoE predicts and prefetches the required expert parameters ahead of time, so that data transfer is no longer on the critical path of inference. This reduces latency and increases GPU utilization. ProMoE has two main components: a predictor and a prefetcher. The predictor periodically predicts which experts will be selected, and the prefetcher loads those experts into the GPU cache according to the predictions.

### Conclusion

Through proactive caching, ProMoE mitigates the offloading bottleneck that MoE models face on edge devices and improves inference speed and GPU utilization. The reported results show average speedups of 2.13x in the prefill stage and 2.84x in the decode stage compared to existing offloading solutions.
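The difference between reactive and proactive caching can be illustrated with a toy latency model. The sketch below is not ProMoE's implementation: the `ExpertCache` class, the `simulate` function, the one-expert-per-step trace, the LRU policy, and the timing numbers are all assumptions made for illustration, and predictions are supplied directly rather than produced by the paper's learned predictor. It only shows how a correct prediction lets an expert load overlap with the previous step's compute instead of stalling inference.

```python
from collections import OrderedDict


class ExpertCache:
    """Hypothetical LRU set of expert IDs resident in GPU memory."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = OrderedDict()

    def __contains__(self, expert):
        return expert in self.slots

    def touch(self, expert):
        """Insert or refresh an expert, evicting the least recently used if full."""
        if expert in self.slots:
            self.slots.move_to_end(expert)
            return
        if len(self.slots) >= self.capacity:
            self.slots.popitem(last=False)
        self.slots[expert] = True


def simulate(trace, predictions, capacity, compute_ms, load_ms, proactive):
    """Total latency (ms) for a trace with one expert per step.

    predictions[i] is the guess for trace[i + 1], issued at the start of
    step i so that its load runs in parallel with step i's compute.
    Simplification: a mispredicted prefetch is simply discarded.
    """
    cache = ExpertCache(capacity)
    total = 0.0
    prefetched = None  # expert whose load overlapped the previous step
    for step, expert in enumerate(trace):
        if expert in cache:
            stall = 0.0                              # cache hit
        elif proactive and prefetched == expert:
            stall = max(0.0, load_ms - compute_ms)   # load mostly hidden
        else:
            stall = load_ms                          # miss on the critical path
        cache.touch(expert)
        total += stall + compute_ms
        prefetched = None
        if proactive and step < len(predictions):
            guess = predictions[step]
            if guess not in cache:
                prefetched = guess
    return total


trace = [0, 1, 2, 3, 0, 1, 2, 3]  # cycles through 4 experts; cache holds 2
reactive_ms = simulate(trace, [], capacity=2, compute_ms=10, load_ms=8,
                       proactive=False)
proactive_ms = simulate(trace, trace[1:], capacity=2, compute_ms=10, load_ms=8,
                        proactive=True)
print(reactive_ms)   # 144.0: every step misses and stalls for the full load
print(proactive_ms)  # 88.0: only the first load is on the critical path
```

In this model a prefetch only helps when the prediction is both correct and issued early enough: if `load_ms` exceeds `compute_ms`, part of the load still stalls inference, which is the kind of trade-off between accuracy and lead time that the summary attributes to the GOODPRED metric.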

ProMoE: Fast MoE-based LLM Serving using Proactive Caching

Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference

MoNDE: Mixture of Near-Data Experts for Large-Scale Sparse Models

MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs

MoE-Infinity: Offloading-Efficient MoE Model Serving

Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference

Mixture of Cache-Conditional Experts for Efficient Mobile Device Inference

AdapMoE: Adaptive Sensitivity-based Expert Gating and Management for Efficient MoE Inference

SwapMoE: Efficient Memory-Constrained Serving of Large Sparse MoE Models Via Dynamic Expert Pruning and Swapping

EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models

HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference

FasterMoE: Modeling and Optimizing Training of Large-Scale Dynamic Pre-Trained Models

MPMoE: Memory Efficient MoE for Pre-Trained Models With Adaptive Pipeline Parallelism

SwapMoE: Serving Off-the-shelf MoE-based Large Language Models with Tunable Memory Budget

ExpertFlow: Optimized Expert Activation and Token Allocation for Efficient Mixture-of-Experts Inference

MPipeMoE: Memory Efficient MoE for Pre-trained Models with Adaptive Pipeline Parallelism

SiDA-MoE: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models

Pro-Prophet: A Systematic Load Balancing Method for Efficient Parallel Training of Large-scale MoE Models

MC-MoE: Mixture Compressor for Mixture-of-Experts LLMs Gains More

LSH-MoE: Communication-efficient MoE Training via Locality-Sensitive Hashing

APTMoE: Affinity-Aware Pipeline Tuning for MoE Models on Bandwidth-Constrained GPU Nodes