Abstract: While modern internet services, such as chatbots, search engines, and online advertising, demand the use of large-scale deep neural networks (DNNs), distributed training and inference over heterogeneous computing systems are desired to facilitate these DNN models. Mixture-of-Experts (MoE) is one of the most common strategies to lower the cost of training subject to the overall size of models/data through gating and parallelism in a divide-and-conquer fashion. While DeepSpeed has made efforts in carrying out large-scale MoE training over heterogeneous infrastructures, the efficiency of training and inference could be further improved from several system aspects, including load balancing, communication/computation efficiency, and memory footprint limits. In this work, we present a novel MoESys that boosts efficiency in both large-scale training and inference. Specifically, in the training procedure, the proposed MoESys adopts an Elastic MoE training strategy with 2D prefetch and Fusion communication over Hierarchical storage, so as to enjoy efficient parallelism. For scalable inference in a single node, especially when the model size is larger than GPU memory, MoESys builds the CPU-GPU memory jointly into a ring of sections to load the model, and executes the computation tasks across the memory sections in a round-robin manner for efficient inference. We carried out extensive experiments to evaluate MoESys, where MoESys successfully trains a Unified Feature Optimization (UFO) model with a Sparsely-Gated Mixture-of-Experts model of 12B parameters in 8 days on 48 A100 GPU cards. The comparison against the state-of-the-art shows that MoESys outperformed DeepSpeed with 33% higher throughput (tokens per second) in training and 13% higher throughput in inference in general. In particular, under unbalanced MoE tasks, e.g., UFO, MoESys achieved 64% higher throughput with an 18% lower memory footprint.

What problem does this paper attempt to address?

This paper aims to improve the efficiency and resource utilization of training and inference for the deep neural network (DNN) models used in large-scale Internet services. Specifically:

1. **Computing resource utilization problem**: In a Mixture-of-Experts (MoE) model, the computing cost grows with the number of experts, and uneven expert allocation impairs training performance. The paper notes that methods such as auxiliary losses, random selection of experts, and noise in routing are used to mitigate this, but these methods mainly focus on scheduling rather than on the computation itself and require a large amount of CPU resources.
2. **Communication efficiency problem**: The unbalanced routing strategy in MoE models causes data processing to progress at different rates in multi-task training, which increases waiting time. For example, the Switch Transformer model needs to perform four AlltoAll communications in each MoE layer, which can degrade performance when the network topology is unknown.
3. **Storage limitation problem**: The size of an MoE model is limited by the memory capacity of the device. Different storage media (such as GPU high-bandwidth memory (HBM), CPU memory, and solid-state drives (SSDs)) have very different I/O latencies, so efficient storage management is required to support sparsely activated training.

To solve the above problems, the paper introduces MoESys, a new system framework for distributed training and inference of MoE models. The main contributions of MoESys include:

- Designing a new distributed framework that can scale to trillion-parameter MoE models, making full use of HBM, CPU memory, and even SSDs in the cluster to break through the memory wall and achieve efficient training scheduling. In particular, MoESys introduces 2D prefetch scheduling and fused communication, further improving the efficiency of heterogeneous storage systems.
- Proposing a new inference method based on ring-shaped memory, which combines computation and communication as much as possible through dynamic graph scheduling to accelerate inference for large-scale MoE models without using additional machines.
- Designing a variety of effective training strategies for natural language processing (NLP) and computer vision (CV) tasks, aiming to expand the scale of multi-task learning without increasing memory requirements. These strategies include load balancing, embedding partitioning, and resource-aware communication.
- Conducting comprehensive industrial-level experiments that show a significant improvement in training and inference performance with MoESys, providing a practical reference for the future development of training and inference of large-scale MoE models.

Through these innovations, MoESys aims to improve the training and inference efficiency of large-scale MoE models in Internet services while reducing resource consumption.
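The ring-shaped memory idea described above (a fixed number of memory sections that are reused in round-robin order while the model is larger than device memory) can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the MoESys implementation: the class `RingMemory`, its `num_slots` parameter, and the `loads` counter are hypothetical names, layers are plain Python callables standing in for expert or transformer layers, and host-to-device copies are simulated by bookkeeping. A real system would issue the refill copy asynchronously so that it overlaps with computation; here the steps run one after another for clarity.

```python
from typing import Callable, List, Optional

Layer = Callable[[float], float]


class RingMemory:
    """Hypothetical ring of `num_slots` device-memory sections (illustration only)."""

    def __init__(self, layers: List[Layer], num_slots: int) -> None:
        if num_slots < 1:
            raise ValueError("need at least one slot")
        self.layers = layers  # the full model, assumed to live in CPU memory
        self.num_slots = min(num_slots, len(layers))
        # slots[s] holds the index of the layer currently resident in section s
        self.slots: List[Optional[int]] = [None] * self.num_slots
        self.loads = 0  # number of simulated host-to-device copies

    def _load(self, layer_idx: int) -> None:
        """Simulate copying one layer into its ring section, replacing the old occupant."""
        slot = layer_idx % self.num_slots
        if self.slots[slot] != layer_idx:
            self.slots[slot] = layer_idx
            self.loads += 1

    def forward(self, x: float) -> float:
        n = len(self.layers)
        # Warm-up: fill the ring with the first `num_slots` layers.
        for i in range(self.num_slots):
            self._load(i)
        for i in range(n):
            slot = i % self.num_slots
            assert self.slots[slot] == i, "layer must be resident before it runs"
            x = self.layers[i](x)
            # The section just used is now free: refill it with the layer
            # that will need this section next (round-robin order).
            nxt = i + self.num_slots
            if nxt < n:
                self._load(nxt)
        return x


# Usage: six toy layers, but only two sections of "device memory".
toy_layers = [lambda v, k=k: v + k for k in range(6)]
ring = RingMemory(toy_layers, num_slots=2)
result = ring.forward(0.0)
```

In this toy setup every layer is copied into the ring once per forward pass, while at most `num_slots` layers are resident at any time, which is the property that lets a model larger than device memory run on a single node.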

MoESys: A Distributed and Efficient Mixture-of-Experts Training and Inference System for Internet Services

DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale

HetuMoE: An Efficient Trillion-scale Mixture-of-Expert Distributed Training System

FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement

Janus: A Unified Distributed Training Framework for Sparse Mixture-of-Experts Models

FastMoE: A Fast Mixture-of-Expert Training System

Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference

SiDA-MoE: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models

SmartMoE: Efficiently Training Sparsely-Activated Models through Combining Offline and Online Parallelization

MoE-Infinity: Offloading-Efficient MoE Model Serving

LSH-MoE: Communication-efficient MoE Training via Locality-Sensitive Hashing

MoNTA: Accelerating Mixture-of-Experts Training with Network-Traffic-Aware Parallel Optimization

FasterMoE

FASTERMOE: Modeling and Optimizing Training of Large-Scale Dynamic Pre-Trained Models

EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference

AdapMoE: Adaptive Sensitivity-based Expert Gating and Management for Efficient MoE Inference

HEXA-MoE: Efficient and Heterogeneous-aware MoE Acceleration with ZERO Computation Redundancy

Pro-Prophet: A Systematic Load Balancing Method for Efficient Parallel Training of Large-scale MoE Models

MPMoE: Memory Efficient MoE for Pre-Trained Models With Adaptive Pipeline Parallelism

Shortcut-connected Expert Parallelism for Accelerating Mixture-of-Experts

MPipeMoE: Memory Efficient MoE for Pre-trained Models with Adaptive Pipeline Parallelism