Abstract: Modern internet services such as chatbots, search engines, and online advertising rely on large-scale deep neural networks (DNNs), which in turn call for distributed training and inference over heterogeneous computing systems. Mixture-of-Experts (MoE) is one of the most common strategies for lowering training cost relative to the overall size of the model and data, using gating and parallelism in a divide-and-conquer fashion. Although DeepSpeed has made efforts to carry out large-scale MoE training over heterogeneous infrastructures, the efficiency of training and inference can be further improved in several system aspects, including load balancing, communication/computation efficiency, and memory footprint limits. In this work, we present MoESys, a novel system that boosts efficiency in both large-scale training and inference. Specifically, for training, MoESys adopts an Elastic MoE training strategy with 2D prefetch and Fusion communication over Hierarchical storage to achieve efficient parallelism. For scalable inference on a single node, especially when the model is larger than GPU memory, MoESys organizes CPU and GPU memory jointly into a ring of sections to load the model, and executes the computation tasks across the memory sections in a round-robin manner. We carried out extensive experiments to evaluate MoESys, in which MoESys successfully trained a Unified Feature Optimization (UFO) model with a Sparsely-Gated Mixture-of-Experts model of 12B parameters in 8 days on 48 A100 GPU cards. Compared with the state of the art, MoESys outperformed DeepSpeed with 33% higher throughput (tokens per second) in training and 13% higher throughput in inference in general. In particular, under unbalanced MoE tasks such as UFO, MoESys achieved 64% higher throughput with an 18% lower memory footprint.
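
The ring-of-sections inference idea described in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the MoESys implementation: the abstract gives no API, so the function name `ring_inference`, the `num_slots` parameter, and the use of plain Python callables as stand-ins for model layers are all hypothetical. Copies between CPU and GPU memory are simulated by bookkeeping only; in a real system each load would be an asynchronous transfer that overlaps with computation.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical stand-in for a model layer: a function from float to float.
Layer = Callable[[float], float]


def ring_inference(
    layers: List[Layer], num_slots: int, x: float
) -> Tuple[float, List[Tuple[int, int]]]:
    """Run layers in order while only num_slots of them are 'resident'.

    Returns the final output and the (layer_index, slot) load order.
    """
    if num_slots < 1:
        raise ValueError("num_slots must be at least 1")

    # ring[s] is the index of the layer currently held in slot s.
    ring: List[Optional[int]] = [None] * num_slots
    loads: List[Tuple[int, int]] = []

    def load(layer_idx: int) -> None:
        # Simulated copy of one layer into its round-robin slot.
        slot = layer_idx % num_slots
        ring[slot] = layer_idx
        loads.append((layer_idx, slot))

    # Warm-up: fill the ring with the first layers.
    for i in range(min(num_slots, len(layers))):
        load(i)

    for i, layer in enumerate(layers):
        slot = i % num_slots
        if ring[slot] != i:
            raise RuntimeError("layer must be resident before it runs")
        x = layer(x)
        # The slot is free again, so reuse it for the layer num_slots ahead.
        nxt = i + num_slots
        if nxt < len(layers):
            load(nxt)

    return x, loads
```

With five layers and two slots, for example, the slots are reused in the order 0, 1, 0, 1, 0, so the memory needed for resident layers is bounded by the ring size rather than by the model depth.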

ScheMoE: An Extensible Mixture-of-Experts Distributed Training System with Tasks Scheduling

HetuMoE: An Efficient Trillion-scale Mixture-of-Expert Distributed Training System

MoESys: A Distributed and Efficient Mixture-of-Experts Training and Inference System for Internet Services

FastMoE: A Fast Mixture-of-Expert Training System

FasterMoE: Modeling and Optimizing Training of Large-Scale Dynamic Pre-Trained Models

LSH-MoE: Communication-efficient MoE Training via Locality-Sensitive Hashing

Janus: A Unified Distributed Training Framework for Sparse Mixture-of-Experts Models

Pro-Prophet: A Systematic Load Balancing Method for Efficient Parallel Training of Large-scale MoE Models

Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference

HEXA-MoE: Efficient and Heterogeneous-aware MoE Acceleration with ZERO Computation Redundancy

Pipeline MoE: A Flexible MoE Implementation with Pipeline Parallelism

FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement

Shortcut-connected Expert Parallelism for Accelerating Mixture-of-Experts

Communication-Efficient Sparsely-Activated Model Training via Sequence Migration and Token Condensation

MPMoE: Memory Efficient MoE for Pre-Trained Models With Adaptive Pipeline Parallelism

SmartMoE: Efficiently Training Sparsely-Activated Models through Combining Offline and Online Parallelization

TA-MoE: Topology-Aware Large Scale Mixture-of-Expert Training

MPipeMoE: Memory Efficient MoE for Pre-trained Models with Adaptive Pipeline Parallelism

HMoE: Heterogeneous Mixture of Experts for Language Modeling

Parm: Efficient Training of Large Sparsely-Activated Models with Dedicated Schedules

EvoMoE: An Evolutional Mixture-of-Experts Training Framework via Dense-To-Sparse Gate