Abstract: Mixture-of-experts (MoE) is becoming popular due to its success in improving model quality, especially in Transformers. By routing tokens with a sparse gate to a few experts (i.e., small pieces of the full model), MoE can scale the number of model parameters to a very large size while keeping the computation cost roughly constant. Most existing work simply initializes experts randomly, fixes a gating strategy (e.g., Top-k), and trains the model from scratch in an ad hoc way. We identify that these MoE models suffer from immature experts and an unstable sparse gate, both of which harm convergence. In this paper, we propose an efficient end-to-end MoE training framework called EvoMoE. EvoMoE starts by training a single expert and gradually evolves into a large, sparse MoE structure. EvoMoE consists of two phases: the expert-diversify phase, which trains the base expert for a while and then spawns multiple diverse experts from it, and the gate-sparsify phase, which learns an adaptive sparse gate and activates a dynamic number of experts. EvoMoE naturally decouples the joint learning of the experts and the sparse gate, and focuses on learning basic knowledge with a single expert in the early training stage. It then diversifies the experts and continues to train the MoE with a novel Dense-to-Sparse gate (DTS-Gate). Specifically, instead of using a permanently sparse gate, DTS-Gate begins as a dense gate that routes tokens to all experts, then gradually and adaptively becomes sparser, routing to fewer experts. We evaluate EvoMoE on three popular models and tasks: RoBERTa for masked language modeling, GPT for language modeling, and Transformer for machine translation. The results show that EvoMoE outperforms existing baselines, including Switch, BASE Layer, Hash Layer and StableMoE.
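
The dense-to-sparse behaviour described above can be illustrated with a small sketch. The abstract does not say how DTS-Gate becomes sparser, so the mechanism here is an assumption chosen for illustration: a softmax whose temperature is annealed over training, combined with a weight threshold below which experts are dropped. The names `temperature_at` and `dense_to_sparse_gate`, the linear schedule, and all numeric defaults are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of gate logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def temperature_at(step, total_steps, t_start=2.0, t_end=0.3):
    """Hypothetical schedule: anneal linearly from t_start to t_end."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return t_start + frac * (t_end - t_start)


def dense_to_sparse_gate(logits, temperature, threshold=0.05):
    """Return {expert_index: weight} for the experts a token is routed to.

    Assumption: an expert stays active while its softmax weight is at
    least `threshold`. A high temperature flattens the distribution, so
    every expert passes (dense gate); a low temperature sharpens it, so
    few pass (sparse gate). The number of active experts is therefore
    dynamic rather than a fixed k.
    """
    probs = softmax(logits, temperature)
    top = max(range(len(probs)), key=probs.__getitem__)
    # Always keep the best expert so that no token is left unrouted.
    kept = [i for i, p in enumerate(probs) if p >= threshold or i == top]
    norm = sum(probs[i] for i in kept)
    # Renormalise so the weights of the active experts sum to 1.
    return {i: probs[i] / norm for i in kept}


if __name__ == "__main__":
    token_logits = [2.0, 1.0, 0.5, -1.0]  # made-up gate logits for one token
    for step in (0, 50, 100):
        temp = temperature_at(step, total_steps=100)
        routing = dense_to_sparse_gate(token_logits, temp)
        print(step, round(temp, 2), sorted(routing))
```

With these made-up logits, the token is routed to progressively fewer experts as the temperature falls. A real gate would learn the logits jointly with the experts, and the actual sparsification rule may differ from this one.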

Pipeline MoE: A Flexible MoE Implementation with Pipeline Parallelism

PipeMoE: Accelerating Mixture-of-Experts through Adaptive Pipelining

EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference

MPipeMoE: Memory Efficient MoE for Pre-trained Models with Adaptive Pipeline Parallelism

MPMoE: Memory Efficient MoE for Pre-Trained Models With Adaptive Pipeline Parallelism

Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference

APTMoE: Affinity-Aware Pipeline Tuning for MoE Models on Bandwidth-Constrained GPU Nodes

Shortcut-connected Expert Parallelism for Accelerating Mixture-of-Experts

Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference

ScheMoE: An Extensible Mixture-of-Experts Distributed Training System with Tasks Scheduling

FasterMoE

LocMoE: A Low-Overhead MoE for Large Language Model Training

AdapMoE: Adaptive Sensitivity-based Expert Gating and Management for Efficient MoE Inference

LSH-MoE: Communication-efficient MoE Training via Locality-Sensitive Hashing

HyperMoE: Towards Better Mixture of Experts via Transferring Among Experts

FastMoE: A Fast Mixture-of-Expert Training System

HEXA-MoE: Efficient and Heterogeneous-aware MoE Acceleration with ZERO Computation Redundancy

Lancet: Accelerating Mixture-of-Experts Training via Whole Graph Computation-Communication Overlapping

HetuMoE: An Efficient Trillion-scale Mixture-of-Expert Distributed Training System

EvoMoE: An Evolutional Mixture-of-Experts Training Framework via Dense-To-Sparse Gate