Abstract: The current trend in deep learning is to scale models to extremely large sizes in order to increase their accuracy. Mixture-of-Experts (MoE) is the most popular pre-trained model that makes it feasible to train models with more than a trillion parameters. Thanks to the dynamic activation of experts, i.e., shallow layers specialized in certain domains, it allows sparse training of bigger models, removing the linear dependence of computation on model size. However, unlike traditional deep learning models, it poses huge challenges to the efficiency of training systems, including dynamic load imbalance, an inefficient synchronous execution mode, and congested all-to-all communication. To address these challenges, we first propose a performance model that can both accurately predict the latency of the different operations of a specific training task and intuitively analyze its end-to-end performance via a novel roofline-like model. Then, guided by this model, we devise a dynamic shadowing approach to cope with load imbalance, and a smart fine-grained schedule that splits different operations and executes them concurrently. We also design a congestion-avoiding expert selection strategy that relieves network congestion to lower iteration latency when modification of expert selection is allowed. We implement and integrate these optimizations into a general system, FASTERMOE, which enables efficient distributed MoE model training. FASTERMOE is evaluated on different cluster systems using up to 64 GPUs. It achieves a 1.37x to 17.87x speedup compared with state-of-the-art systems for large models, including ZeRO, GShard, and BASE Layer. The source code of FASTERMOE is available at https://github.com/thu-pacman/FasterMoE.

Break a Lag: Triple Exponential Moving Average for Enhanced Optimization

Bidirectional Looking with A Novel Double Exponential Moving Average to Adaptive and Non-adaptive Momentum Optimizers

The AdEMAMix Optimizer: Better, Faster, Older

EXAdam: The Power of Adaptive Cross-Moments

Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models

Momentum is All You Need for Data-Driven Adaptive Optimization

Beyond adaptive gradient: Fast-Controlled Minibatch Algorithm for large-scale optimization

DEAM: Adaptive Momentum with Discriminative Weight for Stochastic Optimization

Promoting Exploration in Memory-Augmented Adam using Critical Momenta

Adaptive momentum with discriminative weight for neural network stochastic optimization

Stochastic Momentum Method with Double Acceleration for Regularized Empirical Risk Minimization

NALA: a Nesterov accelerated look-ahead optimizer for deep learning

FASTERMOE: Modeling and Optimizing Training of Large-Scale Dynamic Pre-Trained Models

Adam with model exponential moving average is effective for nonconvex optimization

A comparative study of recently deep learning optimizers

Adathm: Adaptive Gradient Method Based on Estimates of Third-Order Moments

FAMO: Fast Adaptive Multitask Optimization

ELRA: Exponential learning rate adaption gradient descent optimization method

Exponential Moving Average of Weights in Deep Learning: Dynamics and Benefits

CAME: Confidence-guided Adaptive Memory Efficient Optimization

DeMo: Decoupled Momentum Optimization