Abstract: One defining characteristic of Mixture-of-Expert (MoE) models is their capacity for conducting sparse computation via expert routing, leading to remarkable scalability. However, backpropagation, the cornerstone of deep learning, requires dense computation, thereby posing challenges in MoE gradient computations. Here, we introduce SparseMixer, a scalable gradient estimator that bridges the gap between backpropagation and sparse expert routing. Unlike typical MoE training, which strategically neglects certain gradient terms for the sake of sparse computation and scalability, SparseMixer provides scalable gradient approximations for these terms, enabling reliable gradient estimation in MoE training. Grounded in a numerical ODE framework, SparseMixer harnesses the mid-point method, a second-order ODE solver, to deliver precise gradient approximations with negligible computational overhead. Applied to Switch Transformer on both pre-training and machine translation tasks, SparseMixer shows considerable performance gains, accelerating training convergence by up to 2 times.
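To make "sparse computation via expert routing" concrete, here is a minimal sketch of a top-1 routed forward pass. It is a toy illustration, not the paper's implementation: the scalar "experts", the names `top1_moe` and `make_expert`, and the router logits are all assumptions made for the example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top1_moe(x, router_logits, experts):
    """Toy top-1 routing: evaluate only the highest-probability expert."""
    probs = softmax(router_logits)
    # Discrete choice of expert; argmax itself has no useful gradient.
    chosen = max(range(len(probs)), key=probs.__getitem__)
    # Sparse step: the other experts are never called.
    return probs[chosen] * experts[chosen](x), chosen

calls = []  # records which experts were actually evaluated

def make_expert(index, scale):
    """Hypothetical scalar expert that logs every call it receives."""
    def expert(x):
        calls.append(index)
        return scale * x
    return expert

experts = [make_expert(i, s) for i, s in enumerate([1.0, 2.0, 3.0])]
y, chosen = top1_moe(2.0, [0.1, 2.0, -1.0], experts)
print(chosen, calls, y)
```

Only one expert runs per input, which is where the scalability comes from. The expert selection is a discrete step, and that is the point at which, according to the abstract, dense backpropagation and sparse routing come into conflict.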

What problem does this paper attempt to address?

The paper primarily addresses the expert routing problem in Mixture-of-Expert (MoE) models, particularly how to effectively perform backpropagation while maintaining sparse computation to achieve efficient gradient estimation. The core contribution of the paper is the proposal of the SparseMixer method, a scalable gradient estimator designed to bridge the gap between backpropagation and sparse expert routing. Specifically, SparseMixer addresses the problem in the following ways:

1. **Background and Challenges**:
   - MoE models achieve efficient and sparse computation through expert routing.
   - However, traditional backpropagation requires dense computation, which conflicts with the sparsity of MoE models.
   - To maintain computational efficiency, existing MoE training often ignores certain gradient terms (referred to as \(\nabla_0\) in the paper), which can lead to slow training convergence and suboptimal model performance.

2. **Solution**:
   - SparseMixer provides a scalable method to approximate these ignored gradient terms.
   - It is based on the numerical ordinary differential equation (ODE) framework and utilizes the mid-point method, a second-order ODE solver, to accurately approximate gradients with minimal computational overhead.
   - SparseMixer not only reliably estimates gradients but also preserves the sparse computation characteristic of MoE models.

3. **Experimental Validation**:
   - Applying SparseMixer to pre-training and machine translation tasks shows that it can significantly accelerate training convergence, by up to 2 times.
   - SparseMixer also helps MoE models achieve better expert routing training, thereby improving overall model performance.

In summary, the paper aims to resolve the incompatibility between backpropagation and expert routing in MoE models. By implementing the SparseMixer method, it achieves efficient and accurate gradient estimation, thereby enhancing the overall performance of MoE models.
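The mid-point method mentioned in the summary is a generic second-order ODE solver. The sketch below is not SparseMixer's gradient estimator; it only illustrates what "second-order" means by comparing the mid-point method with first-order Euler on the test problem dy/dt = y, y(0) = 1, whose exact value at t = 1 is e. The function names and the test problem are assumptions chosen for the example.

```python
import math

def euler_step(f, t, y, h):
    """First-order Euler step: use the slope at the start of the interval."""
    return y + h * f(t, y)

def midpoint_step(f, t, y, h):
    """Second-order mid-point step: use the slope at the interval's midpoint."""
    y_half = y + 0.5 * h * f(t, y)
    return y + h * f(t + 0.5 * h, y_half)

def integrate(step, f, y0, t0, t1, n_steps):
    """Apply a one-step method n_steps times from t0 to t1."""
    h = (t1 - t0) / n_steps
    t, y = t0, y0
    for _ in range(n_steps):
        y = step(f, t, y, h)
        t += h
    return y

def growth(t, y):
    """Right-hand side of dy/dt = y."""
    return y

exact = math.e
err_euler = abs(integrate(euler_step, growth, 1.0, 0.0, 1.0, 10) - exact)
err_mid = abs(integrate(midpoint_step, growth, 1.0, 0.0, 1.0, 10) - exact)
err_mid_fine = abs(integrate(midpoint_step, growth, 1.0, 0.0, 1.0, 20) - exact)

print(f"Euler error (10 steps):     {err_euler:.5f}")
print(f"Mid-point error (10 steps): {err_mid:.5f}")
print(f"Mid-point error (20 steps): {err_mid_fine:.5f}")
```

With the same number of steps, the mid-point error is much smaller than the Euler error, and halving the step size cuts the mid-point error by roughly a factor of four. The price is one extra function evaluation per step.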

Sparse Backpropagation for MoE Training

EvoMoE: An Evolutional Mixture-of-Experts Training Framework via Dense-To-Sparse Gate

GRIN: GRadient-INformed MoE

MoEC: Mixture of Expert Clusters

Efficient Routing in Sparse Mixture-of-Experts

From Sparse to Soft Mixtures of Experts

Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models

Sparse-MLP: A Fully-MLP Architecture with Conditional Computation

Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers

Residual Mixture of Experts

Mixture of Diverse Size Experts

Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference

LocMoE: A Low-Overhead MoE for Large Language Model Training

MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts

Sparsely Activated Mixture-of-Experts are Robust Multi-Task Learners

FasterMoE

HetuMoE: An Efficient Trillion-scale Mixture-of-Expert Distributed Training System

Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models

XMoE: Sparse Models with Fine-grained and Adaptive Expert Selection

Janus: A Unified Distributed Training Framework for Sparse Mixture-of-Experts Models