Abstract: Mixture-of-experts-based (MoE-based) diffusion models have shown their scalability and ability to generate high-quality images, making them a promising choice for efficient model scaling. However, they rely on expert parallelism across GPUs, which makes efficient parallelism optimization necessary. While state-of-the-art diffusion parallel inference methods overlap communication and computation via displaced operations, they introduce substantial staleness (the use of outdated activations), which is especially severe in expert parallelism scenarios and leads to significant performance degradation. We identify this staleness issue and propose DICE, a staleness-centric optimization with a three-fold approach: (1) Interweaved Parallelism reduces step-level staleness for free while overlapping communication and computation; (2) Selective Synchronization operates at the layer level and protects critical layers that are vulnerable to stale activations; and (3) Conditional Communication is a token-level, training-free method that dynamically adjusts communication frequency based on token importance. Together, these optimizations effectively reduce staleness, achieving up to 1.2x speedup with minimal quality degradation. Our results establish DICE as an effective, scalable solution for large-scale MoE-based diffusion model inference.

What problem does this paper attempt to address?

This paper addresses the communication bottleneck in efficient parallel inference for Mixture-of-Experts (MoE)-based diffusion models, and in particular the staleness caused by asynchronous communication. Specifically:

1. **Identifying the staleness problem**: The paper points out that, when large-scale MoE diffusion models are served with expert parallelism, existing methods such as displaced parallelism reduce waiting time by overlapping communication and computation, but they also introduce significant staleness, that is, the use of outdated activation values, which degrades model quality. In image generation tasks, the Fréchet Inception Distance (FID) score rises from 5.31 to 8.27 (a higher FID indicates lower image quality).
2. **Proposing solutions**: To alleviate this problem, the paper proposes the DICE framework, which reduces staleness through three optimization strategies:
   - **Interweaved Parallelism**: By redefining the schedule of communication and computation, the step-level staleness is reduced from two steps to one step, which also reduces the buffer size while maintaining computational efficiency.
   - **Selective Synchronization**: At the layer level, only the deep layers that are sensitive to staleness are synchronized, while the shallow layers continue to be processed asynchronously, so that key information is updated in time.
   - **Conditional Communication**: At the token level, the communication frequency is adjusted dynamically according to token importance, giving priority to the activation values of important tokens and reducing unnecessary data transmission.
3. **Verifying the effect**: The paper verifies the effectiveness of the DICE framework through experiments, showing that it achieves a speedup of up to 1.2x and reduces memory usage while maintaining image quality.

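
The token-level idea behind Conditional Communication can be illustrated with a small sketch. This is not the paper's implementation: the function names, the fixed importance `threshold`, and the `refresh_interval` for low-importance tokens are all hypothetical choices made for illustration, and the per-token importance scores are assumed to be given (for example, derived from router scores). Important tokens are exchanged at every step, while the remaining tokens reuse a cached (stale) activation and are refreshed only periodically.

```python
def select_tokens_to_send(importance, step, threshold=0.5, refresh_interval=2):
    """Pick the token indices whose fresh activations are communicated at this step.

    Hypothetical policy: tokens with importance >= threshold are always sent;
    all other tokens are sent only on steps divisible by refresh_interval.
    """
    return [
        idx
        for idx, score in enumerate(importance)
        if score >= threshold or step % refresh_interval == 0
    ]


def conditional_exchange(activations, importance, step, cache,
                         threshold=0.5, refresh_interval=2):
    """Simulate one exchange step.

    `cache` is a dict mapping token index -> last communicated activation.
    Returns the activations seen by the receiver and the number of tokens sent.
    """
    to_send = set(select_tokens_to_send(importance, step, threshold, refresh_interval))
    received = []
    sent = 0
    for idx, fresh in enumerate(activations):
        # A token with no cached value yet must be sent at least once.
        if idx in to_send or idx not in cache:
            cache[idx] = fresh
            sent += 1
        # Tokens that were not sent fall back to their cached (stale) value.
        received.append(cache[idx])
    return received, sent


cache = {}
importance = [0.9, 0.1, 0.7, 0.2]
out0, sent0 = conditional_exchange([0.0, 1.0, 2.0, 3.0], importance, 0, cache)
out1, sent1 = conditional_exchange([10.0, 11.0, 12.0, 13.0], importance, 1, cache)
print(sent0, sent1, out1)
```

In this toy run, all four tokens are sent at step 0, while at step 1 only the two high-importance tokens are sent and the other two keep their step-0 values, which halves the traffic for that step at the cost of one step of staleness on the less important tokens.
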
In conclusion, the main contribution of this paper lies in identifying and mitigating the staleness problem in MoE diffusion model inference. The proposed DICE framework improves inference efficiency while keeping the quality degradation minimal.
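
The step-level and layer-level notions of staleness described above can also be made concrete with a short bookkeeping sketch. It is only an illustration under stated assumptions: `layer_staleness_plan`, `source_step`, and the parameter `num_sync_layers` are hypothetical names, and how many deep layers to synchronize is not specified in the summary. A staleness of 2 corresponds to the two-step staleness attributed to displaced parallelism, 1 to the one-step staleness of Interweaved Parallelism, and 0 to a synchronized layer.

```python
def layer_staleness_plan(num_layers, num_sync_layers, async_staleness=1):
    """Return the staleness (in diffusion steps) applied to each layer.

    The last `num_sync_layers` (deepest) layers are synchronized (staleness 0);
    all shallower layers run asynchronously with `async_staleness`.
    """
    if not 0 <= num_sync_layers <= num_layers:
        raise ValueError("num_sync_layers must be between 0 and num_layers")
    first_sync_layer = num_layers - num_sync_layers
    return [
        0 if layer >= first_sync_layer else async_staleness
        for layer in range(num_layers)
    ]


def source_step(current_step, staleness):
    """Diffusion step whose activations are consumed at `current_step`."""
    # Early steps have no older activations available, so clamp at step 0.
    return max(current_step - staleness, 0)


# Two-step staleness everywhere versus one-step staleness with two synchronized deep layers.
displaced = layer_staleness_plan(num_layers=6, num_sync_layers=0, async_staleness=2)
interweaved_sync = layer_staleness_plan(num_layers=6, num_sync_layers=2, async_staleness=1)
print(displaced)
print(interweaved_sync)
print([source_step(5, s) for s in interweaved_sync])
```

With this plan, at step 5 the four shallow layers consume activations from step 4, while the two deep layers consume activations from step 5 itself.
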

Staleness-Centric Optimizations for Efficient Diffusion MoE Inference
