Staleness-Centric Optimizations for Efficient Diffusion MoE Inference

Jiajun Luo, Lizhuo Luo, Jianru Xu, Jiajun Song, Rongwei Lu, Chen Tang, Zhi Wang
2024-11-25
Abstract: Mixture-of-experts-based (MoE-based) diffusion models have shown their scalability and ability to generate high-quality images, making them a promising choice for efficient model scaling. However, they rely on expert parallelism across GPUs, necessitating efficient parallelism optimization. While state-of-the-art diffusion parallel inference methods overlap communication and computation via displaced operations, they introduce substantial staleness (the use of outdated activations), which is especially severe in expert parallelism scenarios and leads to significant performance degradation. We identify this staleness issue and propose DICE, a staleness-centric optimization with a three-fold approach: (1) Interweaved Parallelism reduces step-level staleness for free while overlapping communication and computation; (2) Selective Synchronization operates at the layer level and protects critical layers that are vulnerable to stale activations; and (3) Conditional Communication, a token-level, training-free method that dynamically adjusts communication frequency based on token importance. Together, these optimizations effectively reduce staleness, achieving up to 1.2x speedup with minimal quality degradation. Our results establish DICE as an effective, scalable solution for large-scale MoE-based diffusion model inference.
Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
This paper addresses the communication bottleneck in efficient parallel inference for Mixture-of-Experts (MoE)-based diffusion models, and in particular the staleness introduced by asynchronous communication. Specifically:

1. **Identifying the staleness problem**: When large-scale MoE diffusion models are served with expert parallelism, existing methods such as displaced parallelism reduce waiting time by overlapping communication and computation, but they introduce significant staleness: the model consumes outdated activation values, which degrades output quality. In image generation, for example, the Fréchet Inception Distance (FID) score rises from 5.31 to 8.27 (lower is better).
2. **Proposing solutions**: To alleviate this problem, the paper proposes the DICE framework, which reduces staleness through three optimization strategies:
   - **Interweaved Parallelism**: By rescheduling communication and computation, step-level staleness is reduced from two steps to one, which also shrinks the buffer size while maintaining computational efficiency.
   - **Selective Synchronization**: At the layer level, only the deep layers that are sensitive to staleness are synchronized, while the shallow layers continue to run asynchronously, so that critical information is updated in time.
   - **Conditional Communication**: At the token level, the communication frequency is adjusted dynamically according to token importance, prioritizing the activation values of important tokens and reducing unnecessary data transmission.
3. **Verifying the effect**: Experiments show that DICE achieves a speedup of up to 1.2x while maintaining image quality, and that it also reduces memory usage.
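The token-level idea behind Conditional Communication can be illustrated with a small, purely schematic sketch. It is not the paper's implementation: the importance scores, the `threshold` and `stride` parameters, and the helper names `tokens_to_send` and `refresh_cache` are all hypothetical, chosen only to show how important tokens might be refreshed every step while the remaining tokens reuse cached (stale) activations between less frequent updates.

```python
from typing import List, Sequence


def tokens_to_send(importance: Sequence[float], step: int,
                   threshold: float = 0.5, stride: int = 2) -> List[int]:
    """Pick the token indices whose activations are transmitted at this step.

    Assumption for illustration: tokens with importance >= threshold are sent
    every step; all other tokens are sent only every `stride`-th step.
    """
    send = []
    for idx, score in enumerate(importance):
        if score >= threshold or step % stride == 0:
            send.append(idx)
    return send


def refresh_cache(cache: Sequence[float], fresh: Sequence[float],
                  send_idx: Sequence[int]) -> List[float]:
    """Overwrite cached activations only for the tokens that were transmitted."""
    updated = list(cache)
    for idx in send_idx:
        updated[idx] = fresh[idx]
    return updated


# Toy run: four tokens, scalar "activations" equal to the step number.
importance = [0.9, 0.2, 0.7, 0.1]   # hypothetical per-token importance
cache = [0.0, 0.0, 0.0, 0.0]
for step in range(1, 4):
    fresh = [float(step)] * len(importance)
    cache = refresh_cache(cache, fresh, tokens_to_send(importance, step))

print(cache)  # tokens 0 and 2 are current; tokens 1 and 3 lag by one step
```

In this toy run the low-importance tokens are transmitted on only one of the three steps, which is where the communication saving would come from; the cost is that their cached values are one step stale at the end.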
In conclusion, the main contribution of this paper is identifying and mitigating the staleness problem in MoE diffusion model inference. With the DICE framework, it improves inference efficiency while keeping quality degradation minimal.