Abstract: In this paper, we present DiT-MoE, a sparse version of the diffusion Transformer that is scalable and competitive with dense networks while exhibiting highly optimized inference. DiT-MoE includes two simple designs, shared expert routing and an expert-level balance loss, which capture common knowledge and reduce redundancy among the different routed experts. When applied to conditional image generation, a deep analysis of expert specialization yields several observations: (i) Expert selection shows a preference for spatial position and denoising time step, while being insensitive to class-conditional information; (ii) As the MoE layers go deeper, the selection of experts gradually shifts from specific spatial positions to dispersion and balance; (iii) Expert specialization tends to be more concentrated at early time steps and becomes gradually more uniform after the halfway point. We attribute this to the diffusion process, which first models low-frequency spatial information and then high-frequency complex information. Based on the above guidance, a series of DiT-MoE models experimentally achieves performance on par with dense networks while requiring much less computational load during inference. More encouragingly, we demonstrate the potential of DiT-MoE with synthesized image data, scaling the diffusion model to 16.5B parameters and attaining a new SoTA FID-50K score of 1.80 at 512$\times$512 resolution. The project page: https://github.com/feizc/DiT-MoE

What problem does this paper attempt to address?

The paper attempts to reduce the computational cost of diffusion models in conditional image generation while maintaining high performance. Specifically, it proposes a sparse version of the diffusion Transformer called DiT-MoE (Diffusion Transformer Mixture of Experts). By introducing shared expert routing and an expert-level balance loss, DiT-MoE aims to improve parameter efficiency, reduce redundancy, and balance the load among experts. The paper also examines how expert selection depends on spatial position, denoising time step, and class-conditional information.

The main contributions of the paper are:

1. **Application of MoE in Diffusion Transformers**: Proposes DiT-MoE, a sparsely activated diffusion Transformer for image synthesis, which captures common knowledge and minimizes redundancy among routed experts through shared expert components and an auxiliary expert-level balance loss.
2. **Expert Routing Analysis**: A statistical analysis of expert selection across different scenarios reveals preferences in how experts are chosen, which can guide future network design and interpretability research.
3. **Large-Scale Model Parameters**: Introduces a series of DiT-MoE models and demonstrates that they can be trained stably and run efficiently at inference. Notably, by using synthetic data, the model was scaled up to 16.5 billion parameters, achieving a new best FID-50K score of 1.80 at 512 × 512 resolution.
4. **Performance and Inference**: Experimental results show that DiT-MoE significantly outperforms dense models in conditional image generation and can match the performance of the largest dense models while requiring only half the computation during inference.

Finally, the paper releases the code and trained model checkpoints.
In summary, the paper effectively addresses the computational cost issue of diffusion models in conditional image generation tasks by introducing sparse computation techniques while maintaining high performance.
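
As a rough illustration of the two designs named above (shared experts that every token passes through, plus an expert-level balance loss over the routed experts), the following Python sketch may help. It is not the authors' implementation: the function names `route_tokens`, `moe_forward`, and `balance_loss`, the scalar "tokens", and the coefficient `alpha` are all hypothetical, and the loss uses the Switch Transformer style form alpha * N * sum_i(f_i * P_i), which is assumed here and may differ from the paper's exact formulation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_tokens(logits, top_k):
    """Turn router logits (one row per token) into probabilities
    and pick the indices of the top_k routed experts for each token."""
    probs = [softmax(row) for row in logits]
    chosen = [
        sorted(range(len(p)), key=lambda i: p[i], reverse=True)[:top_k]
        for p in probs
    ]
    return probs, chosen


def moe_forward(token, token_probs, token_chosen, routed_experts, shared_experts):
    """Output for a single token (a scalar here, for simplicity).

    Shared experts are always applied; routed experts are applied only
    if selected, weighted by their (un-renormalized) router probability.
    """
    out = sum(expert(token) for expert in shared_experts)
    out += sum(token_probs[i] * routed_experts[i](token) for i in token_chosen)
    return out


def balance_loss(probs, chosen, top_k, alpha=0.01):
    """Assumed auxiliary loss: alpha * N * sum_i f_i * P_i.

    f_i: fraction of routing slots assigned to expert i.
    P_i: mean router probability given to expert i.
    The value is smallest when load is spread evenly across experts.
    """
    num_tokens = len(probs)
    num_experts = len(probs[0])
    loss = 0.0
    for i in range(num_experts):
        f_i = sum(i in c for c in chosen) / (num_tokens * top_k)
        p_i = sum(p[i] for p in probs) / num_tokens
        loss += f_i * p_i
    return alpha * num_experts * loss


# Balanced routing: each of 4 tokens prefers a different expert.
balanced_logits = [[2.0 if e == t else 0.0 for e in range(4)] for t in range(4)]
# Collapsed routing: every token prefers expert 0.
collapsed_logits = [[2.0, 0.0, 0.0, 0.0] for _ in range(4)]

b_probs, b_chosen = route_tokens(balanced_logits, top_k=1)
c_probs, c_chosen = route_tokens(collapsed_logits, top_k=1)
balanced = balance_loss(b_probs, b_chosen, top_k=1)
collapsed = balance_loss(c_probs, c_chosen, top_k=1)
```

In this toy setup the balanced routing gives a lower loss than the collapsed routing, which is the behavior such a term is meant to encourage during training. The shared experts sit outside the router entirely, so they receive every token regardless of how the routed experts are selected.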

Scaling Diffusion Transformers to 16 Billion Parameters

EC-DIT: Scaling Diffusion Transformers with Adaptive Expert-Choice Routing

Dynamic Diffusion Transformer

Scalable Diffusion Models with Transformers

TerDiT: Ternary Diffusion Models with Transformers

Scaling Laws For Diffusion Transformers

Scaling Diffusion Policy in Transformer to 1 Billion Parameters for Robotic Manipulation

DiM: Diffusion Mamba for Efficient High-Resolution Image Synthesis

MDTv2: Masked Diffusion Transformer is a Strong Image Synthesizer

EDT: An Efficient Diffusion Transformer Framework Inspired by Human-like Sketching

Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models and Time-Dependent Layer Normalization

DiffiT: Diffusion Vision Transformers for Image Generation

Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers

Switch Diffusion Transformer: Synergizing Denoising Tasks with Sparse Mixture-of-Experts

Scaling Diffusion Mamba with Bidirectional SSMs for Efficient Image and Video Generation

FasterDiT: Towards Faster Diffusion Transformers Training without Architecture Modification

Diffscaler: Enhancing the Generative Prowess of Diffusion Transformers

SD-DiT: Unleashing the Power of Self-supervised Discrimination in Diffusion Transformer

DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

Accelerating Vision Diffusion Transformers with Skip Branches