Scaling Diffusion Transformers to 16 Billion Parameters

Zhengcong Fei, Mingyuan Fan, Changqian Yu, Debang Li, Junshi Huang
2024-09-06
Abstract: In this paper, we present DiT-MoE, a sparse version of the diffusion Transformer that is scalable and competitive with dense networks while offering highly optimized inference. DiT-MoE includes two simple designs, shared expert routing and an expert-level balance loss, which capture common knowledge and reduce redundancy among the routed experts. When applied to conditional image generation, an in-depth analysis of expert specialization yields several interesting observations: (i) expert selection shows a preference for spatial position and denoising time step, while being insensitive to class-conditional information; (ii) as the MoE layers go deeper, expert selection gradually shifts from specific spatial positions toward dispersion and balance; (iii) expert specialization tends to be more concentrated at early time steps and gradually becomes uniform after the halfway point. We attribute this to the diffusion process, which first models low-frequency spatial information and then high-frequency complex information. Guided by these observations, a series of DiT-MoE models experimentally achieves performance on par with dense networks while requiring much less computation during inference. More encouragingly, we demonstrate the potential of DiT-MoE with synthesized image data, scaling the diffusion model to 16.5B parameters and attaining a new SoTA FID-50K score of 1.80 at 512$\times$512 resolution. Project page: https://github.com/feizc/DiT-MoE.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper aims to reduce the computational cost of diffusion models for conditional image generation while maintaining high performance. Specifically, it proposes DiT-MoE (Diffusion Transformer Mixture of Experts), a sparse version of the diffusion Transformer. By introducing shared expert routing and an expert-level balance loss, DiT-MoE aims to improve parameter efficiency, reduce redundancy, and balance the load among experts. The paper also analyzes how expert selection varies with spatial position, denoising time step, and class-conditional information. The main contributions of the paper are:

1. **Application of MoE in Diffusion Transformers**: Proposes DiT-MoE, a sparsely activated diffusion Transformer for image synthesis that captures common knowledge and minimizes redundancy among routed experts through shared expert components and an auxiliary expert-level balance loss.
2. **Expert Routing Analysis**: A statistical analysis of expert selection across different scenarios reveals patterns in expert selection preferences, which can guide future network design and interpretability research.
3. **Large-Scale Model Parameters**: Introduces a series of DiT-MoE models and demonstrates that they can be trained stably and run efficiently at inference. Notably, by using synthetic data, the model was scaled up to 16.5 billion parameters, achieving a new best FID-50K score of 1.80 at 512 × 512 resolution.
4. **Performance and Inference**: Experimental results show that DiT-MoE significantly outperforms dense models in conditional image generation and can match the performance of the largest dense models while requiring only half the computation during inference.

Finally, the paper releases the code and trained model checkpoints.
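The two designs described above, a shared expert applied to every token plus top-k routed experts, and an auxiliary balance term, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `moe_forward`, the single-matrix `tanh` experts, the softmax router, and the Switch-Transformer-style form of the balance term (selection fraction times mean router probability, scaled by a hypothetical coefficient `alpha`) are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_forward(tokens, router_w, expert_ws, shared_w, top_k=2, alpha=0.01):
    """Illustrative sparse MoE layer (assumed design, not the paper's code).

    tokens:    (T, D) token features
    router_w:  (D, N) router weights for N routed experts
    expert_ws: (N, D, D) one weight matrix per routed expert (toy experts)
    shared_w:  (D, D) weight matrix of the always-active shared expert
    """
    num_tokens = tokens.shape[0]
    num_experts = router_w.shape[1]

    # Router probabilities and the top-k expert indices per token.
    scores = softmax(tokens @ router_w)                  # (T, N)
    top_idx = np.argsort(-scores, axis=-1)[:, :top_k]    # (T, K)

    # The shared expert processes every token.
    out = np.tanh(tokens @ shared_w)

    # Each routed expert processes only the tokens that selected it,
    # weighted by the router probability for that expert.
    for k in range(top_k):
        for e in range(num_experts):
            mask = top_idx[:, k] == e
            if mask.any():
                gate = scores[mask, e][:, None]
                out[mask] += gate * np.tanh(tokens[mask] @ expert_ws[e])

    # Assumed balance term: f_i is the normalized fraction of tokens routed
    # to expert i, p_i is the mean router probability of expert i.
    selected = np.zeros((num_tokens, num_experts))
    np.put_along_axis(selected, top_idx, 1.0, axis=-1)
    f = selected.mean(axis=0) * num_experts / top_k
    p = scores.mean(axis=0)
    balance_loss = alpha * np.sum(f * p)
    return out, balance_loss, top_idx

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, D, N = 16, 8, 4
    out, loss, idx = moe_forward(
        rng.normal(size=(T, D)),
        rng.normal(size=(D, N)),
        rng.normal(size=(N, D, D)) * 0.1,
        rng.normal(size=(D, D)) * 0.1,
    )
    print(out.shape, float(loss))
```

In this sketch only `top_k` of the `N` routed experts run per token, which is where the inference savings of a sparse layer come from. When the router probabilities are uniform, the assumed balance term equals `alpha`; it grows when tokens and probability mass pile onto the same few experts.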
In summary, the paper reduces the computational cost of diffusion models in conditional image generation by introducing sparse mixture-of-experts computation, while maintaining high performance.
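The expert routing analysis listed among the contributions amounts to counting how often each expert is selected within groups of tokens, such as patch positions, time-step buckets, or class labels. The sketch below shows one way such statistics could be computed. The function names, the grouping scheme, and the use of normalized entropy as a measure of how concentrated the selection is are illustrative assumptions, not the paper's procedure.

```python
import numpy as np

def expert_selection_frequency(top_idx, group_ids, num_experts, num_groups):
    """Per-group expert selection frequencies.

    top_idx:   (T, K) integer expert indices selected for each token
    group_ids: (T,) integer group label per token (hypothetical grouping,
               e.g. patch position or denoising time-step bucket)
    Assumes every group contains at least one token.
    """
    top_idx = np.asarray(top_idx)
    group_ids = np.asarray(group_ids)
    k = top_idx.shape[1]
    counts = np.zeros((num_groups, num_experts))
    # Repeat each token's group label once per selected expert, then
    # accumulate counts (np.add.at handles repeated index pairs correctly).
    np.add.at(counts, (np.repeat(group_ids, k), top_idx.ravel()), 1)
    return counts / counts.sum(axis=1, keepdims=True)

def normalized_entropy(freq):
    """Entropy of each row divided by log(num_experts).

    Values near 1 indicate near-uniform selection; values near 0 indicate
    selection concentrated on a few experts.
    """
    p = np.clip(freq, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1) / np.log(freq.shape[1])

if __name__ == "__main__":
    # Toy routing decisions: 4 tokens, top-2 routing, 4 experts, 2 groups.
    top_idx = np.array([[0, 1], [0, 1], [2, 3], [0, 2]])
    group_ids = np.array([0, 0, 1, 1])
    freq = expert_selection_frequency(top_idx, group_ids, num_experts=4, num_groups=2)
    print(freq)
    print(normalized_entropy(freq))
```

Comparing the entropy across groups gives a rough indication of whether selection depends on the grouping variable: similar frequency rows for different class labels would be consistent with the reported insensitivity to class-conditional information.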