TerDiT: Ternary Diffusion Models with Transformers

Xudong Lu, Aojun Zhou, Ziyi Lin, Qi Liu, Yuhui Xu, Renrui Zhang, Yafei Wen, Shuai Ren, Peng Gao, Junchi Yan, Hongsheng Li
2024-05-24
Abstract: Recent developments in large-scale pre-trained text-to-image diffusion models have significantly improved the generation of high-fidelity images, particularly with the emergence of diffusion models based on the transformer architecture (DiTs). Among these diffusion models, diffusion transformers have demonstrated superior image generation capabilities, boasting lower FID scores and higher scalability. However, deploying large-scale DiT models can be expensive due to their large parameter counts. Although existing research has explored efficient deployment techniques for diffusion models, such as model quantization, there is still little work concerning DiT-based models. To address this research gap, we propose TerDiT, a quantization-aware training (QAT) and efficient deployment scheme for ternary diffusion models with transformers. We focus on the ternarization of DiT networks and scale model sizes from 600M to 4.2B. Our work contributes to the exploration of efficient deployment strategies for large-scale DiT models, demonstrating the feasibility of training extremely low-bit diffusion transformer models from scratch while maintaining competitive image generation capacities compared to full-precision models. Code will be available at
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
This paper addresses the efficient deployment of large-scale Diffusion Transformer (DiT) models, which have driven significant progress in text-to-image generation. DiTs generate high-quality images, but their large parameter counts make deployment expensive. Efficiency techniques such as model quantization have been studied for diffusion models in general, yet little work targets transformer-based diffusion models. The paper proposes TerDiT, a quantization-aware training (QAT) and efficient deployment scheme for ternary diffusion transformers. TerDiT ternarizes the DiT network and scales the model from 600 million to 4.2 billion parameters. The authors find that directly ternarizing the adaLN module produces excessively large scale and offset values in the normalization layer, which hinders training. To address this, they propose a variant of adaLN that applies RMS Norm after the ternary linear layer. With this change, TerDiT trains large DiT models at extremely low bit width while keeping image generation quality comparable to full-precision models. Experiments show that TerDiT reduces both checkpoint size and inference memory consumption while remaining competitive with full-precision models in generation quality. In short, the paper shows how to quantize and deploy large DiT models, offering a strategy for running diffusion models in resource-constrained environments.
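To make the adaLN change more concrete, here is a minimal pure-Python sketch of the forward pass only. It is not the authors' implementation. Several things are assumptions the summary does not confirm: the absmean ternarization rule in `ternarize`, the helper names (`ternary_linear`, `rms_norm`, `packed_size_gb`), the toy weights and conditioning vector, and the 2-bits-per-weight packing used for the size estimate. Training-time details of QAT, such as how gradients pass through the rounding step, are left out.

```python
import math


def ternarize(weights, eps=1e-8):
    """Map a 2D weight matrix to values in {-1, 0, +1} plus one float scale.

    Assumption: absmean scaling (scale = mean of |w|). The summary does not
    state which ternarization rule TerDiT uses.
    """
    flat = [abs(w) for row in weights for w in row]
    scale = sum(flat) / len(flat) + eps
    ternary = [[max(-1, min(1, round(w / scale))) for w in row] for row in weights]
    return ternary, scale


def ternary_linear(x, ternary, scale):
    """Compute y = scale * (W_t @ x), where W_t holds only -1, 0, +1."""
    return [scale * sum(t * xi for t, xi in zip(row, x)) for row in ternary]


def rms_norm(values, eps=1e-6):
    """Divide by the root mean square so the outputs have roughly unit scale."""
    rms = math.sqrt(sum(v * v for v in values) / len(values) + eps)
    return [v / rms for v in values]


def packed_size_gb(n_params, bits_per_weight):
    """Rough weight-storage estimate in GB (1 GB = 1e9 bytes), weights only."""
    return n_params * bits_per_weight / 8 / 1e9


# Toy values for illustration, not taken from the paper.
weights = [[0.9, -0.05, -1.2], [0.02, 0.6, -0.7]]
ternary, scale = ternarize(weights)

cond = [10.0, 20.0, -30.0]                 # hypothetical conditioning vector
raw = ternary_linear(cond, ternary, scale)  # can be large in magnitude
normed = rms_norm(raw)                      # RMS Norm applied after the ternary layer

print("ternary weights:", ternary)
print("before RMS Norm:", raw)
print("after RMS Norm: ", normed)

# Assumption: ternary weights packed at 2 bits each, compared with 32-bit floats.
print("4.2B params, 2-bit: ", packed_size_gb(4.2e9, 2), "GB")
print("4.2B params, 32-bit:", packed_size_gb(4.2e9, 32), "GB")
```

In this toy run the raw outputs of the ternary layer are in the tens, while the normalized outputs have a root mean square close to 1. That mirrors the motivation described above: normalizing after the ternary linear layer keeps the values that feed the scale and offset terms in a bounded range.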