TerDiT: Ternary Diffusion Models with Transformers

Xudong Lu, Aojun Zhou, Ziyi Lin, Qi Liu, Yuhui Xu, Renrui Zhang, Yafei Wen, Shuai Ren, Peng Gao, Junchi Yan, Hongsheng Li

2024-05-24

Abstract: Recent developments in large-scale pre-trained text-to-image diffusion models have significantly improved the generation of high-fidelity images, particularly with the emergence of diffusion models based on transformer architecture (DiTs). Among these diffusion models, diffusion transformers have demonstrated superior image generation capabilities, boasting lower FID scores and higher scalability. However, deploying large-scale DiT models can be expensive due to their large parameter counts. Although existing research has explored efficient deployment techniques for diffusion models such as model quantization, there is still little work concerning DiT-based models. To address this research gap, we propose TerDiT, a quantization-aware training (QAT) and efficient deployment scheme for ternary diffusion models with transformers. We focus on the ternarization of DiT networks and scale model sizes from 600M to 4.2B. Our work contributes to the exploration of efficient deployment strategies for large-scale DiT models, demonstrating the feasibility of training extremely low-bit diffusion transformer models from scratch while maintaining competitive image generation capacities compared to full-precision models. Code will be available at

Computer Vision and Pattern Recognition, Machine Learning

What problem does this paper attempt to address?

This paper addresses how to deploy large-scale Diffusion Transformer (DiT) models efficiently. DiTs have driven significant progress in text-to-image generation, but their large parameter counts make deployment expensive. Existing work has explored efficiency techniques such as model quantization for diffusion models, yet little of it targets transformer-based diffusion models. The paper proposes TerDiT, a quantization-aware training (QAT) and efficient deployment scheme for ternary diffusion transformers. TerDiT ternarizes the DiT network and scales the model from 600 million to 4.2 billion parameters. The authors find that directly ternarizing the adaLN module produces excessively large scale and offset values in the normalization layer, which hampers training. To mitigate this, the paper proposes a variant of adaLN that applies RMS Norm after the ternary linear layer. With this change, TerDiT trains large DiT models at extremely low bit width while keeping image generation quality comparable to full-precision models. Experimental results show that TerDiT reduces checkpoint size and inference memory consumption while remaining competitive with full-precision models in generation quality. In short, the paper tackles the quantization and deployment of large DiT models and offers a strategy for running diffusion models in resource-constrained environments.
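The adaLN change described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The absmean ternarization rule, the function names (`ternarize`, `ternary_linear`, `rms_norm`, `adaln_modulation`), and the two-way split into scale and shift are all assumptions made for illustration, and the RMS Norm here has no learnable gain.

```python
import math

def ternarize(weights, eps=1e-8):
    """Map a weight matrix to alpha * T with T in {-1, 0, +1}.

    Assumption: an absmean rule (divide by the mean absolute weight,
    round, clip). The summary above does not specify the exact rule.
    """
    flat = [w for row in weights for w in row]
    alpha = sum(abs(w) for w in flat) / len(flat) + eps
    ternary = [[max(-1, min(1, round(w / alpha))) for w in row] for row in weights]
    return ternary, alpha

def ternary_linear(x, ternary, alpha):
    """Compute y = alpha * (T @ x), where T has shape (out_features, in_features)."""
    return [alpha * sum(t * xi for t, xi in zip(row, x)) for row in ternary]

def rms_norm(v, eps=1e-6):
    """Rescale v to unit root-mean-square (no learnable gain in this sketch)."""
    rms = math.sqrt(sum(a * a for a in v) / len(v) + eps)
    return [a / rms for a in v]

def adaln_modulation(cond, ternary, alpha):
    """Hypothetical adaLN head: ternary linear layer, then RMS Norm,
    then a split into (scale, shift) halves."""
    out = rms_norm(ternary_linear(cond, ternary, alpha))
    half = len(out) // 2
    return out[:half], out[half:]

# Toy example: 3-dimensional conditioning vector, 4 output features.
W = [[0.9, -0.05, -1.2],
     [0.02, 0.4, -0.6],
     [-0.7, 0.3, 0.1],
     [1.5, -0.8, 0.05]]
cond = [1.0, 2.0, -0.5]

T, alpha = ternarize(W)
scale, shift = adaln_modulation(cond, T, alpha)
```

Because the RMS Norm rescales the output of the ternary layer to unit root-mean-square, the scale and shift values stay bounded regardless of how large `alpha` becomes, which is the intuition behind placing the normalization after the ternary linear layer.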
