Keith G. Mills, Mohammad Salameh, Ruichen Chen, Negar Hassanpour, Wei Lu, Di Niu
Abstract: Diffusion Models (DM) have democratized AI image generation through an iterative denoising process. Quantization is a major technique to alleviate the inference cost and reduce the size of DM denoiser networks. However, as denoisers evolve from variants of convolutional U-Nets toward newer Transformer architectures, it is of growing importance to understand the quantization sensitivity of different weight layers, operations and architecture types to performance. In this work, we address this challenge with Qua$^2$SeDiMo, a mixed-precision Post-Training Quantization framework that generates explainable insights on the cost-effectiveness of various model weight quantization methods for different denoiser operation types and block structures. We leverage these insights to make high-quality mixed-precision quantization decisions for a myriad of diffusion models ranging from foundational U-Nets to state-of-the-art Transformers. As a result, Qua$^2$SeDiMo can construct 3.4-bit, 3.9-bit, 3.65-bit and 3.7-bit weight quantization on PixArt-${\alpha}$, PixArt-${\Sigma}$, Hunyuan-DiT and SDXL, respectively. We further pair our weight-quantization configurations with 6-bit activation quantization and outperform existing approaches in terms of quantitative metrics and generative image quality.
What problem does this paper attempt to address?
The paper asks how to quantize the denoiser network of Diffusion Models (DM) to fewer than 4 bits per weight, reducing computational cost and model size without significantly degrading image generation quality.
Specifically, the authors propose a framework named Qua2SeDiMo (Quantifiable Quantization Sensitivity of Diffusion Models), which uses mixed-precision Post-Training Quantization (PTQ) to optimize the quantization configuration of different weight layers, operation types, and architecture types in diffusion models. The main contributions of the paper are:
1. **Directly Relate Quantization Methods and Bit-Precision to End-to-End Performance**:
- Proposed a method that directly relates the quantization method and bit-precision of each layer (operation) to end-to-end network metrics such as model size or task performance.
- Learned to assign the optimal configuration for each layer by evaluating fewer than 500 sampled quantization configurations.
2. **Reveal the Quantization Sensitivity of Specific Model Layers and Blocks**:
- Found that U-Nets prefer uniform scale quantization, while DiT models prefer clustering-based methods.
- Pointed out that the ResNet blocks in U-Nets are more sensitive to quantization and require higher bit-precision to maintain end-to-end performance and image quality.
- Found that the final output layer of the DiT model is more sensitive to quantization.
3. **Construct an Efficient Mixed-Precision Quantization Configuration**:
- Achieved 3.4-, 3.9-, 3.65-, 3.7-, and 3.5-bit PTQ on PixArt-α, PixArt-Σ, Hunyuan-DiT, SDXL, and DiT-XL/2, respectively, without a calibration dataset.
- Combined weight quantization with activation quantization, surpassing existing techniques such as Q-Diffusion, TFMQ-DM, and ViDiT-Q in visual quality, FID, and CLIP scores.
Through these contributions, Qua2SeDiMo addresses the challenges of low-bit quantization in diffusion models and provides insight into the quantization sensitivity of different operations and architecture types, which can help researchers and engineers design and optimize diffusion models more effectively.
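The fractional bit-widths quoted above are most naturally read as averages over a mixed-precision assignment in which each layer gets its own integer bit-width. A minimal sketch of that arithmetic, assuming the average is weighted by parameter count and ignoring any storage overhead for scales or codebooks. The function name, the `(num_params, bits)` layout, and the toy layer sizes are hypothetical and not taken from the paper:

```python
def average_bits(layers):
    """Parameter-weighted average bit-width of a mixed-precision configuration.

    `layers` is a list of (num_params, bits) pairs (a hypothetical layout).
    """
    total_params = sum(num_params for num_params, _ in layers)
    total_bits = sum(num_params * bits for num_params, bits in layers)
    return total_bits / total_params


# Toy configuration with made-up layer sizes: two 4-bit layers around a
# larger 3-bit layer.
toy_config = [(1000, 4), (3000, 3), (1000, 4)]
avg = average_bits(toy_config)
print(f"average bit-width: {avg:.2f}")
```

Under this reading, pushing large, less sensitive layers to 3 bits while keeping sensitive ones at 4 bits is what brings the average below 4.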
### Related Formulas
1. **K-Means Clustering Quantization**:
\[
W_Q=\text{indices corresponding to }K = 2^{N_Q}\text{ cluster centroids}
\]
where \( N_Q \) is the bit-precision after quantization, and \( W_Q \) is the quantized weight matrix. Each weight is stored as the index of its nearest centroid, so an index takes \( N_Q \) bits.
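To make the index-plus-codebook idea concrete, here is a minimal sketch of 1-D K-Means weight quantization using plain Lloyd iterations. It assumes the weights are given as a flat list of floats and that centroids start evenly spaced over the weight range. The function name, the initialisation, and the iteration count are illustrative choices, not details from the paper:

```python
def kmeans_quantize(weights, n_bits, iters=20):
    """Cluster scalar weights into K = 2**n_bits centroids (Lloyd's algorithm).

    Returns (indices, centroids): indices[i] is the centroid id of weights[i].
    """
    k = 2 ** n_bits
    lo, hi = min(weights), max(weights)
    # Assumed initialisation: centroids evenly spaced across [lo, hi].
    centroids = [lo + (hi - lo) * (j + 0.5) / k for j in range(k)]

    def assign():
        # Index of the nearest centroid for every weight.
        return [min(range(k), key=lambda j: abs(w - centroids[j]))
                for w in weights]

    for _ in range(iters):
        indices = assign()
        for j in range(k):
            members = [w for w, i in zip(weights, indices) if i == j]
            if members:  # leave empty clusters where they are
                centroids[j] = sum(members) / len(members)
    return assign(), centroids


# Toy weights forming four obvious clusters, quantized to 2 bits (K = 4).
toy_weights = [-1.0, -0.9, 0.0, 0.1, 1.0, 1.1, 2.0, 2.1]
indices, centroids = kmeans_quantize(toy_weights, n_bits=2)
dequantized = [centroids[i] for i in indices]  # codebook lookup
```

Only `indices` and the small `centroids` codebook need to be stored. Dequantization is a table lookup.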
2. **Uniform Affine Quantization (UAQ)**:
\[
\Delta=\frac{\max(|W_{FP}|)}{2^{N_Q}- 1}
\]
\[
W_Q=\text{clamp}\left(\left\lfloor\frac{W_{FP}}{\Delta}\right\rfloor, - 2^{N_Q - 1}+1,2^{N_Q - 1}-1\right)
\]
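The two UAQ formulas can be transcribed almost literally. The sketch below follows them exactly as written in this summary (including the floor operation), reads \( W_{FP} \) as the full-precision weights given as a flat list of floats, and uses a hypothetical function name. The zero-weight guard is an addition for robustness:

```python
import math


def uaq_quantize(weights, n_bits):
    """Uniform quantization of scalar weights, following the formulas as written.

    Returns (w_q, delta): integer codes and the step size.
    """
    w_max = max(abs(w) for w in weights)
    if w_max == 0:
        # Added guard (not in the formulas): all-zero weights quantize to zero.
        return [0] * len(weights), 0.0
    delta = w_max / (2 ** n_bits - 1)
    q_min = -(2 ** (n_bits - 1)) + 1
    q_max = 2 ** (n_bits - 1) - 1
    # With this delta, |w / delta| can reach 2**n_bits - 1, so the clamp is
    # active for large-magnitude weights.
    w_q = [min(max(math.floor(w / delta), q_min), q_max) for w in weights]
    return w_q, delta


toy_weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
w_q, delta = uaq_quantize(toy_weights, n_bits=4)
dequantized = [delta * q for q in w_q]  # approximate reconstruction
```

Unlike K-Means, no codebook is stored: a single scale \( \Delta \) per tensor is enough to map the integer codes back to approximate weights.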
3. **Adjusted Quantization Step**:
\[
\Delta_\alpha=\frac{\max(|W_{FP}|)\cdot(1 - 0.01\alpha)}{2^{N_Q}-1}
\]
where \( \alpha\in[0,100)\) shrinks the step size and is chosen to minimize the \( L_p \) error between the full-precision weights and their dequantized values:
\[
\min_\alpha\|W_{FP}-\Delta_\alpha W_Q\|_p
\]
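The search over \( \alpha \) can be sketched as a simple grid search. This is an illustration under stated assumptions: \( \alpha \) is scanned over the integers 0 to 99, \( p = 2 \) is used as a default, and the function names are hypothetical. The summary does not specify the grid or the value of \( p \):

```python
import math


def quantize_with_step(weights, delta, n_bits):
    """Floor-and-clamp quantization with a given step size (as in the UAQ formula)."""
    q_min = -(2 ** (n_bits - 1)) + 1
    q_max = 2 ** (n_bits - 1) - 1
    return [min(max(math.floor(w / delta), q_min), q_max) for w in weights]


def search_alpha(weights, n_bits, p=2.0):
    """Grid-search alpha in {0, ..., 99} to minimize ||W_FP - delta_alpha * W_Q||_p.

    Returns (alpha, error, delta, w_q) for the best alpha found.
    """
    w_max = max(abs(w) for w in weights)
    if w_max == 0:
        # Added guard: nothing to search for all-zero weights.
        return 0, 0.0, 0.0, [0] * len(weights)
    best = None
    for alpha in range(100):  # integer grid is an assumption
        delta = w_max * (1 - 0.01 * alpha) / (2 ** n_bits - 1)
        w_q = quantize_with_step(weights, delta, n_bits)
        err = sum(abs(w - delta * q) ** p
                  for w, q in zip(weights, w_q)) ** (1 / p)
        if best is None or err < best[1]:
            best = (alpha, err, delta, w_q)
    return best


toy_weights = [-1.0, -0.5, 0.0, 0.25, 0.3, 1.0]
alpha, err, delta, w_q = search_alpha(toy_weights, n_bits=4)
```

Because \( \alpha = 0 \) is part of the grid, the selected step is never worse in \( L_p \) error than the unadjusted \( \Delta \).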
These formulas define the weight quantization methods, clustering-based and uniform, that the Qua2SeDiMo framework chooses among when assigning a method and bit-precision to each layer.