EfficientDM: Efficient Quantization-Aware Fine-Tuning of Low-Bit Diffusion Models

Yefei He, Jing Liu, Weijia Wu, Hong Zhou, Bohan Zhuang
2024-04-13
Abstract: Diffusion models have demonstrated remarkable capabilities in image synthesis and related generative tasks. Nevertheless, their practicality for real-world applications is constrained by substantial computational costs and latency issues. Quantization is a dominant way to compress and accelerate diffusion models, where post-training quantization (PTQ) and quantization-aware training (QAT) are two main approaches, each bearing its own properties. While PTQ exhibits efficiency in terms of both time and data usage, it may lead to diminished performance in low bit-width. On the other hand, QAT can alleviate performance degradation but comes with substantial demands on computational and data resources. In this paper, we introduce a data-free and parameter-efficient fine-tuning framework for low-bit diffusion models, dubbed EfficientDM, to achieve QAT-level performance with PTQ-like efficiency. Specifically, we propose a quantization-aware variant of the low-rank adapter (QALoRA) that can be merged with model weights and jointly quantized to low bit-width. The fine-tuning process distills the denoising capabilities of the full-precision model into its quantized counterpart, eliminating the requirement for training data. We also introduce scale-aware optimization and temporal learned step-size quantization to further enhance performance. Extensive experimental results demonstrate that our method significantly outperforms previous PTQ-based diffusion models while maintaining similar time and data efficiency. Specifically, there is only a 0.05 sFID increase when quantizing both weights and activations of LDM-4 to 4-bit on ImageNet 256x256. Compared to QAT-based methods, our EfficientDM also boasts a 16.2x faster quantization speed with comparable generation quality. Code is available at https://github.com/ThisisBillhe/EfficientDM.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The main problem this paper addresses is how to improve the performance and efficiency of low-bit diffusion models in image generation tasks. Specifically, it studies how quantization can reduce a model's computational cost and latency while keeping its performance at or near that of the full-precision model, making it better suited to low-latency practical applications.

### Main Problems

1. **Computational Cost and Latency**: Diffusion models perform well in image synthesis and other generative tasks, but their practical use is limited by high computational cost and latency. The paper aims to reduce both through effective quantization.
2. **Limitations of Existing Quantization Methods**:
   - **Post-Training Quantization (PTQ)**: It is efficient in both time and data, but it may degrade performance at low bit-widths.
   - **Quantization-Aware Training (QAT)**: It can alleviate this degradation, but it requires large amounts of computation and data, and training takes a long time.

### Solutions

To combine the advantages of PTQ and QAT while avoiding their respective drawbacks, the paper proposes a data-free, quantization-aware, and parameter-efficient fine-tuning framework called EfficientDM. Its specific contributions are:

1. **Quantization-Aware Low-Rank Adapter (QALoRA)**:
   - A quantization-aware low-rank adapter that can be merged with the model weights and jointly quantized to a low bit-width.
   - Merging avoids the extra storage and computation of a separate adapter branch and keeps inference in low-bit arithmetic.
2. **Scale-Aware LoRA Optimization**:
   - Scale-aware LoRA optimization adapts to the differences in weight quantization scale between layers, keeping the optimization effective.
3. **Temporal Learned Step-Size Quantization (TALSQ)**:
   - Learned step-size quantization of activations is extended along the denoising time dimension, which alleviates the quantization error caused by activation distributions changing across denoising steps.

### Experimental Results

The paper verifies the effectiveness of EfficientDM through extensive experiments. The main results are as follows:

- **Unconditional Generation**: On CIFAR-10, EfficientDM achieves an FID of 3.80 at W4A8 precision, outperforming the existing QAT method TDQ (4.13).
- **Conditional Generation**: On ImageNet 256×256, EfficientDM maintains a low FID (6.17) and sFID (7.75) at W4A4 precision, whereas other PTQ methods are unable to complete image generation at this precision.
- **Quantization Speed**: On LDM-4, EfficientDM quantizes the model 16.2 times faster than the QAT-based method, with comparable generation quality.

In conclusion, by combining QALoRA, scale-aware LoRA optimization, and TALSQ, EfficientDM significantly improves the performance and efficiency of low-bit diffusion models, making them more competitive in practical applications.
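
To make the QALoRA and TALSQ ideas summarized above more concrete, here is a minimal NumPy sketch. It is not the authors' implementation. The function names, the symmetric uniform quantizer, and the max-based per-channel scale are all assumptions chosen for illustration. In the paper the step sizes are learned during fine-tuning, whereas here they are fixed placeholder values.

```python
import numpy as np


def quantize_weights(w, n_bits=4):
    """Symmetric per-output-channel uniform quantization (illustrative choice).

    Returns integer codes and the per-channel scale, so that
    codes * scale approximates w.
    """
    qmax = 2 ** (n_bits - 1) - 1       # 7 for 4-bit
    qmin = -(2 ** (n_bits - 1))        # -8 for 4-bit
    # One scale per output channel (row), derived from the row's max magnitude.
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.maximum(scale, 1e-8)    # guard against all-zero rows
    codes = np.clip(np.round(w / scale), qmin, qmax).astype(np.int8)
    return codes, scale


def qalora_merge_and_quantize(w, lora_b, lora_a, n_bits=4):
    """Merge the low-rank update B @ A into W first, then quantize the sum.

    After this step only low-bit codes and scales are kept; there is no
    separate full-precision adapter branch at inference time.
    """
    merged = w + lora_b @ lora_a
    return quantize_weights(merged, n_bits)


def quantize_activation(x, step_sizes, t, n_bits=4):
    """Fake-quantize activations with a step size selected by timestep t.

    step_sizes holds one value per denoising step (hypothetical values here;
    they would be trainable parameters in a learned step-size scheme).
    """
    qmax = 2 ** (n_bits - 1) - 1
    qmin = -(2 ** (n_bits - 1))
    s = step_sizes[t]
    return np.clip(np.round(x / s), qmin, qmax) * s


# Small demonstration with random data.
rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16))                 # frozen full-precision weights
lora_b = rng.normal(size=(8, 2)) * 0.01      # rank-2 adapter factors
lora_a = rng.normal(size=(2, 16))

codes, scale = qalora_merge_and_quantize(w, lora_b, lora_a, n_bits=4)
w_hat = codes * scale                        # dequantized merged weights

step_sizes = np.linspace(0.05, 0.2, 20)      # one step size per denoising step
x = rng.normal(size=(4, 16))
x_early = quantize_activation(x, step_sizes, t=0)
x_late = quantize_activation(x, step_sizes, t=19)
```

The key design point in this sketch is the order of operations: the adapter is added to the weights before quantization, so the result lies on the low-bit grid. Indexing the activation step size by timestep lets each denoising step use a quantization range matched to its own activation distribution.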