Progressive Knowledge Distillation Of Stable Diffusion XL Using Layer Level Loss

Yatharth Gupta, Vishnu V. Jaddipal, Harish Prabhala, Sayak Paul, Patrick Von Platen
2024-01-05
Abstract: Stable Diffusion XL (SDXL) has become the best open source text-to-image model (T2I) for its versatility and top-notch image quality. Efficiently addressing the computational demands of SDXL models is crucial for wider reach and applicability. In this work, we introduce two scaled-down variants, Segmind Stable Diffusion (SSD-1B) and Segmind-Vega, with 1.3B and 0.74B parameter UNets, respectively, achieved through progressive removal using layer-level losses focusing on reducing the model size while preserving generative quality. We release these models weights at
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The paper mainly addresses the following issues:

### Research Background and Objectives

- **Background**: Stable Diffusion XL (SDXL), as an advanced Text-to-Image (T2I) generation model, excels in image quality and diversity. However, its large model size limits its application in resource-constrained environments.
- **Objective**: The research aims to reduce the parameter count and computational complexity of the SDXL model through knowledge distillation techniques, enhancing its practicality and deployability in various application scenarios.

### Specific Issues

1. **Model Compression**: How to effectively compress the SDXL model, significantly reducing its parameter count (up to 70%) while maintaining high image generation quality.
2. **Knowledge Retention**: How to ensure that the compressed model retains the key features and performance of the original model, especially in generating images under text conditions.
3. **Efficient Training**: Exploring efficient training strategies using multi-stage knowledge distillation methods to enable the compressed model to quickly converge to a performance level close to the original model.

### Method Overview

- **Compression Strategy**: Achieving model compression by removing specific residual networks and transformer blocks in the U-Net structure and using Layer Level Loss for training to finely retain important features.
- **Knowledge Distillation**: Employing output-level and feature-level knowledge distillation techniques to allow the smaller student model to learn from the larger teacher model, ensuring the quality of generated images.
- **Multi-Stage Training**: Using multiple pre-trained teacher models for phased training to gradually enhance the student model's capabilities.

### Main Contributions

1. **Model Compression**: Proposed two compressed versions, Segmind Stable Diffusion (SSD-1B) and Segmind-Vega, with U-Nets having 1.3B and 0.74B parameters respectively, achieving up to 70% reduction in parameter count.
2. **Knowledge Transfer**: Effectively retained the critical information of the original model through feature-level and output-level knowledge distillation, making the compressed model's image generation quality close to or even better than the original model.
3. **Practical Application Value**: The compressed model not only improved computational efficiency (up to 100% speed increase) but also received user preference in image quality evaluations, demonstrating its potential for practical deployment.

In summary, this research addresses the computational resource limitations of the SDXL model in practical applications through a series of innovative methods, providing new directions for the development of Text-to-Image generation technology.
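
To make the combination of output-level and feature-level distillation from the method overview more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes each term is a mean squared error between teacher and student tensors (flattened here to lists of floats) and that the terms are combined with scalar weights. The names `mse`, `distillation_loss`, `w_out`, and `w_feat` are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equally sized, non-empty flat vectors."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("vectors must be non-empty and have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(
    teacher_out: Sequence[float],
    student_out: Sequence[float],
    teacher_feats: Sequence[Sequence[float]],
    student_feats: Sequence[Sequence[float]],
    w_out: float = 1.0,   # hypothetical weight for the output-level term
    w_feat: float = 1.0,  # hypothetical weight for the feature-level term
) -> float:
    """Weighted sum of an output-level term and per-layer feature-level terms.

    Assumption: both kinds of term are MSE between teacher and student
    activations; the source only says that output-level and feature-level
    distillation are used, not how they are weighted.
    """
    if len(teacher_feats) != len(student_feats):
        raise ValueError("teacher and student must expose the same number of layers")
    # Output-level term: compare the final predictions of teacher and student.
    out_term = mse(teacher_out, student_out)
    # Feature-level term: compare intermediate activations layer by layer.
    feat_term = sum(mse(t, s) for t, s in zip(teacher_feats, student_feats))
    return w_out * out_term + w_feat * feat_term


# Toy example with two output values and two matched intermediate layers.
loss = distillation_loss(
    teacher_out=[1.0, 2.0],
    student_out=[1.0, 4.0],
    teacher_feats=[[0.0, 0.0], [1.0, 1.0]],
    student_feats=[[1.0, 1.0], [1.0, 3.0]],
    w_out=1.0,
    w_feat=0.5,
)
print(loss)  # 2.0 + 0.5 * (1.0 + 2.0) = 3.5
```

In a real training loop the lists would be tensors taken from matched U-Net blocks of the teacher and the pruned student, and the weights would be tuned hyperparameters. The sketch shows only how the layer-by-layer terms add up.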