Abstract: Stable Diffusion XL (SDXL) has become the best open-source text-to-image (T2I) model for its versatility and top-notch image quality. Efficiently addressing the computational demands of SDXL models is crucial for wider reach and applicability. In this work, we introduce two scaled-down variants, Segmind Stable Diffusion (SSD-1B) and Segmind-Vega, with 1.3B and 0.74B parameter UNets, respectively, achieved through progressive removal using layer-level losses, focusing on reducing the model size while preserving generative quality. We release these models' weights at
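For a sense of scale, the two UNet sizes quoted in the abstract can be turned into relative reductions. The sketch below is only illustrative arithmetic: the 2.6B baseline for the original SDXL UNet is an assumption taken from the SDXL paper and is not stated in the text above, and `reduction_pct` is a hypothetical helper name.

```python
def reduction_pct(student_params_b: float, teacher_params_b: float) -> float:
    """Percentage of parameters removed relative to the teacher model."""
    return 100.0 * (1.0 - student_params_b / teacher_params_b)


# Assumption: the SDXL UNet has roughly 2.6B parameters (not given in this text).
SDXL_UNET_PARAMS_B = 2.6

# UNet sizes as quoted in the abstract, in billions of parameters.
variants = {"SSD-1B": 1.3, "Segmind-Vega": 0.74}

for name, params_b in variants.items():
    pct = reduction_pct(params_b, SDXL_UNET_PARAMS_B)
    print(f"{name}: {pct:.1f}% fewer UNet parameters than SDXL")
```

Under that assumed baseline, SSD-1B removes about half of the UNet parameters and Segmind-Vega a little over 70%.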

What problem does this paper attempt to address?

The paper mainly addresses the following issues:

### Research Background and Objectives

- **Background**: Stable Diffusion XL (SDXL), as an advanced Text-to-Image (T2I) generation model, excels in image quality and diversity. However, its large model size limits its application in resource-constrained environments.
- **Objective**: The research aims to reduce the parameter count and computational complexity of the SDXL model through knowledge distillation techniques, enhancing its practicality and deployability in various application scenarios.

### Specific Issues

1. **Model Compression**: How to effectively compress the SDXL model, significantly reducing its parameter count (up to 70%) while maintaining high image generation quality.
2. **Knowledge Retention**: How to ensure that the compressed model retains the key features and performance of the original model, especially in generating images under text conditions.
3. **Efficient Training**: Exploring efficient training strategies using multi-stage knowledge distillation methods to enable the compressed model to quickly converge to a performance level close to the original model.

### Method Overview

- **Compression Strategy**: Achieving model compression by removing specific residual networks and transformer blocks in the U-Net structure and using Layer Level Loss for training to finely retain important features.
- **Knowledge Distillation**: Employing output-level and feature-level knowledge distillation techniques to allow the smaller student model to learn from the larger teacher model, ensuring the quality of generated images.
- **Multi-Stage Training**: Using multiple pre-trained teacher models for phased training to gradually enhance the student model's capabilities.

### Main Contributions

1. **Model Compression**: Proposed two compressed versions, Segmind Stable Diffusion (SSD-1B) and Segmind-Vega, with U-Nets having 1.3B and 0.74B parameters respectively, achieving up to 70% reduction in parameter count.
2. **Knowledge Transfer**: Effectively retained the critical information of the original model through feature-level and output-level knowledge distillation, making the compressed model's image generation quality close to or even better than the original model.
3. **Practical Application Value**: The compressed model not only improved computational efficiency (up to 100% speed increase) but also received user preference in image quality evaluations, demonstrating its potential for practical deployment.

In summary, this research addresses the computational resource limitations of the SDXL model in practical applications through a series of innovative methods, providing new directions for the development of Text-to-Image generation technology.
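The combination of output-level and feature-level distillation mentioned in the Method Overview can be sketched as a weighted sum of mean-squared errors. This is a minimal illustration under stated assumptions, not the authors' implementation: the task term against the true noise, the weights `w_out` and `w_feat`, and the function names are all hypothetical, and plain Python lists stand in for the tensors a real training loop would use.

```python
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean-squared error between two equally sized flat vectors."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(
    target_noise: Sequence[float],
    student_out: Sequence[float],
    teacher_out: Sequence[float],
    student_feats: Sequence[Sequence[float]],
    teacher_feats: Sequence[Sequence[float]],
    w_out: float = 1.0,   # hypothetical weight for the output-level term
    w_feat: float = 1.0,  # hypothetical weight for the feature-level term
) -> float:
    """Task loss + output-level KD + feature-level (per-layer) KD."""
    if len(student_feats) != len(teacher_feats):
        raise ValueError("student and teacher must expose the same number of layers")
    # Task term: the student predicts the noise that was added (assumed objective).
    task = mse(student_out, target_noise)
    # Output-level KD: the student matches the teacher's final prediction.
    out_kd = mse(student_out, teacher_out)
    # Feature-level KD: the student matches teacher activations layer by layer.
    feat_kd = sum(mse(s, t) for s, t in zip(student_feats, teacher_feats))
    return task + w_out * out_kd + w_feat * feat_kd


# Toy usage with one retained layer; the values are illustrative only.
loss = distillation_loss(
    target_noise=[0.0, 1.0],
    student_out=[0.0, 1.0],
    teacher_out=[0.0, 2.0],
    student_feats=[[1.0, 1.0]],
    teacher_feats=[[1.0, 3.0]],
)
print(loss)
```

Summing the feature term over matched layers is what lets removed blocks be supervised by the teacher activations at the surviving positions. In a multi-stage setup, `teacher_out` and `teacher_feats` would simply come from a different teacher in each phase.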

Progressive Knowledge Distillation Of Stable Diffusion XL Using Layer Level Loss
