SwiftBrush: One-Step Text-to-Image Diffusion Model with Variational Score Distillation

Thuan Hoang Nguyen, Anh Tran

2024-07-15

Abstract: Despite their ability to generate high-resolution and diverse images from text prompts, text-to-image diffusion models often suffer from slow iterative sampling processes. Model distillation is one of the most effective directions to accelerate these models. However, previous distillation methods fail to retain the generation quality while requiring a significant amount of images for training, either from real data or synthetically generated by the teacher model. In response to this limitation, we present a novel image-free distillation scheme named $\textbf{SwiftBrush}$. Drawing inspiration from text-to-3D synthesis, in which a 3D neural radiance field that aligns with the input prompt can be obtained from a 2D text-to-image diffusion prior via a specialized loss without the use of any 3D data ground-truth, our approach re-purposes that same loss for distilling a pretrained multi-step text-to-image model to a student network that can generate high-fidelity images with just a single inference step. In spite of its simplicity, our model stands as one of the first one-step text-to-image generators that can produce images of comparable quality to Stable Diffusion without reliance on any training image data. Remarkably, SwiftBrush achieves an FID score of $\textbf{16.67}$ and a CLIP score of $\textbf{0.29}$ on the COCO-30K benchmark, achieving competitive results or even substantially surpassing existing state-of-the-art distillation techniques.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

The paper aims to address the issue of slow iterative sampling processes in text-to-image diffusion models when generating high-quality images. Although these models can produce high-resolution and diverse images, their iterative sampling process results in slower inference speeds, limiting their deployment on consumer devices.

To tackle this problem, the researchers propose a new method called SwiftBrush. This method accelerates model inference through a novel distillation scheme, enabling the generation of high-quality images in a single inference step. The main features of SwiftBrush include:

1. **No Image Supervision Required**: Unlike previous methods that rely on large amounts of image data for training, SwiftBrush only requires text prompts for training, without any real image data.
2. **Efficiency**: SwiftBrush can generate high-fidelity images in a single inference step, significantly improving generation speed, approximately 20 times faster than existing Stable Diffusion models.
3. **High-Quality Generation Results**: Despite simplifying the training process, SwiftBrush can still generate images of comparable quality to Stable Diffusion and achieve competitive results in multiple benchmark tests.

Through these improvements, SwiftBrush not only enhances generation speed but also maintains high generation quality, thereby addressing the limitations of existing diffusion models in terms of inference speed.
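To make the image-free distillation idea more concrete, here is a minimal toy sketch of a variational-score-distillation-style update. It is not the authors' implementation. It rests on several assumptions that do not come from the paper summary: the "images" are 1D scalars, the teacher models a Gaussian target so its noise prediction has a closed form, the noising process is simply `x_t = x_0 + sigma * eps`, the auxiliary model that tracks the student's own output distribution is replaced by its exact closed form instead of a trained network, and the timestep weight is fixed to 1. All function and parameter names (`eps_gaussian`, `train_toy_vsd`, `target_mean`, and so on) are hypothetical.

```python
import numpy as np


def eps_gaussian(x_t, mean, std, sigma):
    """Exact noise prediction when x_0 ~ N(mean, std^2) and x_t = x_0 + sigma * eps."""
    return sigma * (x_t - mean) / (std**2 + sigma**2)


def train_toy_vsd(target_mean=2.0, target_std=0.5,
                  steps=2000, batch=512, lr=0.05, seed=0):
    """Fit a one-step generator x_0 = a + b * z without sampling the target."""
    rng = np.random.default_rng(seed)
    a, b = 0.0, 1.0  # student parameters (assumed linear generator)
    for _ in range(steps):
        z = rng.standard_normal(batch)   # latent noise fed to the student
        x0 = a + b * z                   # one-step student samples
        sigma = rng.uniform(0.2, 1.0)    # random noise level for this step
        x_t = x0 + sigma * rng.standard_normal(batch)

        # Frozen teacher: noise prediction under the target distribution.
        eps_teacher = eps_gaussian(x_t, target_mean, target_std, sigma)
        # Auxiliary model: noise prediction under the student's current
        # output distribution (closed form here, a trained network in practice).
        eps_aux = eps_gaussian(x_t, a, abs(b), sigma)

        # Score-distillation gradient with respect to x_0 (weight assumed 1),
        # pushed through the generator by the chain rule.
        g = eps_teacher - eps_aux
        a -= lr * np.mean(g)        # d x_0 / d a = 1
        b -= lr * np.mean(g * z)    # d x_0 / d b = z
    return a, b


if __name__ == "__main__":
    a, b = train_toy_vsd()
    print(f"learned mean={a:.3f}, learned std={abs(b):.3f}")
```

The training loop never draws a sample from the target distribution. The student only receives the difference between the two noise predictions, which mirrors the "no training images" property described above. In this toy setting the difference vanishes once the student's mean and standard deviation match the target's.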

SwiftBrush v2: Make Your One-step Diffusion Model Better Than Its Teacher

SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds

SNOOPI: Supercharged One-step Diffusion Distillation with Proper Guidance

InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation

SwiftDiffusion: Efficient Diffusion Model Serving with Add-on Modules

Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model

Diffusion Brush: A Latent Diffusion Model-based Editing Tool for AI-generated Images

Long and Short Guidance in Score identity Distillation for One-Step Text-to-Image Generation

Your Student is Better Than Expected: Adaptive Teacher-Student Collaboration for Text-Conditional Diffusion Models

DiffSketcher: Text Guided Vector Sketch Synthesis Through Latent Diffusion Models

Flash Diffusion: Accelerating Any Conditional Diffusion Model for Few Steps Image Generation

eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

One-Step Diffusion Distillation through Score Implicit Matching

Sketch-Guided Text-to-Image Diffusion Models

Improved Distribution Matching Distillation for Fast Image Synthesis

Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation

Diffusion Self-Distillation for Zero-Shot Customized Image Generation

UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models

One-step Diffusion with Distribution Matching Distillation

ScaleDreamer: Scalable Text-to-3D Synthesis with Asynchronous Score Distillation