Abstract: In this work, we address two limitations of existing conditional diffusion models: their slow inference speed due to the iterative denoising process and their reliance on paired data for model fine-tuning. To tackle these issues, we introduce a general method for adapting a single-step diffusion model to new tasks and domains through adversarial learning objectives. Specifically, we consolidate various modules of the vanilla latent diffusion model into a single end-to-end generator network with small trainable weights, enhancing its ability to preserve the input image structure while reducing overfitting. We demonstrate that, for unpaired settings, our model CycleGAN-Turbo outperforms existing GAN-based and diffusion-based methods for various scene translation tasks, such as day-to-night conversion and adding/removing weather effects like fog, snow, and rain. We extend our method to paired settings, where our model pix2pix-Turbo is on par with recent works like Control-Net for Sketch2Photo and Edge2Image, but with a single-step inference. This work suggests that single-step diffusion models can serve as strong backbones for a range of GAN learning objectives. Our code and models are available at

What problem does this paper attempt to address?

This paper addresses two main limitations of existing conditional diffusion models:

1. **Slow inference speed**: Existing conditional diffusion models rely on an iterative denoising process, so image generation is slow. This limits their use in real-time applications such as interactive Sketch2Photo.
2. **Dependence on paired data for fine-tuning**: These models usually require large paired datasets for training or fine-tuning, which is costly in many applications and infeasible in others.

To address these problems, the paper introduces a method for adapting a single-step diffusion model to new tasks and domains through adversarial learning objectives. Its main contributions are:

- **Module integration**: The separate modules of the conventional latent diffusion model (encoder, UNet, decoder) are consolidated into a single end-to-end generator network with a small number of trainable weights, which improves its ability to preserve the structure of the input image while reducing overfitting.
- **Adversarial learning**: With an adversarial learning objective, the model can be trained without paired data.
- **Preserving high-frequency details**: Skip connections and zero-conv layers between the encoder and decoder preserve the high-frequency details of the input image.
- **Single-step inference**: Inference is reduced from many denoising steps to a single step, which significantly improves inference speed; according to the paper, this does not sacrifice image quality.

The paper reports that this method outperforms existing GAN-based and diffusion-based methods in unpaired settings across various scene translation tasks, such as day-to-night conversion and adding or removing weather effects (fog, snow, rain).
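
The summary above does not spell out the objective used for unpaired training. For orientation only, the standard CycleGAN objective is sketched below; the model name CycleGAN-Turbo suggests a related formulation, but that is an assumption this page does not confirm. The symbols are illustrative and not taken from the paper: $G$ and $F$ are the two translation directions between domains $X$ and $Y$, $D_X$ and $D_Y$ are the discriminators, and $\lambda$ weights the cycle-consistency term.

```latex
\begin{aligned}
\mathcal{L}_{\text{cyc}}(G, F) &= \mathbb{E}_{x}\big[\lVert F(G(x)) - x \rVert_1\big]
                                + \mathbb{E}_{y}\big[\lVert G(F(y)) - y \rVert_1\big], \\
\mathcal{L}(G, F, D_X, D_Y)    &= \mathcal{L}_{\text{GAN}}(G, D_Y) + \mathcal{L}_{\text{GAN}}(F, D_X)
                                + \lambda\, \mathcal{L}_{\text{cyc}}(G, F).
\end{aligned}
```

The adversarial terms push translated images toward the target domain, while the cycle term requires that translating to the other domain and back recovers the input, which is what makes training possible without paired examples.
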
The paper also extends this method to paired settings, such as sketch to photo (Sketch2Photo) and edge to image (Edge2Image), where it achieves results comparable to recent work such as Control-Net while requiring only single-step inference.
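
The skip connections with zero-conv layers mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: it uses plain Python lists instead of a deep learning framework, and all names (`conv1x1`, `zero_conv_params`, `add_skip`) and values are hypothetical. It shows only the general idea of a zero-initialized 1x1 convolution: at initialization the skip branch adds nothing, so the pretrained decoder's behavior is unchanged, and encoder detail enters gradually as the weights are trained.

```python
# Illustrative sketch only (assumed names, not the paper's code).
# Feature maps are nested lists shaped [channels][height][width].

def conv1x1(features, weight, bias):
    """Apply a 1x1 convolution: mix channels independently at every pixel."""
    in_channels = len(features)
    height, width = len(features[0]), len(features[0][0])
    return [
        [
            [
                bias[o] + sum(weight[o][i] * features[i][y][x] for i in range(in_channels))
                for x in range(width)
            ]
            for y in range(height)
        ]
        for o in range(len(weight))
    ]

def zero_conv_params(in_channels, out_channels):
    """Return all-zero weights and biases, i.e. a 'zero-conv' at initialization."""
    weight = [[0.0] * in_channels for _ in range(out_channels)]
    bias = [0.0] * out_channels
    return weight, bias

def add_skip(decoder_features, encoder_features, weight, bias):
    """Add the 1x1-projected encoder features to the decoder features."""
    projected = conv1x1(encoder_features, weight, bias)
    return [
        [[d + p for d, p in zip(d_row, p_row)] for d_row, p_row in zip(d_ch, p_ch)]
        for d_ch, p_ch in zip(decoder_features, projected)
    ]

# Toy 2-channel, 2x2 feature maps (made-up values).
encoder_features = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
decoder_features = [[[0.5, 0.5], [0.5, 0.5]], [[1.0, 1.0], [1.0, 1.0]]]

weight, bias = zero_conv_params(in_channels=2, out_channels=2)

# At initialization the skip branch contributes nothing.
at_init = add_skip(decoder_features, encoder_features, weight, bias)

# Stand-in for a training update: one weight becomes non-zero,
# so encoder channel 0 starts to flow into decoder channel 0.
weight[0][0] = 0.1
after_update = add_skip(decoder_features, encoder_features, weight, bias)
```

Starting from zero is a common way to attach new branches to a pretrained network without disturbing its initial outputs; whether the paper uses exactly this layout is not stated in the summary.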

One-Step Image Translation with Text-to-Image Models
