One-Step Image Translation with Text-to-Image Models

Gaurav Parmar, Taesung Park, Srinivasa Narasimhan, Jun-Yan Zhu
2024-03-19
Abstract: In this work, we address two limitations of existing conditional diffusion models: their slow inference speed due to the iterative denoising process and their reliance on paired data for model fine-tuning. To tackle these issues, we introduce a general method for adapting a single-step diffusion model to new tasks and domains through adversarial learning objectives. Specifically, we consolidate various modules of the vanilla latent diffusion model into a single end-to-end generator network with small trainable weights, enhancing its ability to preserve the input image structure while reducing overfitting. We demonstrate that, for unpaired settings, our model CycleGAN-Turbo outperforms existing GAN-based and diffusion-based methods for various scene translation tasks, such as day-to-night conversion and adding/removing weather effects like fog, snow, and rain. We extend our method to paired settings, where our model pix2pix-Turbo is on par with recent works like Control-Net for Sketch2Photo and Edge2Image, but with a single-step inference. This work suggests that single-step diffusion models can serve as strong backbones for a range of GAN learning objectives. Our code and models are available at
Computer Vision and Pattern Recognition, Graphics, Machine Learning
What problem does this paper attempt to address?
This paper addresses two main limitations of existing conditional diffusion models:

1. **Slow inference speed**: Existing conditional diffusion models rely on an iterative denoising process, so they generate images slowly. This limits their use in real-time applications such as interactive Sketch2Photo.
2. **Dependence on paired data for fine-tuning**: These models usually require large paired datasets for training or fine-tuning, which is costly in many application scenarios and infeasible in others.

To solve these problems, the paper introduces a method for adapting a single-step diffusion model to new tasks and domains through adversarial learning objectives. Specifically, the paper proposes the following:

- **Module integration**: The separate modules of the conventional latent diffusion model (encoder, UNet, decoder) are consolidated into a single end-to-end generator network with a small number of trainable weights, which improves its ability to preserve the input image structure while reducing overfitting.
- **Adversarial learning**: With adversarial learning objectives, the model can be trained without paired data.
- **Preserving high-frequency details**: Skip connections and zero-conv layers between the encoder and decoder preserve the high-frequency details of the input image.
- **Single-step inference**: Inference is reduced from multiple denoising steps to a single step without sacrificing image quality, which significantly improves inference speed.

The paper shows that, in unpaired settings, this method outperforms existing GAN-based and diffusion-based methods on various scene translation tasks, such as day-to-night conversion and adding or removing weather effects (fog, snow, rain).
In addition, the paper extends this method to paired settings, such as sketch-to-photo (Sketch2Photo) and edge-to-image (Edge2Image) translation, where it achieves results comparable to recent work on these tasks but with faster inference.
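
The zero-conv skip connections mentioned above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the helper names (`conv1x1`, `zero_conv_params`, `add_skip`), the list-based feature layout, and the toy values are all assumptions made for illustration. The point it shows is that a 1x1 convolution initialized to zero contributes nothing at the start of training, so the decoder initially behaves exactly as the pretrained one, and encoder detail is only mixed in once the weights move away from zero.

```python
# Hypothetical sketch of a zero-initialized 1x1 convolution on a skip connection.
# A feature map is a list of channels; each channel is a flat list of pixel values.

def conv1x1(features, weight, bias):
    """1x1 convolution: every output channel is a per-pixel weighted sum of input channels."""
    n_pixels = len(features[0])
    out = []
    for w_row, b in zip(weight, bias):
        channel = [
            b + sum(w * features[c][p] for c, w in enumerate(w_row))
            for p in range(n_pixels)
        ]
        out.append(channel)
    return out

def zero_conv_params(out_channels, in_channels):
    """Return weights and biases that are all zero (the 'zero-conv' initialization)."""
    weight = [[0.0] * in_channels for _ in range(out_channels)]
    bias = [0.0] * out_channels
    return weight, bias

def add_skip(decoder_features, encoder_features, weight, bias):
    """Add the zero-conv-projected encoder features to the decoder features."""
    skip = conv1x1(encoder_features, weight, bias)
    return [
        [d + s for d, s in zip(d_ch, s_ch)]
        for d_ch, s_ch in zip(decoder_features, skip)
    ]

# Toy data (assumed): 2 channels, 3 pixels each.
encoder_features = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
decoder_features = [[0.5, 0.5, 0.5], [1.0, 1.0, 1.0]]

weight, bias = zero_conv_params(out_channels=2, in_channels=2)

# At initialization the skip path is silent: the output equals the decoder features.
at_init = add_skip(decoder_features, encoder_features, weight, bias)

# Simulate one (hypothetical) training update to a single weight.
weight[0][0] = 0.1
after_update = add_skip(decoder_features, encoder_features, weight, bias)
```

Here `at_init` matches `decoder_features` exactly, while `after_update` differs only in the first output channel, where a fraction of the first encoder channel now passes through. Starting from zero in this way lets fine-tuning add input detail gradually instead of disturbing the pretrained decoder from the first step.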