TurboEdit: Text-Based Image Editing Using Few-Step Diffusion Models

Gilad Deutch, Rinon Gal, Daniel Garibi, Or Patashnik, Daniel Cohen-Or
2024-08-02
Abstract: Diffusion models have opened the path to a wide range of text-based image editing frameworks. However, these typically build on the multi-step nature of the diffusion backwards process, and adapting them to distilled, fast-sampling methods has proven surprisingly challenging. Here, we focus on a popular line of text-based editing frameworks: the "edit-friendly" DDPM-noise inversion approach. We analyze its application to fast sampling methods and categorize its failures into two classes: the appearance of visual artifacts, and insufficient editing strength. We trace the artifacts to mismatched noise statistics between inverted noises and the expected noise schedule, and suggest a shifted noise schedule which corrects for this offset. To increase editing strength, we propose a pseudo-guidance approach that efficiently increases the magnitude of edits without introducing new artifacts. All in all, our method enables text-based image editing with as few as three diffusion steps, while providing novel insights into the mechanisms behind popular text-based editing approaches.
Computer Vision and Pattern Recognition, Graphics
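The edit-friendly DDPM-noise inversion that the abstract builds on can be illustrated with a small, self-contained toy sketch. Everything concrete here is an assumption made for illustration: the 4-step `ALPHA_BAR` schedule, the Gaussian toy data, and `eps_model`, a hypothetical stand-in for a trained network (it is the exact noise predictor for Gaussian data, not SDXL-Turbo or any real model). The inversion draws each noisy x_t independently from q(x_t | x_0) and then solves the DDPM sampler equation for the noise map z_t, so re-running the sampler with those maps reconstructs the input. Printing per-step statistics of z_t shows that, in this toy, the inverted maps are not standard Gaussian (their standard deviation exceeds 1 at the noisier steps), which is the kind of mismatch the paper analyzes. The paper's shifted noise schedule itself is not reproduced here.

```python
import numpy as np

# Toy 4-step schedule (an assumption, not a real model's): ALPHA_BAR[0] = 1 is the
# clean image and ALPHA_BAR[-1] is the noisiest step.
ALPHA_BAR = np.array([1.0, 0.7, 0.4, 0.15, 0.03])
T = len(ALPHA_BAR) - 1
DATA_STD = 1.0  # toy data: i.i.d. N(0, DATA_STD**2) pixels


def eps_model(x_t, t):
    """Hypothetical stand-in for a trained noise predictor: the exact
    E[eps | x_t] for the Gaussian toy data (no real network is used)."""
    ab = ALPHA_BAR[t]
    return np.sqrt(1 - ab) / (ab * DATA_STD**2 + 1 - ab) * x_t


def ddpm_mean_and_sigma(x_t, t):
    """Mean and std of the ancestral DDPM step x_{t-1} = mu + sigma * z (sigma^2 = beta_t)."""
    alpha_t = ALPHA_BAR[t] / ALPHA_BAR[t - 1]
    beta_t = 1 - alpha_t
    mu = (x_t - beta_t / np.sqrt(1 - ALPHA_BAR[t]) * eps_model(x_t, t)) / np.sqrt(alpha_t)
    return mu, np.sqrt(beta_t)


def edit_friendly_inversion(x0, rng):
    """Draw each x_t independently from q(x_t | x0), then solve
    x_{t-1} = mu_t(x_t) + sigma_t * z_t for the noise maps z_t."""
    xs = [x0] + [np.sqrt(ALPHA_BAR[t]) * x0 + np.sqrt(1 - ALPHA_BAR[t]) * rng.standard_normal(x0.shape)
                 for t in range(1, T + 1)]
    zs = {}
    for t in range(T, 0, -1):
        mu, sigma = ddpm_mean_and_sigma(xs[t], t)
        zs[t] = (xs[t - 1] - mu) / sigma
    return xs[T], zs


def sample(x_T, zs):
    """Run the DDPM sampler from x_T, injecting the stored noise maps."""
    x = x_T
    for t in range(T, 0, -1):
        mu, sigma = ddpm_mean_and_sigma(x, t)
        x = mu + sigma * zs[t]
    return x


rng = np.random.default_rng(0)
x0 = DATA_STD * rng.standard_normal((64, 64))
x_T, zs = edit_friendly_inversion(x0, rng)
print("max reconstruction error:", np.abs(sample(x_T, zs) - x0).max())
for t in sorted(zs):
    print(f"t={t}: mean={zs[t].mean():+.3f}, std={zs[t].std():.3f}")
```

In this toy, the excess variance comes from drawing x_t and x_{t-1} independently instead of from one coupled forward chain, so each z_t has to absorb both noises.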
What problem does this paper attempt to address?
This paper addresses the problems that arise when fast-sampling (few-step) diffusion models are used for text-guided image editing, specifically visual artifacts and insufficient editing strength.

### Problems Addressed

1. **Visual Artifacts**: When text-guided image editing methods designed around the multi-step diffusion process are applied to fast-sampling diffusion models (e.g., 1-8 steps), they produce noticeable visual artifacts.
2. **Insufficient Editing Strength**: With fast sampling, even when the text prompt describing the image is changed, the edited result often corresponds only weakly to the new prompt; that is, the edit is too subtle.

### Method Overview

To address these challenges, the authors propose the following:

- **Analyzing the Statistics of Noise Inversion**: The authors find that the noise maps obtained through edit-friendly DDPM noise inversion deviate from the statistics of standard Gaussian noise expected by the noise schedule. They propose a timestep offset (a shifted noise schedule) that corrects this deviation, reducing or eliminating visual artifacts.
- **Strengthening Edits (Pseudo-Guidance)**: Similar to Classifier-Free Guidance (CFG), the method amplifies a key term in the editing update to increase the influence of the editing prompt on the final image, but it requires fewer network evaluations than CFG.
- **Equivalence Analysis**: The authors also analyze the equivalence between their method and the Delta Denoising Score (DDS) method. This analysis explains why both methods succeed and reveals how editing efficiency can be improved further.

### Experimental Results

The experiments demonstrate the method's effectiveness through both qualitative and quantitative results. Compared with various baselines, the method edits faster (with as few as three diffusion steps) while maintaining or even improving editing quality.
Additionally, an ablation comparing the effects of the individual components confirms the importance of each component of the proposed method.
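The pseudo-guidance idea can be illustrated, under stated assumptions, by rewriting the edit-friendly editing step. Editing runs the sampler with the target prompt while reusing the inverted noise maps. Since z_t is defined through x_{t-1}^src = mu_t(x_t^src, c_src) + sigma_t z_t, each edit step equals x_{t-1}^src + (mu_t(x_t^edit, c_tgt) - mu_t(x_t^src, c_src)). The sketch below scales that whole difference by a hypothetical factor `w`. This is a simplified stand-in chosen for illustration, not the paper's exact pseudo-guidance formula. Because the source-prompt means are cached during inversion, each editing step needs one model call, whereas CFG needs a conditional and an unconditional prediction per step. The schedule, the Gaussian toy data, and `eps_model` (in which a "prompt" is simply a mean image) are all hypothetical.

```python
import numpy as np

ALPHA_BAR = np.array([1.0, 0.7, 0.4, 0.15, 0.03])  # toy 4-step schedule (assumption)
T = len(ALPHA_BAR) - 1


def eps_model(x_t, t, prompt_mean, data_std=1.0):
    """Hypothetical text-conditioned noise predictor: exact E[eps | x_t] when the
    'prompt' selects Gaussian data N(prompt_mean, data_std**2)."""
    ab = ALPHA_BAR[t]
    return np.sqrt(1 - ab) / (ab * data_std**2 + 1 - ab) * (x_t - np.sqrt(ab) * prompt_mean)


def ddpm_mean_and_sigma(x_t, t, prompt_mean):
    """Mean and std of the ancestral DDPM step (sigma_t^2 = beta_t)."""
    alpha_t = ALPHA_BAR[t] / ALPHA_BAR[t - 1]
    beta_t = 1 - alpha_t
    eps = eps_model(x_t, t, prompt_mean)
    return (x_t - beta_t / np.sqrt(1 - ALPHA_BAR[t]) * eps) / np.sqrt(alpha_t), np.sqrt(beta_t)


def edit(x0, src_mean, tgt_mean, w, rng):
    """Edit-friendly inversion with the source prompt, then editing with the target
    prompt. w scales the per-step change; w = 1 is plain edit-friendly editing."""
    # Independent noisy versions x_1..x_T of the input (edit-friendly inversion).
    xs = [x0] + [np.sqrt(ALPHA_BAR[t]) * x0 + np.sqrt(1 - ALPHA_BAR[t]) * rng.standard_normal(x0.shape)
                 for t in range(1, T + 1)]
    # Cache the source-prompt means; the noise maps are implied by
    # x_{t-1} = mu_src + sigma_t * z_t, so they need not be stored separately.
    mu_src = {t: ddpm_mean_and_sigma(xs[t], t, src_mean)[0] for t in range(1, T + 1)}
    x = xs[T]
    for t in range(T, 0, -1):
        mu_tgt, _ = ddpm_mean_and_sigma(x, t, tgt_mean)  # one model call per step
        # x_{t-1} = x_{t-1}^src + w * (mu_t(x_t, tgt) - mu_t(x_t^src, src))
        x = xs[t - 1] + w * (mu_tgt - mu_src[t])
    return x


rng = np.random.default_rng(0)
x0 = rng.standard_normal((64, 64))
src, tgt = np.zeros_like(x0), np.full_like(x0, 0.5)
for w in (1.0, 1.2):
    out = edit(x0, src, tgt, w, rng)
    print(f"w={w}: mean shift toward target = {np.mean(out - x0):.3f}")
```

With `w = 1` the function reduces to plain edit-friendly editing, and with identical source and target prompts it returns the input exactly for any `w`. In this linear toy, the mean shift toward the target grows with `w`.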