Abstract: Large-scale text-to-image (T2I) diffusion models have been a ground-breaking development in generating convincing images from an input text prompt. The goal of image editing research is to give users control over the generated images by modifying the text prompt. Current image editing techniques predominantly rely on DDIM inversion, a prevalent practice rooted in Latent Diffusion Models (LDM). However, large pretrained T2I models that work in the latent space lose details because of the first compression stage, which uses an autoencoder. In contrast, the other mainstream T2I pipelines, which work at the pixel level, such as Imagen and DeepFloyd-IF, circumvent this problem. They are commonly composed of multiple stages, typically a text-to-image stage followed by several super-resolution stages. In this pipeline, DDIM inversion fails to find the initial noise and regenerate the original image, because the super-resolution diffusion models are not compatible with the DDIM technique. According to our experimental findings, iteratively concatenating the noisy image as the condition is the root of this problem. Based on this observation, we develop an iterative inversion (IterInv) technique for this category of T2I models and verify IterInv with the open-source DeepFloyd-IF model. Specifically, IterInv employs null-text inversion (NTI) for the inversion and reconstruction of low-resolution image generation. In stages 2 and 3, we update the latent variance at each timestep to find the deterministic inversion trace and promote the reconstruction process. By combining our method with a popular image editing method, we demonstrate the application prospects of IterInv. The code is available at https://github.com/Tchuanm/IterInv.git.

What problem does this paper attempt to address?

The paper addresses image editing with text-to-image (T2I) models, and in particular inversion for pixel-level T2I models. Existing methods based on Latent Diffusion Models (LDM) tend to lose details in the autoencoder compression stage when handling high-resolution images, whereas models that generate images directly in pixel space (such as Imagen and DeepFloyd-IF) avoid this problem. However, the use of pixel-level models for image editing has not been fully explored. The main contribution of the paper is an iterative inversion (IterInv) technique that addresses the failure of DDIM inversion in pixel-level T2I models. Specifically, when DDIM inversion is applied to the DeepFloyd-IF model, concatenating the noisy image as a condition in the super-resolution diffusion models prevents accurate reconstruction of the original image. IterInv tracks the diffusion process through iterative optimization to approximate the real image, and it does so at each stage of the multi-stage pipeline. Experimental results show that IterInv significantly outperforms traditional methods in terms of image reconstruction quality and subsequent editing capabilities. Additionally, the paper demonstrates the application prospects of IterInv by combining it with the DiffEdit method to achieve text-guided image editing. Although IterInv has so far been studied only with the open-source DeepFloyd-IF model, its performance suggests that its applicability could be extended to other models in the future.
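
For readers unfamiliar with DDIM inversion, the following is a minimal sketch of the generic deterministic DDIM update and its inversion, assuming a toy setting. It is not IterInv and does not involve DeepFloyd-IF: the noise predictors `constant_eps` and `state_eps`, the linear `alpha_bar` schedule, and the 16-element "image" are all hypothetical stand-ins chosen for illustration. The sketch only shows that the round trip is exact when the predicted noise does not depend on the current sample, and becomes approximate as soon as it does, which is the kind of residual error that inversion methods try to reduce.

```python
import numpy as np

def ddim_step(x, eps, a_from, a_to):
    """Deterministic DDIM update (eta = 0) between two cumulative-alpha levels."""
    # Predict the clean sample from the current sample and the noise estimate.
    x0_pred = (x - np.sqrt(1.0 - a_from) * eps) / np.sqrt(a_from)
    # Re-noise the prediction to the target level using the same noise estimate.
    return np.sqrt(a_to) * x0_pred + np.sqrt(1.0 - a_to) * eps

def invert(x0, eps_fn, alpha_bar):
    """DDIM inversion: walk from the clean sample toward noise.

    Each step reuses the noise predicted at the current level as an
    approximation of the noise at the next level.
    """
    x = x0
    for t in range(len(alpha_bar) - 1):
        x = ddim_step(x, eps_fn(x, t), alpha_bar[t], alpha_bar[t + 1])
    return x

def reconstruct(x_T, eps_fn, alpha_bar):
    """Ordinary DDIM sampling: walk from noise back to the clean sample."""
    x = x_T
    for t in range(len(alpha_bar) - 1, 0, -1):
        x = ddim_step(x, eps_fn(x, t), alpha_bar[t], alpha_bar[t - 1])
    return x

def round_trip_error(x0, eps_fn, alpha_bar):
    """Largest absolute difference between x0 and its inverted-then-sampled copy."""
    x_T = invert(x0, eps_fn, alpha_bar)
    return float(np.max(np.abs(reconstruct(x_T, eps_fn, alpha_bar) - x0)))

# Hypothetical schedule: index 0 is almost clean, the last index is mostly noise.
alpha_bar = np.linspace(0.9999, 0.02, 51)
rng = np.random.default_rng(0)
x0 = rng.standard_normal(16)  # toy stand-in for an image

# Hypothetical noise predictors (a real model would be a trained network).
constant_eps = lambda x, t: np.full_like(x, 0.3)  # ignores the sample
state_eps = lambda x, t: 0.1 * np.tanh(x)         # depends on the sample

err_constant = round_trip_error(x0, constant_eps, alpha_bar)
err_state = round_trip_error(x0, state_eps, alpha_bar)
print("round-trip error, constant predictor:", err_constant)
print("round-trip error, sample-dependent predictor:", err_state)
```

In this toy setting the constant predictor round-trips up to floating-point error, while the sample-dependent predictor leaves a residual because the noise estimate used during inversion is taken at a different sample than the one used during sampling. How this interacts with the noisy-image condition in super-resolution stages is specific to the paper and is not modeled here.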

IterInv: Iterative Inversion for Pixel-Level T2I Models

LocInv: Localization-aware Inversion for Text-Guided Image Editing

Task-Oriented Diffusion Inversion for High-Fidelity Text-based Editing

EDICT: Exact Diffusion Inversion via Coupled Transformations

SimInversion: A Simple Framework for Inversion-Based Text-to-Image Editing

TiNO-Edit: Timestep and Noise Optimization for Robust Diffusion-Based Image Editing

Prompt Tuning Inversion for Text-Driven Image Editing Using Diffusion Models

TurboEdit: Instant text-based image editing

Eta Inversion: Designing an Optimal Eta Function for Diffusion-based Real Image Editing

Null-text Inversion for Editing Real Images using Guided Diffusion Models

Source Prompt Disentangled Inversion for Boosting Image Editability with Diffusion Models

Lightning-Fast Image Inversion and Editing for Text-to-Image Diffusion Models

Inversion-Free Image Editing with Natural Language

Oscillation Inversion: Understand the structure of Large Flow Model through the Lens of Inversion Method

Direct Inversion: Boosting Diffusion-based Editing with 3 Lines of Code

Latent Inversion with Timestep-aware Sampling for Training-free Non-rigid Editing

ReNoise: Real Image Inversion Through Iterative Noising

T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models

Regularized Newton Raphson Inversion for Text-to-Image Diffusion Models

EasyInv: Toward Fast and Better DDIM Inversion

FBSDiff: Plug-and-Play Frequency Band Substitution of Diffusion Features for Highly Controllable Text-Driven Image Translation