Latent Inversion with Timestep-aware Sampling for Training-free Non-rigid Editing

Yunji Jung, Seokju Lee, Tair Djanibekov, Hyunjung Shim, Jong Chul Ye
2024-10-16
Abstract: Text-guided non-rigid editing involves complex edits for input images, such as changing motion or compositions within their surroundings. Since it requires manipulating the input structure, existing methods often struggle with preserving object identity and background, particularly when combined with Stable Diffusion. In this work, we propose a training-free approach for non-rigid editing with Stable Diffusion, aimed at improving the identity preservation quality without compromising editability. Our approach comprises three stages: text optimization, latent inversion, and timestep-aware text injection sampling. Inspired by the success of Imagic, we employ their text optimization for smooth editing. Then, we introduce latent inversion to preserve the input image's identity without additional model fine-tuning. To fully utilize the input reconstruction ability of latent inversion, we suggest timestep-aware text injection sampling. This effectively retains the structure of the input image by injecting the source text prompt in early sampling steps and then transitioning to the target prompt in subsequent sampling steps. This strategic approach seamlessly harmonizes with text optimization, facilitating complex non-rigid edits to the input without losing the original identity. We demonstrate the effectiveness of our method in terms of identity preservation, editability, and aesthetic quality through extensive experiments.
Computer Vision and Pattern Recognition
### What problem does this paper attempt to address?

This paper addresses how to improve the preservation of object identity without sacrificing editability when using Stable Diffusion for non-rigid editing. Non-rigid editing involves complex modifications to the pose or composition of objects in the input image, such as changing an object's motion or its arrangement within the surroundings, while keeping the background and the identity of the objects unchanged.

#### Main challenges

1. **Identity preservation**: Existing methods struggle to preserve object identity during non-rigid editing, especially when combined with Stable Diffusion.
2. **Over-fitting and color distortion**: Model fine-tuning may lead to over-fitting and color distortion, which degrades the edit.
3. **Structural distortion**: The attention mechanism may distort objects and compositional structure, producing unnatural images.

#### Solutions

To address these problems, the authors propose a training-free method that improves identity preservation while retaining editability. It has three stages:

1. **Text Optimization**: Following Imagic, use its text optimization to enable smooth editing.
2. **Latent Inversion**: Introduce latent inversion to preserve the identity of the input image without additional model fine-tuning.
3. **Timestep-aware Text Injection Sampling**: Inject the source text prompt in the early sampling steps and then transition to the target text prompt in subsequent steps, which retains the structure of the input image.

With these stages, the authors aim to preserve the identity of the original image while achieving the desired edit in complex non-rigid editing tasks.
The experimental results show that this method outperforms existing methods in terms of identity preservation, editability, and aesthetic quality.

### Formula presentation

- The objective function of text optimization:

\[
e_{\text{opt}} = \operatorname*{arg\,min}_{e_{\text{tgt}}} \mathbb{E}_{t, z_0, \epsilon}\left[\left\|\epsilon - f_\theta(z_t, t, e_{\text{tgt}})\right\|^2\right]
\]

- The timestep-aware text injection strategy:

\[
e_{\text{input}} =
\begin{cases}
e_{\text{src}}, & \text{if } n \leq t \leq T \\
e_{\text{int}}, & \text{otherwise}
\end{cases}
\]

These formulas define the text-optimization objective and the timestep-aware text injection rule. Under the second formula, the source embedding is used during the early sampling steps (\(n \leq t \leq T\)) and \(e_{\text{int}}\) afterwards, which supports identity preservation and structural consistency during editing.
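
The timestep-aware injection rule above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `select_text_embedding` and `build_schedule` are hypothetical, embeddings are stood in for by plain lists of floats, and the timestep range `T, T-1, ..., 0` is an assumption (the summary does not say whether sampling ends at 0 or 1). The only part taken from the formula is the selection condition `n <= t <= T`.

```python
from typing import List, Sequence, Tuple

Embedding = Sequence[float]  # stand-in for a text-embedding tensor (assumption)


def select_text_embedding(
    t: int, n: int, T: int, e_src: Embedding, e_int: Embedding
) -> Embedding:
    """Return e_src while n <= t <= T (early sampling steps), else e_int."""
    if not 0 <= n <= T:
        raise ValueError("threshold n must satisfy 0 <= n <= T")
    return e_src if n <= t <= T else e_int


def build_schedule(
    T: int, n: int, e_src: Embedding, e_int: Embedding
) -> List[Tuple[int, Embedding]]:
    """Pair each timestep with its embedding, in sampling order.

    Sampling runs from the noisiest timestep T down to 0 (assumed range),
    so the first entries of the schedule use the source embedding.
    """
    return [
        (t, select_text_embedding(t, n, T, e_src, e_int))
        for t in range(T, -1, -1)
    ]


if __name__ == "__main__":
    e_src = [1.0, 0.0]  # toy source-prompt embedding
    e_int = [0.0, 1.0]  # toy embedding used after the switch
    for t, e in build_schedule(T=5, n=3, e_src=e_src, e_int=e_int):
        print(t, "src" if e is e_src else "int")
```

In a real sampler the selected embedding would be passed to the denoising network as the text condition at each step. The threshold `n` then controls the trade-off the summary describes: a lower `n` keeps the source prompt active for more steps.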