Abstract: Text-guided non-rigid editing involves complex edits for input images, such as changing motion or compositions within their surroundings. Since it requires manipulating the input structure, existing methods often struggle with preserving object identity and background, particularly when combined with Stable Diffusion. In this work, we propose a training-free approach for non-rigid editing with Stable Diffusion, aimed at improving the identity preservation quality without compromising editability. Our approach comprises three stages: text optimization, latent inversion, and timestep-aware text injection sampling. Inspired by the success of Imagic, we employ their text optimization for smooth editing. Then, we introduce latent inversion to preserve the input image's identity without additional model fine-tuning. To fully utilize the input reconstruction ability of latent inversion, we suggest timestep-aware text injection sampling. This effectively retains the structure of the input image by injecting the source text prompt in early sampling steps and then transitioning to the target prompt in subsequent sampling steps. This strategic approach seamlessly harmonizes with text optimization, facilitating complex non-rigid edits to the input without losing the original identity. We demonstrate the effectiveness of our method in terms of identity preservation, editability, and aesthetic quality through extensive experiments.

What problem does this paper attempt to address?

### What problems does this paper attempt to solve?

This paper asks how to improve the preservation of object identity without sacrificing editability when using Stable Diffusion for non-rigid editing. Non-rigid editing involves complex modifications to the pose or composition of objects in the input image, such as changing an object's action or its arrangement within the surroundings, while keeping the background and the identity of the objects unchanged.

#### Main challenges:

1. **Identity preservation problem**: Existing methods struggle to preserve the identity of objects under non-rigid editing, especially in combination with Stable Diffusion.
2. **Overfitting and color distortion**: Model fine-tuning may lead to overfitting and color distortion, degrading the edited result.
3. **Structural distortion**: The attention mechanism may distort objects and the composition structure, resulting in unnatural images.

#### Solutions:

To address these problems, the authors propose a training-free method that aims for better identity preservation and editability through the following three stages:

1. **Text Optimization**: Following Imagic, use its text optimization to achieve smooth editing.
2. **Latent Inversion**: Introduce latent inversion to preserve the identity of the input image without additional model fine-tuning.
3. **Timestep-aware Text Injection Sampling**: Inject the source text prompt in the early sampling steps and then transition to the target text prompt in the subsequent steps, thereby retaining the structure of the input image.

With these stages, the authors aim to preserve the identity of the original image while achieving the desired edit in complex non-rigid editing tasks. According to the paper's experiments, the method outperforms existing methods in terms of identity preservation, editability, and aesthetic quality.

### Formula presentation:

- The objective function of text optimization:

\[ e_{\text{opt}} = \operatorname*{arg\,min}_{e_{\text{tgt}}} \mathbb{E}_{t, z_0, \epsilon}\left[\left\|\epsilon - f_\theta(z_t, t, e_{\text{tgt}})\right\|^2\right] \]

- The timestep-aware text injection strategy:

\[ e_{\text{input}} = \begin{cases} e_{\text{src}}, & \text{if } n \leq t \leq T \\ e_{\text{int}}, & \text{otherwise} \end{cases} \]

These formulas specify how text optimization and timestep-aware text injection are carried out; the two together are intended to maintain identity preservation and structural consistency during editing.
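The timestep-aware text injection rule above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `select_embedding` and `injection_schedule` are hypothetical, the embeddings are replaced by string labels, and the sketch assumes sampling runs from `t = T` down to `t = 1`, so that "early sampling steps" correspond to large `t`.

```python
from typing import List, TypeVar

E = TypeVar("E")  # stands in for whatever type the text embeddings have


def select_embedding(t: int, n: int, T: int, e_src: E, e_int: E) -> E:
    """Return e_src when n <= t <= T, and e_int otherwise.

    This mirrors the piecewise definition of e_input; n is the
    switch-over timestep and T the largest timestep.
    """
    return e_src if n <= t <= T else e_int


def injection_schedule(T: int, n: int) -> List[str]:
    """List which embedding each sampling step would use.

    Assumption: the sampler visits t = T, T-1, ..., 1 in that order.
    The labels "src" and "int" are placeholders for e_src and e_int.
    """
    return [select_embedding(t, n, T, "src", "int") for t in range(T, 0, -1)]


# Toy values chosen only for illustration.
schedule = injection_schedule(T=10, n=7)
print(schedule)
```

With these toy values the first four steps (`t = 10, 9, 8, 7`) use the source embedding and the remaining six use `e_int`. In a real sampler the returned embedding would be passed as the text conditioning to the denoiser at each step; a larger `n` shortens the source-prompt phase, and a smaller `n` lengthens it.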

Latent Inversion with Timestep-aware Sampling for Training-free Non-rigid Editing
