Abstract: Leveraging the large generative prior of the flow transformer for tuning-free image editing requires authentic inversion to project the image into the model's domain and a flexible invariance control mechanism to preserve non-target contents. However, the prevailing diffusion inversion performs deficiently in flow-based models, and the invariance control cannot reconcile diverse rigid and non-rigid editing tasks. To address these, we systematically analyze the inversion and invariance control based on the flow transformer. Specifically, we unveil that the Euler inversion shares a similar structure to DDIM yet is more susceptible to the approximation error. Thus, we propose a two-stage inversion to first refine the velocity estimation and then compensate for the leftover error, which pivots closely to the model prior and benefits editing. Meanwhile, we propose the invariance control that manipulates the text features within the adaptive layer normalization, connecting the changes in the text prompt to image semantics. This mechanism can simultaneously preserve the non-target contents while allowing rigid and non-rigid manipulation, enabling a wide range of editing types such as visual text, quantity, facial expression, etc. Experiments on versatile scenarios validate that our framework achieves flexible and accurate editing, unlocking the potential of the flow transformer for versatile image editing.

What problem does this paper attempt to address?

The paper addresses two problems in flow-transformer-based image editing: how to invert an image accurately, and how to control which content stays invariant during editing. Specifically:

1. **Inversion problem**
   - **Existing challenges**: Inversion methods designed for diffusion models (such as DDIM inversion) perform poorly in flow models. With the Euler sampler in particular, they are sensitive to approximation error, so the inverted result deviates from the original image and the edit quality suffers.
   - **Solutions**: The authors propose a two-stage inversion method. The first stage reduces the approximation error of the velocity estimate through fixed-point iteration. The second stage adds a small compensation term at each denoising step so that the original image is restored accurately, which in turn improves editing.
2. **Invariance control problem**
   - **Existing challenges**: Existing attention-based control (through self-attention and cross-attention) struggles to reconcile rigid and non-rigid editing tasks. In particular, when an edit changes layout, quantity, or pose, it cannot preserve non-target content at the same time.
   - **Solutions**: The authors propose a flexible invariance control mechanism based on Adaptive Layer Normalization (AdaLN). By replacing unedited text features, it allows both rigid and non-rigid edits while preserving non-target content, which enables diverse editing types.
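
The two-stage inversion described in item 1 can be illustrated on a toy problem. The following is a minimal sketch, not the authors' implementation: the scalar state, the `velocity` function (a stand-in for the learned model), the uniform sigma schedule, and all function names are assumptions made for illustration. It follows the summary's indexing, in which a denoising step moves from index `t` to index `t + 1`.

```python
import math


def velocity(x, sigma):
    # Hypothetical stand-in for the learned velocity field v_theta.
    return math.sin(3.0 * x) + 0.5 * sigma


def invert_step(x_next, t, sigmas, num_iters):
    """Stage 1: estimate x_t from x_{t+1} by fixed-point iteration.

    With num_iters=1 this reduces to the plain approximate Euler inversion,
    which evaluates the velocity at x_{t+1} instead of the unknown x_t.
    """
    x = x_next  # initial guess x_t^0 = x_{t+1}
    for _ in range(num_iters):
        x = x_next + (sigmas[t] - sigmas[t + 1]) * velocity(x, sigmas[t])
    return x


def invert(image, sigmas, num_iters):
    """Run stage 1 over the whole schedule; xs[-1] is the image, xs[0] the inverted state."""
    n = len(sigmas) - 1
    xs = [0.0] * (n + 1)
    xs[n] = image
    for t in range(n - 1, -1, -1):
        xs[t] = invert_step(xs[t + 1], t, sigmas, num_iters)
    return xs


def compensations(xs, sigmas):
    """Stage 2: record the leftover error eps_t = x_{t+1} - x_hat_{t+1} of each denoising step."""
    eps = []
    for t in range(len(sigmas) - 1):
        x_hat = xs[t] + (sigmas[t + 1] - sigmas[t]) * velocity(xs[t], sigmas[t])
        eps.append(xs[t + 1] - x_hat)
    return eps


def reconstruct(x_start, sigmas, eps):
    """Denoise from the inverted state, adding the stored compensation at every step."""
    x = x_start
    for t in range(len(sigmas) - 1):
        x = x + (sigmas[t + 1] - sigmas[t]) * velocity(x, sigmas[t]) + eps[t]
    return x


if __name__ == "__main__":
    sigmas = [1.0 - i / 10 for i in range(11)]  # assumed schedule from 1.0 down to 0.0
    image = 0.7
    for iters in (1, 5):
        xs = invert(image, sigmas, iters)
        eps = compensations(xs, sigmas)
        print(iters, max(abs(e) for e in eps), reconstruct(xs[0], sigmas, eps))
```

In this toy setting, more fixed-point iterations shrink the compensation terms, so the denoising trajectory relies less on corrections and more on the velocity field itself. That is one way to read the abstract's remark that the inversion stays close to the model prior. The fixed-point loop only converges here because the step size times the slope of the toy velocity stays below 1; whether that holds for a real model is not something this sketch shows.
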
### Formula summary

- **Inversion formula**:
  \[ x_t = x_{t+1} + (\sigma_t - \sigma_{t+1})\, v_\theta(x_t, t) \quad \text{(ideal case)} \]
  \[ x_t \approx x_{t+1} + (\sigma_t - \sigma_{t+1})\, v_\theta(x_{t+1}, t) \quad \text{(approximate case)} \]
- **Fixed-point iteration formula**:
  \[ x^{i+1}_t = x_{t+1} + (\sigma_t - \sigma_{t+1})\, v_\theta(x^i_t, t) \]
- **Velocity compensation formula**:
  \[ \hat{x}_{t+1} = x_t + (\sigma_{t+1} - \sigma_t)\, v_\theta(x_t, t) \]
  \[ \epsilon_t = x_{t+1} - \hat{x}_{t+1} \]
- **Invariance control in AdaLN**:
  \[ \text{Map}(M_s, M_t, P_s, P_t) := \begin{cases} \hat{M}_t & \text{if } t < S \\ M_t & \text{otherwise} \end{cases} \]

### Conclusion

This paper proposes an efficient and flexible image editing framework by systematically analyzing and improving inversion and invariance control in flow transformers. The experimental results show that the method performs well across multiple editing scenarios, achieving high-quality edits while preserving non-target content.

Unveil Inversion and Invariance in Flow Transformer for Versatile Image Editing
