Unveil Inversion and Invariance in Flow Transformer for Versatile Image Editing

Pengcheng Xu, Boyuan Jiang, Xiaobin Hu, Donghao Luo, Qingdong He, Jiangning Zhang, Chengjie Wang, Yunsheng Wu, Charles Ling, Boyu Wang
2024-11-24
Abstract: Leveraging the large generative prior of the flow transformer for tuning-free image editing requires authentic inversion to project the image into the model's domain and a flexible invariance control mechanism to preserve non-target contents. However, the prevailing diffusion inversion performs deficiently in flow-based models, and the invariance control cannot reconcile diverse rigid and non-rigid editing tasks. To address these, we systematically analyze the **inversion and invariance** control based on the flow transformer. Specifically, we unveil that the Euler inversion shares a similar structure to DDIM yet is more susceptible to the approximation error. Thus, we propose a two-stage inversion to first refine the velocity estimation and then compensate for the leftover error, which pivots closely to the model prior and benefits editing. Meanwhile, we propose the invariance control that manipulates the text features within the adaptive layer normalization, connecting the changes in the text prompt to image semantics. This mechanism can simultaneously preserve the non-target contents while allowing rigid and non-rigid manipulation, enabling a wide range of editing types such as visual text, quantity, facial expression, etc. Experiments on versatile scenarios validate that our framework achieves flexible and accurate editing, unlocking the potential of the flow transformer for versatile image editing.
Computer Vision and Pattern Recognition, Machine Learning
### What problem does this paper attempt to address?
This paper addresses how to achieve effective inversion and invariance control in image editing based on the flow transformer. Specifically:

1. **Inversion problem**:
   - **Existing challenges**: Inversion methods developed for diffusion models (such as DDIM inversion) perform poorly in flow models. In particular, with the Euler sampler they are susceptible to approximation errors, so the inversion result deviates from the original image and the editing quality suffers.
   - **Solutions**: The authors propose a two-stage inversion method. The first stage reduces the approximation error of the velocity estimation through fixed-point iteration. The second stage adds a slight compensation at each denoising step to restore the original image accurately and enhance editing ability.
2. **Invariance control problem**:
   - **Existing challenges**: Traditional attention-based mechanisms (such as self-attention and cross-attention) struggle to reconcile rigid and non-rigid editing tasks. In particular, for edits that change layout, quantity, or pose, they cannot also keep the non-target content invariant.
   - **Solutions**: The authors propose a flexible invariance control mechanism based on Adaptive Layer Normalization (AdaLN). By replacing the unedited text features, it allows both rigid and non-rigid editing while preserving non-target content, enabling diverse editing types.
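The two-stage inversion described above can be sketched on a toy problem. This is a minimal illustration, not the authors' implementation: `toy_velocity` is a hypothetical stand-in for the learned velocity $v_\theta$, and the function names, the linear $\sigma$ schedule, and the iteration count are all assumptions. The index convention is that denoising moves from $x_t$ to $x_{t+1}$, so inversion moves from $x_{t+1}$ back to $x_t$.

```python
import numpy as np


def toy_velocity(x, t):
    """Hypothetical stand-in for the learned velocity v_theta(x_t, t)."""
    return np.tanh(x) * (1.0 + 0.05 * t)


def invert_step(x_next, t, sigmas, velocity, num_iters):
    """Stage 1: fixed-point iteration for
    x_t = x_{t+1} + (sigma_t - sigma_{t+1}) * v(x_t, t)."""
    d_sigma = sigmas[t] - sigmas[t + 1]
    x_t = x_next  # initial guess; a single iteration is the plain Euler inversion
    for _ in range(num_iters):
        x_t = x_next + d_sigma * velocity(x_t, t)
    return x_t


def two_stage_inversion(x_image, sigmas, velocity, num_iters=5):
    """Invert from the image (last index) back to index 0.

    Returns the inverted latent and the per-step compensation terms.
    """
    num_steps = len(sigmas) - 1
    x = x_image
    eps = [None] * num_steps
    for t in range(num_steps - 1, -1, -1):
        x_next = x
        x = invert_step(x_next, t, sigmas, velocity, num_iters)
        # Stage 2: record what a denoising Euler step from x_t would miss.
        x_hat_next = x + (sigmas[t + 1] - sigmas[t]) * velocity(x, t)
        eps[t] = x_next - x_hat_next
    return x, eps


def denoise(x_start, eps, sigmas, velocity):
    """Euler denoising with the stored compensation added at every step."""
    x = x_start
    for t in range(len(eps)):
        x = x + (sigmas[t + 1] - sigmas[t]) * velocity(x, t) + eps[t]
    return x


if __name__ == "__main__":
    sigmas = np.linspace(1.0, 0.0, 11)  # assumed schedule: noise (1) to image (0)
    x_image = np.array([0.5, -1.0, 2.0])
    x_inv, eps = two_stage_inversion(x_image, sigmas, toy_velocity)
    x_rec = denoise(x_inv, eps, sigmas, toy_velocity)
    print("max reconstruction error:", np.abs(x_rec - x_image).max())
```

In this sketch the fixed-point loop converges because the step size times the slope of the toy velocity is well below 1. Whether that holds for a real flow transformer depends on the model and the schedule, which the summary does not specify.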
### Formula summary

- **Inversion formula**:

  \[ x_t = x_{t+1} + (\sigma_t - \sigma_{t+1})\, v_\theta(x_t, t) \quad \text{(ideal case)} \]

  \[ x_t \approx x_{t+1} + (\sigma_t - \sigma_{t+1})\, v_\theta(x_{t+1}, t) \quad \text{(approximate case)} \]

- **Fixed-point iteration formula**:

  \[ x^{i+1}_t = x_{t+1} + (\sigma_t - \sigma_{t+1})\, v_\theta(x^i_t, t) \]

- **Velocity compensation formula**:

  \[ \hat{x}_{t+1} = x_t + (\sigma_{t+1} - \sigma_t)\, v_\theta(x_t, t) \]

  \[ \epsilon_t = x_{t+1} - \hat{x}_{t+1} \]

- **Invariance control in AdaLN**:

  \[ \text{Map}(M_s, M_t, P_s, P_t) := \begin{cases} \hat{M}_t & \text{if } t < S \\ M_t & \text{otherwise} \end{cases} \]

### Conclusion

This paper proposes an efficient and flexible image editing framework by systematically analyzing and improving inversion and invariance control in flow transformers. The experimental results show that this method performs well in multiple editing scenarios, can achieve high-quality image editing, and maintain the invariance of non-target content at the same time.