Abstract: This paper addresses image editing via generative models. Flow Matching is an emerging generative modeling technique that offers simple and efficient training. Meanwhile, the transformer-based U-ViT has recently been proposed as a replacement for the commonly used UNet, for better scalability and performance in generative modeling. Flow Matching with a transformer backbone therefore offers the potential for scalable, high-quality generative modeling, but its latent structure and editing ability are as yet unknown. We adopt this setting and explore how to edit images through latent space manipulation. We introduce an editing space, which we call $u$-space, that can be manipulated in a controllable, accumulative, and composable manner. Additionally, we propose a tailored sampling solution that enables sampling with the more efficient adaptive step-size ODE solvers. Lastly, we put forth a straightforward yet powerful method for fine-grained and nuanced editing using text prompts. Our framework is simple and efficient, and highly effective at editing images while preserving the essence of the original content. Our code will be publicly available at https://taohu.me/lfm/
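The "accumulative" and "composable" properties claimed for $u$-space can be illustrated with a small sketch. It assumes, as in earlier latent-direction editing work, that an edit adds a scaled semantic direction to a feature vector; the abstract does not state the exact form of the edit, so `apply_direction`, the 4-dimensional feature `u`, and the directions `d_smile` and `d_age` are all hypothetical names chosen for illustration.

```python
def apply_direction(u, direction, alpha):
    """Return a copy of u shifted by alpha * direction (element-wise).

    Assumption: an edit is an additive shift in feature space.
    This is an illustration, not the paper's implementation.
    """
    if len(u) != len(direction):
        raise ValueError("u and direction must have the same length")
    return [ui + alpha * di for ui, di in zip(u, direction)]


# Hypothetical feature vector and two hypothetical semantic directions.
u = [0.5, -1.0, 0.25, 2.0]
d_smile = [1.0, 0.0, 0.0, 0.0]
d_age = [0.0, 0.5, 0.0, -0.5]

# Accumulative: two steps of strength 0.5 match one step of strength 1.0.
twice = apply_direction(apply_direction(u, d_smile, 0.5), d_smile, 0.5)
once = apply_direction(u, d_smile, 1.0)

# Composable: applying two different directions gives the same result
# regardless of the order in which they are applied.
smile_then_age = apply_direction(apply_direction(u, d_smile, 1.0), d_age, 2.0)
age_then_smile = apply_direction(apply_direction(u, d_age, 2.0), d_smile, 1.0)
```

The sketch only shows the arithmetic in feature space. In a real model the edited feature passes through the rest of the network, which is nonlinear, so the effect on the generated image need not be additive in pixel space.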

What problem does this paper attempt to address?

The paper addresses several open questions in image editing and proposes a Transformer-based Flow Matching method for efficient and controllable editing. Specifically:

1. **Exploring Latent Space Editing under the Transformer Architecture**: The paper combines the Transformer architecture (U-ViT) with Flow Matching, a recent generative modeling technique, and explores how to edit images by manipulating the latent space. Existing research mostly focuses on the traditional UNet architecture, so the latent space structure and editing capabilities of the Transformer architecture are not well understood.
2. **Proposing a New Editing Space ($u$-space)**: To achieve controllable, cumulative, and composable image editing, the authors define a new editing space, $u$-space, and demonstrate how to manipulate semantic directions within it. Unlike the traditional "h-space" in UNet, $u$-space is located at the beginning of the U-ViT architecture.
3. **Improving the Sampling Process**: To address the inconsistency between the forward and backward processes in Flow Matching, the authors propose interpolating the semantic direction during sampling, which allows editing with the more efficient adaptive step-size ODE solvers.
4. **Local Prompt Editing under Text Conditions**: The paper further explores local image editing by modifying text prompts. Compared to the prompt-to-prompt method, the proposed solution is more intuitive and straightforward: specific attributes are strengthened or weakened by adjusting attention weights.

In summary, the main goal of the paper is to develop an efficient and user-friendly image editing method within the Flow Matching framework, giving users precise control over edits through simple operations.
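The attention-weight adjustment mentioned in item 4 can be sketched as follows. This is a minimal illustration of the general idea, not the authors' implementation: the function names, the toy scores, and in particular the choice to renormalize the weights after scaling are assumptions, since the summary does not specify them.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def reweight_attention(scores, token_index, scale, renormalize=True):
    """Scale the attention weight of one prompt token.

    scale > 1 strengthens the token's influence, scale < 1 weakens it.
    Renormalizing so the weights sum to 1 is an assumption of this sketch.
    """
    weights = softmax(scores)
    weights[token_index] *= scale
    if renormalize:
        total = sum(weights)
        weights = [w / total for w in weights]
    return weights


# Hypothetical attention scores of one image token over three prompt tokens.
scores = [2.0, 1.0, 0.5]
base = softmax(scores)
stronger = reweight_attention(scores, token_index=1, scale=2.0)
weaker = reweight_attention(scores, token_index=1, scale=0.5)
```

Here `stronger` gives the second prompt token more weight than `base` does, and `weaker` gives it less, while the weights still sum to 1 in both cases.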

Latent Space Editing in Transformer-Based Flow Matching
