Latent Space Editing in Transformer-Based Flow Matching

Vincent Tao Hu, David W Zhang, Pascal Mettes, Meng Tang, Deli Zhao, Cees G.M. Snoek
2023-12-18
Abstract: This paper strives for image editing via generative models. Flow Matching is an emerging generative modeling technique that offers the advantage of simple and efficient training. Simultaneously, a new transformer-based U-ViT has recently been proposed to replace the commonly used UNet for better scalability and performance in generative modeling. Flow Matching with a transformer backbone therefore offers the potential for scalable and high-quality generative modeling, but its latent structure and editing ability are as yet unknown. We adopt this setting and explore how to edit images through latent space manipulation. We introduce an editing space, which we call $u$-space, that can be manipulated in a controllable, accumulative, and composable manner. Additionally, we propose a tailored sampling solution to enable sampling with the more efficient adaptive step-size ODE solvers. Lastly, we put forth a straightforward yet powerful method for achieving fine-grained and nuanced editing using text prompts. Our framework is simple and efficient, and it is highly effective at editing images while preserving the essence of the original content. Our code will be publicly available at https://taohu.me/lfm/
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
The paper addresses several open issues in image editing and proposes a Flow Matching method with a Transformer backbone for efficient, controllable editing. Specifically:

1. **Exploring Latent Space Editing under Transformer Architecture**: The paper combines the Transformer architecture (U-ViT) with Flow Matching, a recent generative modeling technique, and explores how to edit images by manipulating the latent space. Existing research mostly focuses on the traditional UNet architecture, so the latent space structure and editing capabilities of the Transformer architecture are not well understood.
2. **Proposing a New Editing Space ($u$-space)**: To achieve controllable, accumulative, and composable image editing, the authors define a new editing space, $u$-space, and demonstrate how to manipulate semantic directions within it. Unlike the traditional $h$-space in UNet, $u$-space is located at the beginning of the U-ViT architecture.
3. **Improving the Sampling Process**: To address the inconsistency between the forward and backward processes in Flow Matching, the paper proposes interpolating semantic directions during sampling, which allows editing with the more efficient adaptive step-size ODE solvers.
4. **Local Prompt Editing under Text Conditions**: The paper further explores how to achieve local image editing by modifying text prompts. Compared to the prompt-to-prompt method, the proposed solution is more intuitive and straightforward: specific attributes are strengthened or weakened by adjusting attention weights.

In summary, the main goal of the paper is to develop an efficient and user-friendly image editing method within the Flow Matching framework, so that users can control and edit images precisely through simple operations.
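
The description of $u$-space edits as controllable, accumulative, and composable can be made concrete with a toy sketch. The Python below is not the paper's implementation: it assumes that an edit is a plain linear shift of a latent vector along a semantic direction. The names `apply_edits`, `directions`, and `strengths`, the direction labels, and the 4-dimensional vectors are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def apply_edits(u: Sequence[float],
                directions: Dict[str, Sequence[float]],
                strengths: Dict[str, float]) -> Vector:
    """Shift a latent `u` along named semantic directions.

    Assumption (not taken from the paper): an edit is the linear shift
    u' = u + sum_k alpha_k * d_k, with one scalar strength per direction.
    """
    edited = list(u)
    for name, alpha in strengths.items():
        d = directions[name]
        if len(d) != len(edited):
            raise ValueError(f"direction {name!r} has the wrong dimensionality")
        edited = [x + alpha * dx for x, dx in zip(edited, d)]
    return edited


# Toy 4-dimensional latent and two made-up semantic directions.
u = [0.0, 1.0, -1.0, 0.5]
directions = {
    "smile": [1.0, 0.0, 0.0, 0.0],
    "age": [0.0, 0.5, 0.5, 0.0],
}

# Controllable: the scalar strength sets how far the latent moves.
weak = apply_edits(u, directions, {"smile": 0.5})

# Accumulative: two steps of 0.5 land where one step of 1.0 does.
twice = apply_edits(weak, directions, {"smile": 0.5})
once = apply_edits(u, directions, {"smile": 1.0})

# Composable: several directions can be combined, in either order.
both = apply_edits(u, directions, {"smile": 1.0, "age": 2.0})
swapped = apply_edits(u, directions, {"age": 2.0, "smile": 1.0})
```

Under this linear assumption, `twice` equals `once` and `both` equals `swapped`. In the actual method the edited latent would still have to pass through U-ViT and the ODE solver, which the sketch leaves out entirely.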