DesignEdit: Multi-Layered Latent Decomposition and Fusion for Unified & Accurate Image Editing

Yueru Jia, Yuhui Yuan, Aosong Cheng, Chuke Wang, Ji Li, Huizhu Jia, Shanghang Zhang
2024-03-21
Abstract: Recently, how to achieve precise image editing has attracted increasing attention, especially given the remarkable success of text-to-image generation models. To unify various spatial-aware image editing abilities into one framework, we adopt the concept of layers from the design domain to manipulate objects flexibly with various operations. The key insight is to transform the spatial-aware image editing task into a combination of two sub-tasks: multi-layered latent decomposition and multi-layered latent fusion. First, we segment the latent representations of the source images into multiple layers, which include several object layers and one incomplete background layer that necessitates reliable inpainting. To avoid extra tuning, we further explore the inner inpainting ability within the self-attention mechanism. We introduce a key-masking self-attention scheme that can propagate the surrounding context information into the masked region while mitigating its impact on the regions outside the mask. Second, we propose an instruction-guided latent fusion that pastes the multi-layered latent representations onto a canvas latent. We also introduce an artifact suppression scheme in the latent space to enhance the inpainting quality. Due to the inherent modular advantages of such multi-layered representations, we can achieve accurate image editing, and we demonstrate that our approach consistently surpasses the latest spatial editing methods, including Self-Guidance and DiffEditor. Last, we show that our approach is a unified framework that supports various accurate image editing tasks on more than six different editing tasks.
Computer Vision and Pattern Recognition
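The key-masking self-attention scheme mentioned in the abstract can be illustrated with a minimal sketch. This is one plausible reading rather than the authors' implementation: it assumes that "key-masking" means keys inside the masked region are excluded from the softmax, so every query (including queries inside the mask) aggregates values only from the surrounding context. The function name `key_masked_attention` and the list-of-lists token layout are hypothetical, and a real diffusion model would apply this per head on tensors.

```python
import math


def key_masked_attention(queries, keys, values, key_is_masked):
    """Single-head scaled dot-product attention that ignores masked keys.

    Assumption (not confirmed by the source): masked key positions get a
    logit of -inf, so they receive zero attention weight.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        # Scaled dot-product logits; masked keys are pushed to -inf.
        logits = [
            -math.inf if masked
            else sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k, masked in zip(keys, key_is_masked)
        ]
        peak = max(logits)
        if peak == -math.inf:
            raise ValueError("at least one key must be unmasked")
        # Numerically stable softmax; exp(-inf) evaluates to 0.0.
        weights = [math.exp(logit - peak) for logit in logits]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs
```

With this masking, a token inside the removed object's region is rebuilt purely from context tokens, and tokens outside the region never read from the masked content.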
What problem does this paper attempt to address?
This paper addresses **precise spatial-aware image editing**: specifically, how to build a unified framework that covers multiple spatial-aware editing tasks without additional training. The authors propose **DesignEdit**, which edits different objects in an image flexibly through multi-layered latent decomposition and fusion, without task-specific training or back-propagation updates.

#### Main challenges

1. **Precise spatial-aware editing**: Existing image generation models, such as text-to-image models, struggle with prompts that require numerical or spatial arrangement. The generated image may not match the text description, leaving a gap between what the user expects and what the model produces.
2. **Simultaneous multi-object editing**: Existing editing methods usually combine several hand-designed editing guidance terms and update the latent representation through additional back-propagation, which makes it difficult to apply different operations to different objects at the same time.
3. **High-quality background inpainting**: After an object is removed or moved, the quality of the inpainted background is crucial. Existing methods may introduce artifacts or distortion in the inpainted region.

#### Solutions

To solve these problems, the authors propose a **unified, training-free framework that needs only forward passes**, built from the following components:

1. **Multi-layered latent decomposition**:
   - Segment the latent representation of the source image into multiple layers: several object layers and one incomplete background layer.
   - Introduce a **key-masking self-attention scheme** that inpaints the background layer by propagating surrounding context into the masked region while limiting its effect on the regions outside the mask.
2. **Multi-layered latent fusion**:
   - Paste the multi-layered latent representations onto a canvas latent according to the target layout.
   - Propose an **artifact suppression scheme** in the latent space to improve inpainting quality.
3. **Instruction-guided editing**:
   - Use the reasoning and planning capabilities of GPT-4V to convert the user's vague editing instructions into detailed hierarchical editing instructions.

With these techniques, the authors report that their method consistently outperforms recent spatial editing methods (such as Self-Guidance and DiffEditor) and supports multiple complex image editing tasks, such as object removal, resizing, moving, repeating, flipping, camera panning, zooming, compositing multiple images, and editing text or decorations.

#### Summary

The main contribution of this paper is a **unified and accurate spatial-aware image editing framework** that handles multiple image editing tasks without additional training and improves editing quality and accuracy over prior spatial editing methods.
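
The decomposition and fusion steps described above can be sketched on a toy 2D grid. This is an illustration under stated assumptions, not the paper's code: real latents carry a channel dimension, the helper names `decompose` and `fuse` are hypothetical, `None` stands in for "no content at this position", and the canvas is assumed to be a background whose hole has already been inpainted by some other step.

```python
def decompose(latent, object_mask):
    """Split a 2D latent grid into an object layer and a background with a hole."""
    object_layer = [
        [value if inside else None for value, inside in zip(row, mask_row)]
        for row, mask_row in zip(latent, object_mask)
    ]
    background = [
        [None if inside else value for value, inside in zip(row, mask_row)]
        for row, mask_row in zip(latent, object_mask)
    ]
    return object_layer, background


def fuse(canvas, object_layer, shift):
    """Paste an object layer onto a copy of the canvas, moved by (rows, cols).

    Cells that would land outside the canvas are dropped.
    """
    d_row, d_col = shift
    height, width = len(canvas), len(canvas[0])
    fused = [row[:] for row in canvas]  # leave the input canvas untouched
    for r, row in enumerate(object_layer):
        for c, value in enumerate(row):
            tr, tc = r + d_row, c + d_col
            if value is not None and 0 <= tr < height and 0 <= tc < width:
                fused[tr][tc] = value
    return fused
```

Because each object lives in its own layer, operations such as moving, repeating, or removing an object reduce to choosing which layers to paste and where, which is the modularity the summary attributes to the layered representation.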