Abstract: Recently, how to achieve precise image editing has attracted increasing attention, especially given the remarkable success of text-to-image generation models. To unify various spatial-aware image editing abilities into one framework, we adopt the concept of layers from the design domain to manipulate objects flexibly with various operations. The key insight is to transform the spatial-aware image editing task into a combination of two sub-tasks: multi-layered latent decomposition and multi-layered latent fusion. First, we segment the latent representations of the source images into multiple layers, which include several object layers and one incomplete background layer that necessitates reliable inpainting. To avoid extra tuning, we further explore the inner inpainting ability within the self-attention mechanism. We introduce a key-masking self-attention scheme that can propagate the surrounding context information into the masked region while mitigating its impact on the regions outside the mask. Second, we propose an instruction-guided latent fusion that pastes the multi-layered latent representations onto a canvas latent. We also introduce an artifact suppression scheme in the latent space to enhance the inpainting quality. Due to the inherent modular advantages of such multi-layered representations, we can achieve accurate image editing, and we demonstrate that our approach consistently surpasses the latest spatial editing methods, including Self-Guidance and DiffEditor. Last, we show that our approach is a unified framework that supports accurate image editing across more than six different editing tasks.
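The key-masking idea in the abstract can be illustrated with a toy single-head attention in plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `key_masked_attention`, the list-based token layout, and the choice to give masked keys a score of negative infinity are all illustrative. It shows one plausible reading of the scheme, in which every query ignores keys that lie inside the masked region, so masked positions are filled from surrounding context and their original content does not leak into positions outside the mask.

```python
import math


def key_masked_attention(queries, keys, values, key_is_masked):
    """Toy single-head attention in which masked keys get zero weight.

    queries, keys, values: lists of equal-length float vectors (one per token).
    key_is_masked: list of bools; True marks a token inside the masked region.
    Assumption (hypothetical helper): at least one key must be unmasked.
    """
    if all(key_is_masked):
        raise ValueError("at least one key must be outside the mask")

    scale = math.sqrt(len(queries[0]))
    outputs = []
    for query in queries:
        # Score every key; masked keys are pushed to -inf so softmax gives them 0.
        scores = []
        for key, masked in zip(keys, key_is_masked):
            if masked:
                scores.append(-math.inf)
            else:
                scores.append(sum(q * k for q, k in zip(query, key)) / scale)

        # Numerically stable softmax over the remaining (unmasked) keys.
        peak = max(scores)
        weights = [math.exp(score - peak) for score in scores]
        total = sum(weights)
        weights = [weight / total for weight in weights]

        # Weighted sum of values; masked tokens contribute nothing.
        outputs.append([
            sum(weight * value[j] for weight, value in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs
```

Note that the query at a masked position still produces an output, built only from unmasked keys. That is what lets context flow into the hole while the hole's own content is ignored everywhere.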


### What problems does this paper attempt to solve?

This paper addresses **precise spatial-aware image editing**, in particular how to build a unified framework for multiple spatial-aware editing tasks without additional training. The authors propose **DesignEdit**, which decomposes the source image's latent representation into layers and fuses them back together, so that different objects in an image can be edited flexibly without task-specific training or back-propagation updates.

#### Main challenges

1. **Precise spatial-aware editing**: Existing text-to-image generation models struggle with prompts that require numerical or spatial arrangement capabilities. The generated image may not match the text description, leaving a gap between what the user expects and what the model produces.
2. **Simultaneous multi-object editing**: Existing editing methods usually combine several editing guidance designs and update the latent representation through additional back-propagation, which makes it difficult to apply different operations to different objects at the same time.
3. **High-quality background repair**: After an object is removed or moved, the region it occupied must be inpainted. Existing methods may introduce artifacts or distortion in this region.

#### Solutions

To address these problems, the authors propose a **training-free, forward-only, unified framework** built from the following components:

1. **Multi-layer latent decomposition**:
   - Segment the latent representation of the source image into multiple layers: several object layers and one incomplete background layer.
   - Introduce a **key-masking self-attention scheme** that propagates surrounding context into the masked region of the background while limiting the masked region's impact on the area outside the mask.
2. **Multi-layer latent fusion**:
   - Paste the multi-layer latent representations onto a canvas latent according to the target layout.
   - Apply an **artifact suppression scheme** in the latent space to improve inpainting quality.
3. **Instruction-guided editing**:
   - Use the reasoning and planning capabilities of GPT-4V to convert the user's vague editing instructions into detailed layer-wise editing instructions.

With these techniques, the authors report that their method consistently outperforms existing spatial editing methods (such as Self-Guidance and DiffEditor) and supports a range of image editing tasks, including object removal, resizing, moving, repeating, flipping, camera panning, zooming, compositing multiple images, and editing text or decorations.

#### Summary

The main contribution of this paper is a **unified and accurate spatial-aware image editing framework** that handles multiple image editing tasks without additional training while improving editing quality and accuracy over the compared methods.
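The fusion step described under Solutions, pasting object layers onto a canvas according to a target layout, can be sketched as follows. This is a simplified illustration rather than the paper's code: the names `paste_layer` and `fuse_layers`, the dictionary keys `latent`, `mask`, `top`, and `left`, and the use of a single scalar per latent cell (real latents carry several channels) are all assumptions made for the example. Moving an object then amounts to changing its `top` and `left` offsets before fusion; removing it amounts to leaving its layer out.

```python
def paste_layer(canvas, latent, mask, top, left):
    """Return a copy of `canvas` with `latent` pasted at offset (top, left).

    Only cells where `mask` is truthy are written; cells that would fall
    outside the canvas are skipped. The input canvas is not modified.
    """
    result = [row[:] for row in canvas]
    height, width = len(result), len(result[0])
    for r, (latent_row, mask_row) in enumerate(zip(latent, mask)):
        for c, (value, keep) in enumerate(zip(latent_row, mask_row)):
            row, col = top + r, left + c
            if keep and 0 <= row < height and 0 <= col < width:
                result[row][col] = value
    return result


def fuse_layers(background, layers):
    """Paste layers onto the background in order; later layers sit on top."""
    canvas = background
    for layer in layers:
        canvas = paste_layer(
            canvas, layer["latent"], layer["mask"], layer["top"], layer["left"]
        )
    return canvas


# Hypothetical usage: one object layer moved to the bottom-right of a 3x3 canvas.
background = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
object_layer = {
    "latent": [[5, 5], [5, 5]],
    "mask": [[1, 0], [1, 1]],  # the object's shape inside its bounding box
    "top": 1,
    "left": 1,
}
fused = fuse_layers(background, [object_layer])
```

In this sketch the background is assumed to have been inpainted already, so the pasted layers only need to cover their own masks.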

DesignEdit: Multi-Layered Latent Decomposition and Fusion for Unified & Accurate Image Editing
