Abstract: Our work addresses limitations seen in previous approaches for object-centric editing problems, such as unrealistic results due to shape discrepancies and limited control in object replacement or insertion. To this end, we introduce FlexEdit, a flexible and controllable editing framework for objects where we iteratively adjust latents at each denoising step using our FlexEdit block. Initially, we optimize latents at test time to align with specified object constraints. Then, our framework employs an adaptive mask, automatically extracted during denoising, to protect the background while seamlessly blending new content into the target image. We demonstrate the versatility of FlexEdit in various object editing tasks and curate an evaluation test suite with samples from both real and synthetic images, along with novel evaluation metrics designed for object-centric editing. We conduct extensive experiments on different editing scenarios, demonstrating the superiority of our editing framework over recent advanced text-guided image editing methods. Our project page is published at

What problem does this paper attempt to address?

The paper targets limitations of existing object-centric image editing methods: edited objects often look unrealistic because of shape discrepancies, and users have limited control over object replacement or insertion. To overcome these problems, the authors propose FlexEdit, a flexible and controllable editing framework with the following goals:

1. **Object Replacement**: Flexibly adjust the size and position of the replacement object to better match the user's editing intent.
2. **Object Addition**: Add new objects naturally without additional mask inputs.
3. **Object Removal**: Remove an object without degrading the quality of the original image.

FlexEdit achieves these goals by adjusting the latent variables at each denoising step and using adaptive masks to protect background information. The authors also introduce new evaluation datasets and metrics to better evaluate object-centric image editing tasks.

### Main Contributions

1. **Propose a New Editing Framework**: FlexEdit, a flexible and controllable editing framework for object-centric image editing tasks.
2. **Introduce a New Test Suite**: Test samples and new evaluation metrics designed specifically for object-centric image editing.
3. **Conduct Extensive Evaluations**: Comparative experiments against recent editing algorithms on different benchmark datasets, demonstrating the superiority of FlexEdit in various flexible and customizable object editing applications.

### Method Overview

1. **Latent Optimization**: Obtain the editing semantics by optimizing the latent variables. This covers size and position control during object replacement and attention separation during object addition.
2. **Latent Fusion**: Use an adaptive binary mask to fuse the edited latent variables with the background information of the source image, so that the edited area blends seamlessly into the background.
3. **Iterative Latent Manipulation**: Ensure the quality of the editing results by performing latent optimization and latent fusion iteratively.

### Experimental Results

The authors conducted experiments on multiple datasets, including MagicO, PieBenchO, and SynO. The results show that FlexEdit outperforms existing editing methods in both background preservation and editing semantics. In particular, FlexEdit offers greater flexibility and control in object replacement, object addition, and object removal.

### Formulas

- **Latent Optimization Loss Function**:

  \[ L_{\text{pos}} = \| \text{centroid}_{j,t} - \text{centroid}^*_{t} \|_2^2 \]

  \[ L_{\text{size}} = \| \text{size}_{j,t} - \text{size}^*_{t} \|_2^2 \]

- **Separation Loss Function**:

  \[ L_{\text{sep}} = \frac{\sum_{k=1}^{H \times W} f_{j,t,k} \cdot g_{i,k}}{\| f_{j,t} \|_2^2 \cdot \| g_i \|_2^2} \]

- **Latent Fusion**:

  \[ z^*_t = z''_t \odot \hat{M}_t + z_t \odot (1 - \hat{M}_t) \]

Together, these terms let FlexEdit control object attributes during editing while keeping the background information intact.
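To make the formulas above concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: nested lists stand in for attention maps and flattened latents, and the `centroid` and `size` definitions (attention-weighted mean position and mean attention value) are assumptions chosen for illustration, since the summary does not define them. All function names are hypothetical.

```python
# Illustrative sketch only; plain lists stand in for tensors.
# Function names and the centroid/size definitions are assumptions,
# not taken from the FlexEdit paper or code.

def latent_fusion(z_edit, z_src, mask):
    """Blend per element: mask * edited + (1 - mask) * source."""
    return [m * e + (1 - m) * s for e, s, m in zip(z_edit, z_src, mask)]


def centroid(attn):
    """Assumed definition: attention-weighted mean (row, col) position."""
    total = sum(sum(row) for row in attn)
    cy = sum(y * v for y, row in enumerate(attn) for v in row) / total
    cx = sum(x * v for row in attn for x, v in enumerate(row)) / total
    return (cy, cx)


def position_loss(attn, target_centroid):
    """Squared L2 distance between the current and target centroids."""
    cy, cx = centroid(attn)
    ty, tx = target_centroid
    return (cy - ty) ** 2 + (cx - tx) ** 2


def size(attn):
    """Assumed definition: mean attention value over the map."""
    n_cells = sum(len(row) for row in attn)
    return sum(sum(row) for row in attn) / n_cells


def size_loss(attn, target_size):
    """Squared difference between the current and target sizes."""
    return (size(attn) - target_size) ** 2


def separation_loss(f, g):
    """Dot product of two flattened maps over the product of their
    squared L2 norms, following the formula as written above."""
    dot = sum(a * b for a, b in zip(f, g))
    norm_f_sq = sum(a * a for a in f)
    norm_g_sq = sum(b * b for b in g)
    return dot / (norm_f_sq * norm_g_sq)


if __name__ == "__main__":
    # The mask keeps edited values at positions 0 and 2, source at position 1.
    print(latent_fusion([1.0, 2.0, 3.0], [10.0, 20.0, 30.0], [1, 0, 1]))
    # Attention concentrated in the bottom-right cell of a 2x2 map.
    print(position_loss([[0, 0], [0, 1]], (0.0, 0.0)))
    # Non-overlapping maps give a separation loss of zero.
    print(separation_loss([1.0, 0.0], [0.0, 1.0]))
```

In this sketch, minimizing `separation_loss` pushes two attention maps toward non-overlapping support, while `latent_fusion` leaves every position outside the mask identical to the source latent, which is how the background is preserved.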

FlexEdit: Flexible and Controllable Diffusion-based Object-centric Image Editing
