Abstract: Instruction-guided image editing methods have demonstrated significant potential by training diffusion models on automatically synthesized or manually annotated image editing pairs. However, these methods remain far from practical, real-life applications. We identify three primary challenges contributing to this gap. Firstly, existing models have limited editing skills due to the biased synthesis process. Secondly, these methods are trained on datasets with a high volume of noise and artifacts, owing to simple filtering methods such as CLIP-score. Thirdly, all these datasets are restricted to a single low resolution and a fixed aspect ratio, limiting their versatility in real-world use cases. In this paper, we present OmniEdit, an omnipotent editor that handles seven different image editing tasks with any aspect ratio seamlessly. Our contributions are fourfold: (1) OmniEdit is trained with supervision from seven different specialist models to ensure task coverage; (2) we use importance sampling based on scores provided by large multimodal models (like GPT-4o) instead of CLIP-score to improve data quality; (3) we propose a new editing architecture called EditNet to greatly boost the editing success rate; (4) we provide images with different aspect ratios to ensure that our model can handle any image in the wild. We have curated a test set containing images of different aspect ratios, accompanied by diverse instructions covering different tasks. Both automatic and human evaluations demonstrate that OmniEdit significantly outperforms all existing models. Our code, dataset and model will be available at https://tiger-ai-lab.github.io/OmniEdit/

What problem does this paper attempt to address?

The paper addresses the limitations that keep existing image-editing methods from practical use. These limitations show up in three ways:

1. **Limited editing ability**: Existing image-editing models have limited editing skills because their training data comes from a biased synthesis process. For example, some models perform poorly at local editing (such as adding, removing or swapping objects), while others are less effective at global editing (such as style or background changes).
2. **Poor data quality control**: Most methods use simple filtering mechanisms (such as CLIP-score or DINO-score) to select training samples automatically, but these metrics correlate poorly with actual data quality, so the training data is noisy and model performance suffers.
3. **No support for multiple resolutions**: All current models are trained only on square images, which limits their ability to generalize to non-square images.

To overcome these challenges, the paper proposes OmniEdit, an all-purpose editor that handles seven different image-editing tasks and supports any aspect ratio. OmniEdit addresses the problems above through four key innovations:

1. **Specialist-to-generalist supervision**: A general-purpose editing model, OmniEdit, is trained with supervision signals from multiple specialist models. Each specialist focuses on a different editing task, which ensures task coverage.
2. **Importance sampling**: Large multimodal models (such as GPT-4o) assign quality scores to synthetic samples to improve the quality of the training data. Because scoring with GPT-4o is computationally expensive, its scoring ability is first distilled into the medium-sized model InternVL2, which is then used for large-scale scoring.
3. **EditNet architecture**: A new diffusion-transformer architecture, EditNet, lets the control branch and the original branch interact through intermediate representations, improving OmniEdit's ability to understand diverse editing tasks.
4. **Support for any aspect ratio**: Training combines high-resolution images with different aspect ratios so that OmniEdit can handle images of any aspect ratio without degrading output quality.

With these innovations, OmniEdit shows significant advantages across diverse image-editing tasks. It outperforms existing models on automatic evaluation metrics (such as VIEScore) and also achieves higher perceptual quality and semantic consistency in human evaluations.
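The score-based data selection described above can be illustrated with a minimal sketch. The summary does not spell out the exact sampling procedure, so everything here is an assumption: the function name `importance_sample`, the `threshold` parameter, and the use of Efraimidis-Spirakis weighted sampling without replacement as a stand-in. The scores are imagined to come from a scorer such as the distilled InternVL2 model; no model is called here.

```python
import random


def importance_sample(samples, scores, k, threshold=0.0, seed=0):
    """Pick k distinct samples, favouring those with higher quality scores.

    Hypothetical helper, not the paper's implementation: samples scoring
    below `threshold` (or with a non-positive score) are dropped, and the
    rest are drawn without replacement with probability tied to score.
    """
    kept = [(s, w) for s, w in zip(samples, scores) if w >= threshold and w > 0]
    if k > len(kept):
        raise ValueError("not enough samples pass the score threshold")

    rng = random.Random(seed)
    # Efraimidis-Spirakis keys: u ** (1 / w) with u uniform in [0, 1).
    # Taking the k largest keys yields a weighted sample without replacement.
    keyed = sorted(kept, key=lambda sw: rng.random() ** (1.0 / sw[1]), reverse=True)
    return [s for s, _ in keyed[:k]]


if __name__ == "__main__":
    # Placeholder identifiers for synthesized editing pairs and their scores.
    pairs = ["pair_a", "pair_b", "pair_c", "pair_d"]
    quality = [9.0, 2.0, 8.5, 0.0]
    print(importance_sample(pairs, quality, k=2, threshold=5.0))
```

Compared with a hard cut on CLIP-score, weighting by a learned quality score keeps some diversity among the retained pairs while still making low-scoring pairs unlikely to enter the training set.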

OmniEdit: Building Image Editing Generalist Models Through Specialist Supervision

Emu Edit: Precise Image Editing via Recognition and Generation Tasks

AnyEdit: Mastering Unified High-Quality Image Editing for Any Idea

OmniCreator: Self-Supervised Unified Generation with Universal Editing

InsightEdit: Towards Better Instruction Following for Image Editing

Omni-IML: Towards Unified Image Manipulation Localization

FreeEdit: Mask-free Reference-based Image Editing with Multi-modal Instruction

InstructGIE: Towards Generalizable Image Editing

ReEdit: Multimodal Exemplar-Based Image Editing with Diffusion Models

Learning to Follow Object-Centric Image Editing Instructions Faithfully

UniHuman: A Unified Model for Editing Human Images in the Wild

Object-aware Inversion and Reassembly for Image Editing

UltraEdit: Instruction-based Fine-Grained Image Editing at Scale

Text-driven Editing of 3D Scenes without Retraining

Multi-Reward as Condition for Instruction-based Image Editing

Learning Action and Reasoning-Centric Image Editing from Videos and Simulations

ParallelEdits: Efficient Multi-Aspect Text-Driven Image Editing with Attention Grouping

Lightweight Text-Driven Image Editing With Disentangled Content and Attributes

Free-Editor: Zero-shot Text-driven 3D Scene Editing

InstructEdit: Improving Automatic Masks for Diffusion-based Image Editing With User Instructions

OBJECT 3DIT: Language-guided 3D-aware Image Editing