UltraEdit: Instruction-based Fine-Grained Image Editing at Scale

Haozhe Zhao, Xiaojian Ma, Liang Chen, Shuzheng Si, Rujie Wu, Kaikai An, Peiyu Yu, Minjia Zhang, Qing Li, Baobao Chang

2024-07-07

Abstract: This paper presents UltraEdit, a large-scale (approximately 4 million editing samples), automatically generated dataset for instruction-based image editing. Our key idea is to address the drawbacks in existing image editing datasets like InstructPix2Pix and MagicBrush, and provide a systematic approach to producing massive and high-quality image editing samples. UltraEdit offers several distinct advantages: 1) It features a broader range of editing instructions by leveraging the creativity of large language models (LLMs) alongside in-context editing examples from human raters; 2) Its data sources are based on real images, including photographs and artworks, which provide greater diversity and reduced bias compared to datasets solely generated by text-to-image models; 3) It also supports region-based editing, enhanced by high-quality, automatically produced region annotations. Our experiments show that canonical diffusion-based editing baselines trained on UltraEdit set new records on MagicBrush and Emu-Edit benchmarks. Our analysis further confirms the crucial role of real image anchors and region-based editing data. The dataset, code, and models can be found at https://ultra-editing.github.io.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses three limitations of existing instruction-based image editing datasets such as InstructPix2Pix and MagicBrush: a narrow range of editing instructions, bias inherited from images produced solely by text-to-image models, and a lack of region-based editing data. To overcome these limitations, the paper introduces UltraEdit, a large-scale (approximately 4 million editing samples), automatically generated dataset for instruction-based image editing. The dataset has three main characteristics: 1. It uses large language models (LLMs) together with in-context examples from human raters to generate a wider range of editing instructions. 2. It draws on real images (such as photographs and artworks) as data sources to increase diversity and reduce bias. 3. It supports region-based editing through high-quality, automatically produced region annotations. To build the dataset, the authors generate editing instructions by prompting LLMs, then use an existing text-to-image (T2I) diffusion model, anchored on real images, to produce the source and target (edited) images, which mitigates the bias of the T2I model. They also develop an automatic method for deriving editing regions from instructions and produce region-based editing samples with a modified inpainting diffusion pipeline. Experiments show that canonical diffusion-based editing baselines trained on UltraEdit set new records on the MagicBrush and Emu-Edit benchmarks, and the analysis confirms the importance of real image anchors and region-based editing data. The paper's stated contributions are a new method for generating image editing data, the large-scale, high-quality UltraEdit dataset, and an extensive investigation of how models benefit from UltraEdit.
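
The difference between free-form and region-based editing samples can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the record layout (`EditSample` and its field names), the toy list-of-lists image type, and the helper functions are all hypothetical, and the diffusion model is replaced by a precomputed "edited" image. Simple mask compositing stands in for the modified inpainting pipeline, only to show the property a region-based sample is meant to have: pixels outside the editing region stay identical to the source.

```python
from dataclasses import dataclass
from typing import List, Optional

Image = List[List[float]]  # toy grayscale image: rows of pixel values
Mask = List[List[int]]     # 1 = editable pixel, 0 = pixel to preserve


@dataclass
class EditSample:
    """Hypothetical record layout; field names are assumptions, not the dataset schema."""
    source: Image
    instruction: str
    target: Image
    region: Optional[Mask] = None  # None for free-form (whole-image) edits


def composite(source: Image, edited: Image, mask: Mask) -> Image:
    """Keep edited pixels inside the mask and source pixels outside it."""
    return [
        [e if m else s for s, e, m in zip(src_row, ed_row, m_row)]
        for src_row, ed_row, m_row in zip(source, edited, mask)
    ]


def make_region_sample(source: Image, edited: Image, mask: Mask,
                       instruction: str) -> EditSample:
    """Build a region-based sample whose target differs from source only inside the mask."""
    target = composite(source, edited, mask)
    return EditSample(source, instruction, target, mask)


def changed_outside_region(sample: EditSample) -> int:
    """Count pixels that differ outside the mask (0 for a clean region edit)."""
    if sample.region is None:
        return 0  # free-form samples have no region constraint to check
    return sum(
        1
        for src_row, tgt_row, m_row in zip(sample.source, sample.target, sample.region)
        for s, t, m in zip(src_row, tgt_row, m_row)
        if not m and s != t
    )


if __name__ == "__main__":
    src = [[0.0, 0.0], [0.0, 0.0]]
    edited = [[1.0, 1.0], [1.0, 1.0]]   # stand-in for a model's edited output
    mask = [[1, 0], [0, 0]]             # only the top-left pixel may change
    sample = make_region_sample(src, edited, mask, "make the top-left corner brighter")
    print(sample.target)
    print(changed_outside_region(sample))
```

A check like `changed_outside_region` is one simple way to validate that a region-based sample leaves the rest of the image untouched; whether the actual dataset construction applies such a check is not stated in the text above.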

AnyEdit: Mastering Unified High-Quality Image Editing for Any Idea

HQ-Edit: A High-Quality Dataset for Instruction-based Image Editing

HumanEdit: A High-Quality Human-Rewarded Dataset for Instruction-based Image Editing

FreeEdit: Mask-free Reference-based Image Editing with Multi-modal Instruction

InsightEdit: Towards Better Instruction Following for Image Editing

SEED-Data-Edit Technical Report: A Hybrid Dataset for Instructional Image Editing

Emu Edit: Precise Image Editing via Recognition and Generation Tasks

I2EBench: A Comprehensive Benchmark for Instruction-based Image Editing

MagicBrush: A Manually Annotated Dataset for Instruction-Guided Image Editing

InstructGIE: Towards Generalizable Image Editing

EditWorld: Simulating World Dynamics for Instruction-Following Image Editing

EditVal: Benchmarking Diffusion Based Text-Guided Image Editing Methods

Multi-Reward as Condition for Instruction-based Image Editing

SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models

OmniEdit: Building Image Editing Generalist Models Through Specialist Supervision

ParallelEdits: Efficient Multi-Aspect Text-Driven Image Editing with Attention Grouping

ByteEdit: Boost, Comply and Accelerate Generative Image Editing

InstructEdit: Improving Automatic Masks for Diffusion-based Image Editing With User Instructions

DesignEdit: Multi-Layered Latent Decomposition and Fusion for Unified & Accurate Image Editing

Learning Action and Reasoning-Centric Image Editing from Videos and Simulations