UltraEdit: Instruction-based Fine-Grained Image Editing at Scale

Haozhe Zhao, Xiaojian Ma, Liang Chen, Shuzheng Si, Rujie Wu, Kaikai An, Peiyu Yu, Minjia Zhang, Qing Li, Baobao Chang
2024-07-07
Abstract: This paper presents UltraEdit, a large-scale (approximately 4 million editing samples), automatically generated dataset for instruction-based image editing. Our key idea is to address the drawbacks in existing image editing datasets like InstructPix2Pix and MagicBrush, and provide a systematic approach to producing massive and high-quality image editing samples. UltraEdit offers several distinct advantages: 1) It features a broader range of editing instructions by leveraging the creativity of large language models (LLMs) alongside in-context editing examples from human raters; 2) Its data sources are based on real images, including photographs and artworks, which provide greater diversity and reduced bias compared to datasets solely generated by text-to-image models; 3) It also supports region-based editing, enhanced by high-quality, automatically produced region annotations. Our experiments show that canonical diffusion-based editing baselines trained on UltraEdit set new records on MagicBrush and Emu-Edit benchmarks. Our analysis further confirms the crucial role of real image anchors and region-based editing data. The dataset, code, and models can be found at https://ultra-editing.github.io.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses three shortcomings of existing instruction-based image editing datasets such as InstructPix2Pix and MagicBrush: limited diversity of editing instructions, bias in the images, and a lack of region-based editing data. To overcome these limitations, the paper introduces UltraEdit, a large-scale (approximately 4 million editing samples), automatically generated dataset for instruction-based image editing. The dataset has three main characteristics: 1) It uses large language models (LLMs) together with in-context examples from human raters to generate a wider range of editing instructions; 2) It uses real images (such as photographs and artworks) as data sources to increase diversity and reduce bias; 3) It supports region-based editing through high-quality, automatically produced region annotations. The paper generates editing instructions by prompting LLMs, then uses an existing text-to-image (T2I) diffusion model to produce source images and target (edited) images anchored on real images, which mitigates the bias of the T2I model. In addition, the authors develop an automatic region generation method that derives editing regions from the instructions, and they produce region-based editing samples with a modified inpainting diffusion pipeline. Experiments show that canonical diffusion-based editing baselines trained on UltraEdit set new records on the MagicBrush and Emu-Edit benchmarks, and the analysis confirms the importance of real image anchors and region-based editing data. The paper summarizes its contributions as a new method for generating image editing data, the large-scale, high-quality UltraEdit dataset, and an extensive investigation of how models benefit from UltraEdit.
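As a rough illustration of what a region-based editing sample contains, the sketch below pairs a source image, an instruction, a target image, and an optional region mask. Every name here (`EditSample`, `composite`, `make_region_sample`) is hypothetical and not taken from the UltraEdit release. The mask compositing step is a generic stand-in, assumed only for illustration; it is not the paper's modified inpainting diffusion pipeline, which this summary does not describe in detail.

```python
from dataclasses import dataclass
from typing import List, Optional

# Toy representations (assumptions for illustration only):
Image = List[List[float]]  # grayscale image as rows of pixel intensities
Mask = List[List[int]]     # 1 = inside the editing region, 0 = outside


@dataclass
class EditSample:
    """One editing sample: source image, instruction, and edited target."""
    source: Image
    instruction: str
    target: Image
    region: Optional[Mask] = None  # None for free-form (whole-image) edits


def composite(source: Image, edited: Image, mask: Mask) -> Image:
    """Take edited pixels inside the mask and keep source pixels outside it."""
    return [
        [e if m else s for s, e, m in zip(src_row, ed_row, mask_row)]
        for src_row, ed_row, mask_row in zip(source, edited, mask)
    ]


def make_region_sample(source: Image, edited: Image, mask: Mask,
                       instruction: str) -> EditSample:
    """Build a region-based sample whose target differs from the source
    only inside the editing region."""
    target = composite(source, edited, mask)
    return EditSample(source=source, instruction=instruction,
                      target=target, region=mask)


if __name__ == "__main__":
    src = [[0.0, 0.0], [0.0, 0.0]]
    out = [[1.0, 1.0], [1.0, 1.0]]   # pretend output of an editing model
    msk = [[1, 0], [0, 0]]           # edit only the top-left pixel
    sample = make_region_sample(src, out, msk, "brighten the top-left corner")
    print(sample.target)
```

Restricting the change to the masked region is what distinguishes a region-based sample from a free-form one, where `region` would simply be left as `None`.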