Abstract: Diffusion models have emerged as a powerful generative technology and have proven applicable in a variety of scenarios. Most existing foundational diffusion models are designed primarily for text-guided visual generation and do not support multi-modal conditions, which are essential for many visual editing tasks. This limitation prevents these foundational diffusion models from serving as a unified model for visual generation, as GPT-4 does for natural language processing. In this work, we propose ACE, an All-round Creator and Editor, which achieves performance comparable to that of expert models across a wide range of visual generation tasks. To achieve this goal, we first introduce a unified condition format termed Long-context Condition Unit (LCU), and propose a novel Transformer-based diffusion model that uses LCU as input, aiming for joint training across various generation and editing tasks. Furthermore, we propose an efficient data collection approach to address the lack of available training data. It involves acquiring pairwise images with synthesis-based or clustering-based pipelines and supplying these pairs with accurate textual instructions by leveraging a fine-tuned multi-modal large language model. To comprehensively evaluate the performance of our model, we establish a benchmark of manually annotated paired data across a variety of visual generation tasks. Extensive experimental results demonstrate the superiority of our model in visual generation. Thanks to the all-in-one capabilities of our model, we can easily build a multi-modal chat system that responds to any interactive request for image creation using a single model as the backend, avoiding the cumbersome pipeline typically employed in visual agents.
Code and models will be available on the project page: https://ali-vilab.github.io/ace-page/

Edit Everything: A Text-Guided Generative System for Images Editing

InstructEdit: Improving Automatic Masks for Diffusion-based Image Editing With User Instructions

DiffEditor: Boosting Accuracy and Flexibility on Diffusion-based Image Editing

Forgedit: Text Guided Image Editing via Learning and Forgetting

AdapEdit: Spatio-Temporal Guided Adaptive Editing Algorithm for Text-Based Continuity-Sensitive Image Editing

ByteEdit: Boost, Comply and Accelerate Generative Image Editing

SGEdit: Bridging LLM with Text2Image Generative Model for Scene Graph-based Image Editing

InstructGIE: Towards Generalizable Image Editing

Unified Diffusion-Based Rigid and Non-Rigid Editing with Text and Image Guidance

SeedEdit: Align Image Re-Generation to Image Editing

A Survey of Multimodal-Guided Image Editing with Text-to-Image Diffusion Models

DeltaSpace: A Semantic-aligned Feature Space for Flexible Text-guided Image Editing

Text-Driven Image Editing via Learnable Regions

EditVal: Benchmarking Diffusion Based Text-Guided Image Editing Methods

Where You Edit is What You Get: Text-guided Image Editing with Region-Based Attention

eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

E4C: Enhance Editability for Text-Based Image Editing by Harnessing Efficient CLIP Guidance

StyleDiffusion: Prompt-Embedding Inversion for Text-Based Editing

DiffUTE: Universal Text Editing Diffusion Model

SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations

ACE: All-round Creator and Editor Following Instructions via Diffusion Transformer