OmniEdit: Building Image Editing Generalist Models Through Specialist Supervision

Cong Wei, Zheyang Xiong, Weiming Ren, Xinrun Du, Ge Zhang, Wenhu Chen
2024-11-12
Abstract: Instruction-guided image editing methods have demonstrated significant potential by training diffusion models on automatically synthesized or manually annotated image editing pairs. However, these methods remain far from practical, real-life applications. We identify three primary challenges contributing to this gap. Firstly, existing models have limited editing skills due to the biased synthesis process. Secondly, these methods are trained with datasets with a high volume of noise and artifacts. This is due to the application of simple filtering methods like CLIP-score. Thirdly, all these datasets are restricted to a single low resolution and fixed aspect ratio, limiting the versatility to handle real-world use cases. In this paper, we present OmniEdit, which is an omnipotent editor to handle seven different image editing tasks with any aspect ratio seamlessly. Our contribution is in four folds: (1) OmniEdit is trained by utilizing the supervision from seven different specialist models to ensure task coverage. (2) we utilize importance sampling based on the scores provided by large multimodal models (like GPT-4o) instead of CLIP-score to improve the data quality. (3) we propose a new editing architecture called EditNet to greatly boost the editing success rate, (4) we provide images with different aspect ratios to ensure that our model can handle any image in the wild. We have curated a test set containing images of different aspect ratios, accompanied by diverse instructions to cover different tasks. Both automatic evaluation and human evaluations demonstrate that OmniEdit can significantly outperform all the existing models. Our code, dataset and model will be available at https://tiger-ai-lab.github.io/OmniEdit/
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the limitations that keep existing image-editing methods from practical use. It identifies three specific problems:

1. **Limited editing ability**: Existing image-editing models have limited editing skills because their training data comes from a biased synthesis process. For example, some models perform poorly on local edits (such as adding, deleting, or swapping objects), while others are less effective on global edits (such as style or background changes).
2. **Poor data quality control**: Most methods use simple filtering mechanisms (such as CLIP-score or DINO-score) to select training samples automatically, but these metrics correlate poorly with actual data quality. The resulting training data is noisy, which hurts model performance.
3. **No support for multiple resolutions**: All current models are trained only on square images, which limits their ability to generalize to non-square images.

To overcome these challenges, the paper proposes OmniEdit, an all-purpose editor that handles seven different image-editing tasks and supports any aspect ratio. OmniEdit addresses the problems above through four key innovations:

1. **Expert-to-generalist supervision**: A general-purpose editing model, OmniEdit, is trained using supervision signals from multiple specialist models. Each specialist focuses on a different editing task, so together they ensure task coverage.
2. **Importance sampling**: Large multimodal models (such as GPT-4o) assign quality scores to synthetic samples to improve the quality of the training data. Because scoring with GPT-4o at scale is computationally costly, its scoring ability is first distilled into the medium-sized model InternVL2, which is then used for large-scale scoring.
3. **EditNet architecture**: A new diffusion-transformer architecture, EditNet, promotes interaction between the control branch and the original branch through intermediate representations, improving OmniEdit's ability to understand diverse editing tasks.
4. **Support for any aspect ratio**: Training combines high-resolution images with different aspect ratios, so that OmniEdit can handle images of any aspect ratio without reducing output quality.

With these innovations, OmniEdit shows significant advantages across diverse image-editing tasks. It outperforms existing models on automatic evaluation metrics (such as VIEScore) and also achieves higher perceptual quality and semantic consistency in human evaluations.
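The importance-sampling idea described above can be illustrated with a minimal Python sketch. It assumes each synthetic editing pair already has a numeric quality score from a scoring model, and it draws training samples with probability proportional to that score. The function name `sample_by_quality`, the `min_score` cutoff, and the proportional weighting are illustrative assumptions, not the paper's actual procedure, which this summary does not specify.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_by_quality(
    samples: Sequence[T],
    scores: Sequence[float],
    k: int,
    min_score: float = 0.0,
    seed: int = 0,
) -> List[T]:
    """Draw k samples (with replacement), weighted by quality score.

    Hypothetical helper: samples scoring below `min_score`, or with a
    non-positive score, are dropped before sampling.
    """
    if len(samples) != len(scores):
        raise ValueError("samples and scores must have the same length")

    # Keep only samples that pass the cutoff and carry positive weight.
    kept = [(s, w) for s, w in zip(samples, scores) if w >= min_score and w > 0]
    if not kept:
        return []

    items, weights = zip(*kept)
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    return rng.choices(items, weights=weights, k=k)


if __name__ == "__main__":
    pairs = ["pair_a", "pair_b", "pair_c"]   # placeholder sample identifiers
    quality = [2.0, 9.0, 7.5]                # placeholder scores
    print(sample_by_quality(pairs, quality, k=5, min_score=5.0))
```

In this sketch, higher-scored pairs appear more often in the drawn training set, while pairs below the cutoff never appear; a real pipeline would need to choose the score scale and cutoff empirically.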