A Survey of Multimodal-Guided Image Editing with Text-to-Image Diffusion Models

Xincheng Shuai, Henghui Ding, Xingjun Ma, Rongcheng Tu, Yu-Gang Jiang, Dacheng Tao
2024-06-21
Abstract: Image editing aims to edit the given synthetic or real image to meet the specific requirements from users. It is widely studied in recent years as a promising and challenging field of Artificial Intelligence Generative Content (AIGC). Recent significant advancement in this field is based on the development of text-to-image (T2I) diffusion models, which generate images according to text prompts. These models demonstrate remarkable generative capabilities and have become widely used tools for image editing. T2I-based image editing methods significantly enhance editing performance and offer a user-friendly interface for modifying content guided by multimodal inputs. In this survey, we provide a comprehensive review of multimodal-guided image editing techniques that leverage T2I diffusion models. First, we define the scope of image editing from a holistic perspective and detail various control signals and editing scenarios. We then propose a unified framework to formalize the editing process, categorizing it into two primary algorithm families. This framework offers a design space for users to achieve specific goals. Subsequently, we present an in-depth analysis of each component within this framework, examining the characteristics and applicable scenarios of different combinations. Given that training-based methods learn to directly map the source image to target one under user guidance, we discuss them separately, and introduce injection schemes of source image in different scenarios. Additionally, we review the application of 2D techniques to video editing, highlighting solutions for inter-frame inconsistency. Finally, we discuss open challenges in the field and suggest potential future research directions. We keep tracing related works at https://github.com/xinchengshuai/Awesome-Image-Editing.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The problem this paper addresses is how to use text-to-image (T2I) diffusion models for multimodal-guided image editing. Specifically, the authors focus on improving the quality and flexibility of image editing by combining different forms of user input (such as text, images, and user-interface operations), and on giving users a friendly way to interact with the editing process.

### Main Problem Decomposition

1. **Define the scope of image editing**:
   - The authors argue that previous literature defined image editing too narrowly, usually focusing on reconstructing as many details of the source image as possible while ignoring high-level semantic information (such as identity and style). The paper therefore aims to provide a stricter and more comprehensive definition of image editing.
2. **Propose a unified framework**:
   - To better understand and organize existing multimodal-guided image editing techniques, the authors propose a unified framework that divides the editing process into two major algorithm families: the Inversion Algorithm and the Editing Algorithm. The framework helps users choose appropriate methods for specific requirements and also provides a design space for achieving specific goals.
3. **Analyze existing methods**:
   - The article analyzes in detail the characteristics and applicable scenarios of each method, including attention-based, fusion-based, score-based, and optimization-based editing algorithms.
4. **Explore video editing extensions**:
   - Although the main discussion concerns image editing, the article also covers the application of 2D image editing techniques to video editing, especially solutions to the inter-frame inconsistency problem.
5. **Future research directions**:
   - Finally, the authors discuss the open challenges in the field and propose potential future research directions.

### Formula Examples

The formulas in the article mainly describe the working principle of the diffusion model. For example:

- Forward Process:
  \[ z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon_0 \]
  where \(\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i\), \(\alpha_t = 1 - \beta_t\), and \(\epsilon_0 \sim \mathcal{N}(0, I)\).
- Backward Process:
  \[ z_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( z_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(z_t, t) \right) + \sigma_t \hat{\epsilon}_t \]
  where \(\epsilon_\theta\) is the learned noise predictor, \(\sigma_t\) is the noise scale at step \(t\), and \(\hat{\epsilon}_t \sim \mathcal{N}(0, I)\).

These formulas show how the diffusion model gradually denoises a latent to generate a clear image.

### Summary

Through a comprehensive review of multimodal-guided image editing techniques, this paper both broadens the definition of image editing and provides a systematic framework for classifying and analyzing existing methods, giving future research and practical applications a common reference.
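
The forward and backward processes quoted under Formula Examples can be sketched in a few lines of Python. This is a minimal illustration rather than the survey's implementation. It makes several assumptions: the linear `beta` schedule, the helper names (`make_schedule`, `forward_sample`, `reverse_step`), the choice \(\sigma_t^2 = \beta_t (1 - \bar{\alpha}_{t-1}) / (1 - \bar{\alpha}_t)\), and the use of the true noise in place of a trained predictor \(\epsilon_\theta\). The reverse step uses the standard DDPM coefficient \(\beta_t / \sqrt{1 - \bar{\alpha}_t}\) on the predicted noise.

```python
import math
import random


def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Linear beta schedule (an assumed choice). Index 0 is a placeholder so t runs 1..T."""
    betas = [0.0] + [
        beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        for i in range(num_steps)
    ]
    alphas = [1.0 - b for b in betas]          # alpha_t = 1 - beta_t
    alpha_bars, running = [], 1.0
    for a in alphas:                           # alpha_bar_t = prod_{i<=t} alpha_i
        running *= a
        alpha_bars.append(running)
    return betas, alphas, alpha_bars


def forward_sample(z0, t, alpha_bars, rng):
    """Draw z_t = sqrt(alpha_bar_t) * z_0 + sqrt(1 - alpha_bar_t) * eps_0."""
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    a, b = math.sqrt(alpha_bars[t]), math.sqrt(1.0 - alpha_bars[t])
    zt = [a * z + b * e for z, e in zip(z0, eps)]
    return zt, eps


def reverse_step(zt, t, eps_pred, betas, alphas, alpha_bars, rng):
    """One backward step z_t -> z_{t-1}, given a noise prediction eps_pred."""
    # Assumed noise scale: sigma_t^2 = beta_t * (1 - alpha_bar_{t-1}) / (1 - alpha_bar_t).
    sigma_t = math.sqrt(betas[t] * (1.0 - alpha_bars[t - 1]) / (1.0 - alpha_bars[t]))
    coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
    scale = 1.0 / math.sqrt(alphas[t])
    return [
        scale * (z - coef * e) + sigma_t * rng.gauss(0.0, 1.0)
        for z, e in zip(zt, eps_pred)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    betas, alphas, alpha_bars = make_schedule(num_steps=10)
    z0 = [0.5, -1.0, 2.0]                      # toy "latent" with three entries
    z1, eps = forward_sample(z0, 1, alpha_bars, rng)
    # Feeding the true noise back in stands in for a trained eps_theta (assumption).
    z0_rec = reverse_step(z1, 1, eps, betas, alphas, alpha_bars, rng)
    print(z0_rec)
```

At \(t = 1\) the assumed \(\sigma_1\) is zero, so with the true noise as the prediction the reverse step returns the original latent up to floating-point error. In a real pipeline `eps_pred` would come from the T2I model's denoising network, and editing methods intervene on the inversion or sampling path rather than using an oracle.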