Abstract: Research in vision-language models has seen rapid developments of late, enabling natural language-based interfaces for image generation and manipulation. Many existing text guided manipulation techniques are restricted to specific classes of images, and often require fine-tuning to transfer to a different style or domain. Nevertheless, generic image manipulation using a single model with flexible text inputs is highly desirable. Recent work addresses this task by guiding generative models trained on generic image datasets using pretrained vision-language encoders. While promising, this approach requires expensive optimization for each input. In this work, we propose an optimization-free method for the task of generic image manipulation from text prompts. Our approach exploits recent Latent Diffusion Models (LDM) for text to image generation to achieve zero-shot text guided manipulation. We employ a deterministic forward diffusion in a lower dimensional latent space, and the desired manipulation is achieved by simply providing the target text to condition the reverse diffusion process. We refer to our approach as LDEdit. We demonstrate the applicability of our method on semantic image manipulation and artistic style transfer. Our method can accomplish image manipulation on diverse domains and enables editing multiple attributes in a straightforward fashion. Extensive experiments demonstrate the benefit of our approach over competing baselines.


### What problem does this paper attempt to solve?

This paper aims to develop a fast and flexible method for general manipulation of open-domain images using arbitrary text prompts. Specifically, the authors want a single model to carry out a range of image-editing tasks from text prompts, from simple color changes to modifications of multiple semantic attributes and artistic style transfer.

#### Main challenges

1. **Limitations of existing methods**:
   - Many existing text-guided image-manipulation techniques are limited to specific categories of images and usually need to be fine-tuned for different styles or domains.
   - Some methods need to fine-tune the model for each specific text prompt, which further limits their practicality for flexible open-domain image manipulation.
   - Some existing methods can handle general image manipulation, but they require an expensive optimization process for each input.
2. **Objectives**:
   - Support a wide range of image-manipulation tasks, including but not limited to changing the color of objects, modifying multiple semantic attributes, and changing the artistic style of images.
   - Complete these tasks using a single model without further optimization or fine-tuning.
   - Provide an efficient and intuitive user-guided editing tool that supports the parallel generation of diverse samples.

#### Solution

The authors propose a new method named **LDEdit**, which uses Latent Diffusion Models (LDM) and non-Markovian diffusion to achieve zero-shot text-guided image manipulation. The specific steps are as follows:

1. **Forward diffusion**: Encode the source image into a latent code \( z_0 \) in a low-dimensional latent space, then gradually add noise to it through a deterministic forward diffusion process until time step \( t_{\text{stop}} \).
2. **Reverse diffusion**: Condition the reverse diffusion process on the target text prompt, starting from the same noisy latent code \( z_{t_{\text{stop}}} \) and gradually denoising to generate the desired editing result \( \hat{z}_0 \).
3. **Control randomness**: A controllable randomness parameter \( \eta \) trades off diversity against fidelity, which helps with targets that differ substantially from the original input.

#### Key innovation points

- **Deterministic diffusion**: Deterministic forward and reverse diffusion processes give approximate cycle-consistency between the source image and the target image.
- **Zero-shot manipulation**: Multiple image-editing tasks can be completed without additional optimization or fine-tuning.
- **Flexibility and efficiency**: The method can edit images from different domains, and its runtime is lower than that of existing methods.

### Summary

LDEdit provides a framework that makes text-based image editing more flexible, more efficient, and applicable to a wide range of image types and editing tasks. By using latent diffusion models and deterministic diffusion processes, LDEdit avoids the per-input optimization cost and the fine-tuning requirements of earlier methods while maintaining high quality.
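The forward diffusion, reverse diffusion, and \( \eta \) steps described above can be illustrated with a minimal NumPy sketch of DDIM-style updates. This is not the authors' implementation: the linear noise schedule, the function names, and in particular `toy_eps` (a closed-form stand-in for the text-conditioned denoiser, where a vector `cond` plays the role of the text embedding) are assumptions made so the example runs without a trained model. The image encoder and decoder of the latent space are omitted.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)  # assumed linear noise schedule
# abar[t]: cumulative signal level after t noising steps; abar[0] = 1 (clean latent).
abar = np.concatenate([[1.0], np.cumprod(1.0 - betas)])


def toy_eps(z, t, cond):
    """Hypothetical stand-in for the text-conditioned noise predictor.

    It is the exact noise predictor when clean latents follow N(cond, I),
    so `cond` acts like a text embedding that selects the target mode.
    """
    return np.sqrt(1.0 - abar[t]) * (z - np.sqrt(abar[t]) * cond)


def predict_z0(z, t, eps):
    """Estimate the clean latent from the noisy latent and predicted noise."""
    return (z - np.sqrt(1.0 - abar[t]) * eps) / np.sqrt(abar[t])


def timesteps(t_stop, n_steps):
    """Evenly spaced integer time steps from 0 to t_stop (use n_steps <= t_stop)."""
    return np.round(np.linspace(0, t_stop, n_steps + 1)).astype(int)


def forward_deterministic(z0, cond, t_stop, n_steps):
    """Step 1: deterministic noising of z0 up to t_stop, without sampling noise."""
    ts = timesteps(t_stop, n_steps)
    z = z0.copy()
    for t, t_next in zip(ts[:-1], ts[1:]):
        eps = toy_eps(z, t, cond)
        z = np.sqrt(abar[t_next]) * predict_z0(z, t, eps) + np.sqrt(1.0 - abar[t_next]) * eps
    return z


def reverse_diffusion(z, cond, t_stop, n_steps, eta=0.0, rng=None):
    """Steps 2 and 3: denoise from t_stop to 0 under `cond`; eta > 0 injects noise."""
    if rng is None:
        rng = np.random.default_rng(0)
    ts = timesteps(t_stop, n_steps)[::-1]
    for t, t_prev in zip(ts[:-1], ts[1:]):
        eps = toy_eps(z, t, cond)
        sigma = eta * np.sqrt((1.0 - abar[t_prev]) / (1.0 - abar[t])) \
            * np.sqrt(1.0 - abar[t] / abar[t_prev])
        direction = np.sqrt(np.maximum(1.0 - abar[t_prev] - sigma**2, 0.0))
        z = (np.sqrt(abar[t_prev]) * predict_z0(z, t, eps)
             + direction * eps + sigma * rng.standard_normal(z.shape))
    return z


# Toy usage: a 16-dimensional "latent" with source and target conditions.
rng = np.random.default_rng(0)
z0 = 0.5 * rng.standard_normal(16)
src_cond = np.zeros(16)       # stands in for the source description
tgt_cond = np.full(16, 3.0)   # stands in for the target text prompt

z_stop = forward_deterministic(z0, src_cond, t_stop=500, n_steps=50)
z_rec = reverse_diffusion(z_stop, src_cond, t_stop=500, n_steps=50)   # same condition
z_edit = reverse_diffusion(z_stop, tgt_cond, t_stop=500, n_steps=50)  # target condition

print("reconstruction error:", np.linalg.norm(z_rec - z0))
print("distance to target before edit:", np.linalg.norm(z0 - tgt_cond))
print("distance to target after edit: ", np.linalg.norm(z_edit - tgt_cond))
```

In this toy setting, reversing with the same condition recovers `z0` only approximately, and the gap shrinks as `n_steps` grows, which mirrors the approximate cycle-consistency mentioned above. Reversing with `tgt_cond` moves the latent toward the target while keeping the per-dimension variation of `z0`. Setting `eta` above zero makes repeated runs produce different outputs. How these effects carry over to a real latent diffusion model depends on the trained denoiser.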

LDEdit: Towards Generalized Text Guided Image Manipulation via Latent Diffusion Models

LEDITS: Real Image Editing with DDPM Inversion and Semantic Guidance

Leveraging LLMs for On-the-Fly Instruction Guided Image Editing

Towards Real-time Text-driven Image Manipulation with Unconditional Diffusion Models

ClickDiffusion: Harnessing LLMs for Interactive Precise Image Editing

Latent Space Disentanglement in Diffusion Transformers Enables Precise Zero-shot Semantic Editing

PRedItOR: Text Guided Image Editing with Diffusion Prior

LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models

Latent Space Disentanglement in Diffusion Transformers Enables Zero-shot Fine-grained Semantic Editing

Lightweight Text-Driven Image Editing With Disentangled Content and Attributes

Text-Driven Image Editing via Learnable Regions

Region-Aware Diffusion for Zero-shot Text-driven Image Editing

Forgedit: Text Guided Image Editing via Learning and Forgetting

ST-LDM: A Universal Framework for Text-Grounded Object Generation in Real Images

DM-Align: Leveraging the Power of Natural Language Instructions to Make Changes to Images

Learning to Follow Object-Centric Image Editing Instructions Faithfully

SGEdit: Bridging LLM with Text2Image Generative Model for Scene Graph-based Image Editing

ChatFace: Chat-Guided Real Face Editing via Diffusion Latent Space Manipulation

InstructGIE: Towards Generalizable Image Editing

A Survey of Multimodal-Guided Image Editing with Text-to-Image Diffusion Models

Imagic: Text-Based Real Image Editing with Diffusion Models