Abstract: Diffusion Transformers (DiTs) have recently achieved remarkable success in text-guided image generation. In image editing, DiTs project text and image inputs into a joint latent space, from which they decode and synthesize new images. However, it remains largely unexplored how multimodal information collectively forms this joint space and how it guides the semantics of the synthesized images. In this paper, we investigate the latent space of DiT models and uncover two key properties. First, DiT's latent space is inherently semantically disentangled: different semantic attributes can be controlled by specific editing directions. Second, consistent semantic editing requires utilizing the entire joint latent space, as neither the encoded image nor the encoded text alone contains enough semantic information. We show that these editing directions can be obtained directly from text prompts, enabling precise semantic control without additional training or mask annotations. Based on these insights, we propose a simple yet effective Encode-Identify-Manipulate (EIM) framework for zero-shot fine-grained image editing. Specifically, we first encode both the given source image and the text prompt that describes the image to obtain the joint latent embedding. Then, using our proposed Hessian Score Distillation Sampling (HSDS) method, we identify editing directions that control specific target attributes while preserving other image features. These directions are guided by text prompts and used to manipulate the latent embeddings. Moreover, we propose a new metric to quantify the degree of disentanglement of the latent space of diffusion models. Extensive experimental results on our newly curated benchmark dataset, together with further analysis, demonstrate DiT's disentanglement properties and the effectiveness of the EIM framework.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper attempts to address how to achieve precise zero-shot semantic editing in Diffusion Transformers (DiT). Specifically, the authors explore the disentanglement properties in the latent space of the DiT model and propose a simple yet effective Encode-Identify-Manipulate (EIM) framework to achieve precise semantic editing without additional training or mask annotations.

#### Main Issues:

1. **Disentanglement Properties of Latent Space**: Does the joint latent space of the DiT model possess semantic disentanglement properties? That is, can different semantic attributes be independently controlled?
2. **Precise Semantic Editing**: How can the disentanglement properties of the DiT model be utilized to achieve precise, fine-grained semantic editing without requiring additional training or annotations?

#### Background:

- **Diffusion Models**: Diffusion models have achieved significant success in text-guided image generation tasks, capable of generating diverse and high-fidelity images and videos based on text prompts.
- **DiT Model**: The DiT model enhances text-to-image controllability by embedding input images and text into a joint latent space and processing them through stacked self-attention layers.
- **Limitations of Existing Methods**: Existing methods typically rely on loss-driven approaches or attention maps to link image semantics with text inputs, requiring additional annotations or extensive optimization, which limits their interpretability and generalizability in broad applications.

#### Solution:

- **Latent Space Analysis**: The authors systematically study the latent embedding space of the DiT model and propose a new metric to quantify its degree of disentanglement.
- **EIM Framework**: Based on the disentanglement properties of the latent space, the authors propose the EIM framework for zero-shot image editing, achieving precise and fine-grained semantic control.
- **Benchmark Dataset**: To evaluate the effectiveness of the EIM framework, the authors introduce a new benchmark dataset (ZOPIE), including both automatic and manual annotations, to assess the performance of precise image editing tasks and the disentanglement properties of generative models.

#### Main Contributions:

1. **Systematic Study and Modeling**: For the first time, the paper comprehensively reveals the important disentanglement properties of the DiT model's latent space, laying the foundation for controllable image editing.
2. **EIM Method**: The paper proposes a simple yet effective EIM method, achieving zero-shot image editing through the Hessian Score Distillation Sampling method without additional training or mask annotations.
3. **ZOPIE Benchmark**: The paper introduces a new benchmark dataset to comprehensively evaluate the performance of precise image editing tasks and the disentanglement properties of generative models.

Through these contributions, the paper aims to advance the application of Diffusion Transformers in the field of image editing, achieving more precise and controllable semantic editing.
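The Encode-Identify-Manipulate idea summarized above can be illustrated with a toy sketch. This is not the paper's HSDS procedure, whose details are not given here; it swaps in a much simpler stand-in for the Identify step, taking the normalized difference between two text embeddings as the editing direction. All function names (`text_direction`, `manipulate`), the 4-dimensional vectors, and the `strength` parameter are hypothetical and chosen only for illustration.

```python
import math
from typing import List

Vector = List[float]


def normalize(v: Vector) -> Vector:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def text_direction(source_text_emb: Vector, target_text_emb: Vector) -> Vector:
    """Identify (simplified stand-in, NOT HSDS): use the unit-length
    difference between target and source text embeddings as the
    editing direction."""
    diff = [t - s for s, t in zip(source_text_emb, target_text_emb)]
    return normalize(diff)


def manipulate(joint_latent: Vector, direction: Vector, strength: float) -> Vector:
    """Manipulate: shift the joint latent along the editing direction."""
    return [z + strength * d for z, d in zip(joint_latent, direction)]


# Encode (assumed to have happened already): toy 4-d embeddings.
joint_latent = [0.5, -1.0, 2.0, 0.0]
source_text_emb = [1.0, 0.0, 0.0, 0.0]  # hypothetical "a person" prompt
target_text_emb = [1.0, 0.0, 0.0, 2.0]  # hypothetical "a smiling person" prompt

direction = text_direction(source_text_emb, target_text_emb)
edited = manipulate(joint_latent, direction, strength=1.5)
print(direction)  # [0.0, 0.0, 0.0, 1.0]
print(edited)     # [0.5, -1.0, 2.0, 1.5]
```

In this toy setting only the last coordinate of the latent changes, which is the behavior a disentangled space is meant to provide: the edit moves one attribute and leaves the others untouched. In a real DiT the embeddings are high-dimensional and the direction would come from the model itself, so this sketch only conveys the shape of the computation.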

Latent Space Disentanglement in Diffusion Transformers Enables Precise Zero-shot Semantic Editing
