Abstract: Diffusion Transformers (DiTs) have achieved remarkable success in diverse and high-quality text-to-image (T2I) generation. However, how text and image latents individually and jointly contribute to the semantics of generated images remains largely unexplored. Through our investigation of DiT's latent space, we have uncovered key findings that unlock the potential for zero-shot fine-grained semantic editing: (1) Both the text and image spaces in DiTs are inherently decomposable. (2) These spaces collectively form a disentangled semantic representation space, enabling precise and fine-grained semantic control. (3) Effective image editing requires the combined use of both text and image latent spaces. Leveraging these insights, we propose a simple and effective Extract-Manipulate-Sample (EMS) framework for zero-shot fine-grained image editing. Our approach first utilizes a multi-modal Large Language Model to convert input images and editing targets into text descriptions. We then linearly manipulate text embeddings based on the desired editing degree and employ constrained score distillation sampling to manipulate image embeddings. We quantify the disentanglement degree of the latent space of diffusion models by proposing a new metric. To evaluate fine-grained editing performance, we introduce a comprehensive benchmark incorporating human annotations, manual evaluation, and automatic metrics. We have conducted extensive experiments and in-depth analysis to characterize the semantic disentanglement properties of the diffusion transformer and to validate the effectiveness of our proposed method. Our annotated benchmark dataset is publicly available at https://anonymous.com/anonymous/EMS-Benchmark, facilitating reproducible research in this domain.
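
The linear manipulation of text embeddings mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the interpolation rule below (moving a source embedding toward a target embedding by an editing degree between 0 and 1) is an assumption, and the function name `interpolate_embeddings` and the toy vectors are hypothetical. Real text embeddings would come from the DiT's text encoder rather than hand-written lists.

```python
from typing import List


def interpolate_embeddings(
    source: List[float], target: List[float], degree: float
) -> List[float]:
    """Move `source` toward `target` by `degree` (0 = source, 1 = target).

    Assumed rule: e_edit = e_src + degree * (e_tgt - e_src).
    This is an illustrative guess, not the paper's confirmed formula.
    """
    if len(source) != len(target):
        raise ValueError("embeddings must have the same dimension")
    return [s + degree * (t - s) for s, t in zip(source, target)]


# Toy 4-dimensional vectors standing in for real text embeddings.
src = [0.0, 1.0, 2.0, 3.0]
tgt = [1.0, 1.0, 0.0, 5.0]

# An editing degree of 0.5 lands halfway between the two descriptions.
half = interpolate_embeddings(src, tgt, 0.5)
```

Under this assumption, sweeping `degree` from 0 to 1 would yield a sequence of embeddings that correspond to progressively stronger edits. The image-side step (constrained score distillation sampling) is not covered by this sketch.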

Predict, Prevent, and Evaluate: Disentangled Text-Driven Image Manipulation Empowered by Pre-Trained Vision-Language Model

CLIP-PAE: Projection-Augmentation Embedding to Extract Relevant Features for a Disentangled, Interpretable, and Controllable Text-Guided Face Manipulation

Towards Interactive Facial Image Inpainting by Text or Exemplar Image

Lightweight Text-Driven Image Editing With Disentangled Content and Attributes

Text-Guided Human Image Manipulation Via Image-Text Shared Space

DF-CLIP: Towards Disentangled and Fine-grained Image Editing from Text

Toward Effective Image Manipulation Detection with Proposal Contrastive Learning

DECap: Towards Generalized Explicit Caption Editing Via Diffusion Mechanism

DPE: Disentanglement of Pose and Expression for General Video Portrait Editing

LDEdit: Towards Generalized Text Guided Image Manipulation via Latent Diffusion Models

TextCLIP: Text-Guided Face Image Generation And Manipulation Without Adversarial Training

A Latent Transformer for Disentangled Face Editing in Images and Videos

ManiCLIP: Multi-attribute Face Manipulation from Text

Latent Space Disentanglement in Diffusion Transformers Enables Zero-shot Fine-grained Semantic Editing

DisControlFace: Disentangled Control for Personalized Facial Image Editing

LEED: Label-Free Expression Editing via Disentanglement

Joint Quality Assessment and Example-Guided Image Processing by Disentangling Picture Appearance from Content

DesignEdit: Multi-Layered Latent Decomposition and Fusion for Unified & Accurate Image Editing

StyleDiffusion: Prompt-Embedding Inversion for Text-Based Editing

An Item is Worth a Prompt: Versatile Image Editing with Disentangled Control