Abstract: Recently introduced Contrastive Language-Image Pre-Training (CLIP) bridges images and text by embedding them into a joint latent space. This opens the door to ample literature that aims to manipulate an input image by providing a textual explanation. However, due to the discrepancy between image and text embeddings in the joint space, using text embeddings as the optimization target often introduces undesired artifacts in the resulting images. Disentanglement, interpretability, and controllability are also hard to guarantee for manipulation. To alleviate these problems, we propose to define corpus subspaces spanned by relevant prompts to capture specific image characteristics. We introduce CLIP Projection-Augmentation Embedding (PAE) as an optimization target to improve the performance of text-guided image manipulation. Our method is a simple and general paradigm that can be easily computed and adapted, and smoothly incorporated into any CLIP-based image manipulation algorithm. To demonstrate the effectiveness of our method, we conduct several theoretical and empirical studies. As a case study, we utilize the method for text-guided semantic face editing. We quantitatively and qualitatively demonstrate that PAE facilitates a more disentangled, interpretable, and controllable image manipulation with state-of-the-art quality and accuracy. Project page: https://chenliang-zhou.github.io/CLIP-PAE/

What problem does this paper attempt to address?

The paper addresses several key issues in text-guided image manipulation, particularly those encountered when using the Contrastive Language-Image Pre-Training (CLIP) framework. Specifically, it targets the following problems:

1. **Unwanted Artifacts**: When text embeddings are used directly as optimization targets, the generated images may exhibit undesirable changes or distortions.
2. **Disentanglement**: The ability to change only the attributes specified by the text prompt during image manipulation, without affecting unrelated attributes.
3. **Interpretability**: Understanding how the model makes decisions, i.e., why and how editing the latent code affects the output image.
4. **Controllability**: The ability to freely control the degree of change for each factor.

To address these issues, the authors propose a new technique called **Projection-Augmentation Embedding (PAE)**. PAE is an alternative optimization target in the CLIP joint space, designed to improve text-based image manipulation performance, especially for semantic editing of facial images.

### Main Contributions

1. **CLIP Space Analysis**: The authors conducted several empirical analyses that reveal the limitations of directly using the CLIP loss for text-guided image editing and uncover some unique properties of CLIP subspaces.
2. **Proposal of PAE**: Based on these findings, the authors proposed PAE as an approximation that lies closer to the true target image embedding than the text embedding does. PAE guides the image to change in the direction of the target text prompt by projecting the input image embedding into a subspace constructed from relevant text prompts and then augmenting the projection toward the target.
3. **Experimental Validation**: Through a series of text-guided facial semantic editing experiments, the authors demonstrated that using PAE achieves more disentangled, interpretable, and controllable facial image manipulation. These experiments quantitatively evaluated PAE's performance and qualitatively demonstrated its advantages.

### Technical Details

- **Non-overlapping Image and Text Embeddings**: The authors observed that image and text embeddings do not overlap in the CLIP joint space, which leads to artifacts or irrelevant attribute changes when a text embedding is optimized toward directly.
- **Subspace Construction**: By constructing a subspace spanned by relevant text descriptions, the image changes can be constrained to the relevant attributes only.
- **Construction of PAE**: PAE involves three steps: first, projecting the input image embedding into a subspace of relevant attributes; second, enhancing the influence of the target text within the subspace; and finally, adding back the residual to maintain the characteristics of the image region.

Through these methods, PAE can better approximate the true target image embedding and improve several aspects of CLIP-based text-guided image manipulation.
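The three construction steps described above (project, augment, add back the residual) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the use of a QR decomposition to obtain an orthonormal basis, and the additive augmentation scaled by a hypothetical strength parameter `alpha` are all assumptions made for illustration. The paper may define the subspace and the augmentation differently, and real inputs would be CLIP embeddings rather than arbitrary vectors.

```python
import numpy as np


def orthonormal_basis(prompt_embeddings):
    """Orthonormal basis (as columns) for the span of the prompt embeddings.

    prompt_embeddings: array of shape (k, d), one embedding per row.
    Assumes the k prompts are linearly independent and k <= d.
    """
    # Reduced QR of the (d, k) matrix gives k orthonormal columns.
    q, _ = np.linalg.qr(prompt_embeddings.T)
    return q


def project(vec, basis):
    """Orthogonal projection of vec onto the subspace spanned by basis columns."""
    return basis @ (basis.T @ vec)


def projection_augmentation_embedding(image_emb, text_emb, basis, alpha=1.0):
    """Illustrative PAE-style target; alpha is a hypothetical strength knob."""
    # Step 1: project the image embedding into the attribute subspace.
    proj_img = project(image_emb, basis)
    residual = image_emb - proj_img

    # Step 2 (assumed form): push the projection toward the target text,
    # using the unit-normalized projection of the text embedding.
    proj_txt = project(text_emb, basis)
    norm = np.linalg.norm(proj_txt)
    direction = proj_txt / norm if norm > 0 else proj_txt
    augmented = proj_img + alpha * direction

    # Step 3: add back the residual so everything outside the subspace
    # stays exactly as it was in the input image embedding.
    return augmented + residual
```

Under these assumptions, the result differs from the input image embedding only inside the prompt subspace, which is the mechanism the summary credits for disentanglement. Setting `alpha` to zero returns the original image embedding unchanged, and larger values move the target further toward the text, which corresponds to the controllability described above.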

CLIP-PAE: Projection-Augmentation Embedding to Extract Relevant Features for a Disentangled, Interpretable, and Controllable Text-Guided Face Manipulation

Towards Interactive Facial Image Inpainting by Text or Exemplar Image

Predict, Prevent, and Evaluate: Disentangled Text-Driven Image Manipulation Empowered by Pre-Trained Vision-Language Model

TextCLIP: Text-Guided Face Image Generation And Manipulation Without Adversarial Training

DF-CLIP: Towards Disentangled and Fine-grained Image Editing from Text

ManiCLIP: Multi-attribute Face Manipulation from Text

Focus, Distinguish, and Prompt: Unleashing CLIP for Efficient and Flexible Scene Text Retrieval

ExpCLIP: Bridging Text and Facial Expressions via Semantic Alignment

FaceCLIPNeRF: Text-driven 3D Face Manipulation using Deformable Neural Radiance Fields

GazeCLIP: Towards Enhancing Gaze Estimation via Text Guidance

Zero-shot Text-driven Physically Interpretable Face Editing

ChatFace: Chat-Guided Real Face Editing via Diffusion Latent Space Manipulation

One Model to Edit Them All: Free-Form Text-Driven Image Manipulation with Semantic Modulations

Revealing Directions for Text-guided 3D Face Editing

ProtoCLIP: Prototypical Contrastive Language Image Pretraining

FEAT: Face Editing with Attention

IA-FaceS: A bidirectional method for semantic face editing

StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery

DeltaSpace: A Semantic-aligned Feature Space for Flexible Text-guided Image Editing

Long-CLIP: Unlocking the Long-Text Capability of CLIP

E4C: Enhance Editability for Text-Based Image Editing by Harnessing Efficient CLIP Guidance