CLIP-PAE: Projection-Augmentation Embedding to Extract Relevant Features for a Disentangled, Interpretable, and Controllable Text-Guided Face Manipulation

Chenliang Zhou, Fangcheng Zhong, Cengiz Oztireli
2024-07-12
Abstract: Recently introduced Contrastive Language-Image Pre-Training (CLIP) bridges images and text by embedding them into a joint latent space. This opens the door to ample literature that aims to manipulate an input image by providing a textual explanation. However, due to the discrepancy between image and text embeddings in the joint space, using text embeddings as the optimization target often introduces undesired artifacts in the resulting images. Disentanglement, interpretability, and controllability are also hard to guarantee for manipulation. To alleviate these problems, we propose to define corpus subspaces spanned by relevant prompts to capture specific image characteristics. We introduce CLIP Projection-Augmentation Embedding (PAE) as an optimization target to improve the performance of text-guided image manipulation. Our method is a simple and general paradigm that can be easily computed and adapted, and smoothly incorporated into any CLIP-based image manipulation algorithm. To demonstrate the effectiveness of our method, we conduct several theoretical and empirical studies. As a case study, we utilize the method for text-guided semantic face editing. We quantitatively and qualitatively demonstrate that PAE facilitates a more disentangled, interpretable, and controllable image manipulation with state-of-the-art quality and accuracy. Project page: https://chenliang-zhou.github.io/CLIP-PAE/
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper addresses several key issues in text-guided image manipulation, particularly those that arise when using the Contrastive Language-Image Pre-Training (CLIP) framework:

1. **Unwanted Artifacts**: When text embeddings are used directly as optimization targets, the generated images may exhibit undesirable changes or distortions.
2. **Disentanglement**: The ability to change only the attributes specified by the text prompt, without affecting unrelated attributes.
3. **Interpretability**: Understanding how the model makes decisions, i.e., why and how editing the latent code affects the output image.
4. **Controllability**: The ability to freely control the degree of change for each factor.

To address these issues, the authors propose a new technique called **Projection-Augmentation Embedding (PAE)**. PAE is an alternative optimization target in the CLIP joint space, designed to improve text-based image manipulation, especially semantic editing of facial images.

### Main Contributions

1. **CLIP Space Analysis**: The authors conduct several empirical analyses that reveal the limitations of directly using the CLIP loss for text-guided image editing and uncover some unique properties of CLIP subspaces.
2. **Proposal of PAE**: Based on these findings, the authors propose PAE as an approximation that lies closer to the true target image embedding than the text embedding does. PAE guides the image toward the target text prompt by projecting the input image embedding into a subspace constructed from relevant text prompts and then augmenting it within that subspace.
3. **Experimental Validation**: Through a series of text-guided facial semantic editing experiments, the authors show that PAE achieves more disentangled, interpretable, and controllable facial image manipulation. The experiments evaluate PAE quantitatively and demonstrate its advantages qualitatively.

### Technical Details

- **Non-overlapping Image and Text Embeddings**: The authors observe that image and text embeddings do not overlap in the CLIP joint space, so optimizing directly toward a text embedding leads to artifacts or changes in irrelevant attributes.
- **Subspace Construction**: By constructing a subspace spanned by relevant text descriptions, image changes can be constrained to the relevant attributes only.
- **Construction of PAE**: PAE involves three steps: first, projecting the input image embedding into the subspace of relevant attributes; second, strengthening the influence of the target text within that subspace; and finally, adding back the residual to preserve the characteristics of the image region.

Through these methods, PAE better approximates the true target image embedding and improves CLIP-based text-guided image manipulation in several respects.
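
The three construction steps described above (project, augment, add back the residual) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the summary does not give the exact formulas, so the QR-based basis, the additive form of the augmentation, the `alpha` strength parameter, and all function names are assumptions. Random vectors stand in for real CLIP embeddings.

```python
import numpy as np


def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)


def orthonormal_basis(corpus_embeddings):
    """Return a (d, k) matrix whose columns span the corpus subspace.

    corpus_embeddings: (k, d) array, one text embedding per relevant prompt.
    Using a reduced QR decomposition here is an assumption of this sketch.
    """
    q, _ = np.linalg.qr(corpus_embeddings.T)
    return q


def pae(image_emb, target_text_emb, corpus_embeddings, alpha=1.0):
    """Hypothetical projection-augmentation embedding.

    alpha (assumed name) controls how strongly the target text is emphasized.
    """
    basis = orthonormal_basis(corpus_embeddings)

    # Step 1: project the image embedding into the subspace of relevant attributes.
    projected = basis @ (basis.T @ image_emb)
    residual = image_emb - projected

    # Step 2: strengthen the target text direction inside the subspace
    # (simple additive augmentation, assumed for illustration).
    text_in_subspace = basis @ (basis.T @ target_text_emb)
    augmented = projected + alpha * unit(text_in_subspace)

    # Step 3: add the residual back so everything outside the subspace is kept.
    return unit(augmented + residual)


# Usage with stand-in data: 16-dimensional embeddings, 3 corpus prompts.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(3, 16))
image_emb = unit(rng.normal(size=16))
target_text_emb = unit(rng.normal(size=16))
edited = pae(image_emb, target_text_emb, corpus, alpha=2.0)
```

In this sketch, the part of the embedding outside the corpus subspace changes only by the final rescaling, which mirrors the disentanglement argument: only attributes captured by the relevant prompts are pushed toward the target. Setting `alpha=0.0` returns the original image embedding, so `alpha` acts as a controllable edit strength.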