Text-driven Face Image Generation and Manipulation via Multi-level Residual Mapper

DOI: 10.21655/ijsi.1673-7288.00313

Abstract: Although Generative Adversarial Networks (GANs) have achieved great success in face image generation and manipulation, discovering meaningful directions in the latent space of GANs to manipulate semantic attributes remains a difficult but important challenge in computer vision. Meeting this challenge typically requires large amounts of labeled data and several hours of network fine-tuning. However, obtaining an annotated collection of images for each desired manipulation is usually very expensive and time-consuming. Recent works aim to overcome this limitation by leveraging pre-trained models. While these approaches are promising, neither the accuracy of the manipulation nor the authenticity of the results meets the needs of real face editing scenarios. To address these problems, this paper encodes the image and text description into a shared embedding space and proposes a unified image generation and manipulation framework that leverages the powerful joint representation capability of Contrastive Language-Image Pre-training (CLIP). With carefully designed network structures and loss functions, the proposed framework learns a latent residual mapper network that maps the input conditions into corresponding latent code residuals. This scheme enables the proposed method to perform high-quality face image generation and manipulation by leveraging the generative power of the pre-trained StyleGAN2 model. Extensive experiments demonstrate the superiority of the proposed approach in terms of manipulation accuracy, visual realism, and irrelevant attribute preservation.
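The idea of mapping an input condition to latent code residuals can be illustrated with a small, self-contained sketch. This is not the authors' architecture: the abstract does not specify the mapper's layers, how the levels are split, or the loss functions. Everything below is an assumption made for illustration, including the class name `MultiLevelResidualMapper`, the helper `edit_latent`, the three-way split of layers into levels, the concatenation of each layer's latent vector with the condition vector, and the `strength` parameter. The weights are random and untrained, and plain Python lists stand in for the CLIP embedding and the StyleGAN2 latent code.

```python
import random


def make_linear(in_dim, out_dim, rng, scale=0.1):
    """Random (untrained) weight matrix with out_dim rows and in_dim columns."""
    return [[rng.uniform(-scale, scale) for _ in range(in_dim)]
            for _ in range(out_dim)]


def apply_linear(weight, x):
    """Matrix-vector product: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


class MultiLevelResidualMapper:
    """Hypothetical mapper: one linear map per group ("level") of latent layers."""

    def __init__(self, latent_dim, cond_dim, level_slices, seed=0):
        rng = random.Random(seed)
        # level_slices is an assumed grouping, e.g. coarse / medium / fine layers.
        self.level_slices = level_slices
        self.weights = [make_linear(latent_dim + cond_dim, latent_dim, rng)
                        for _ in level_slices]

    def residuals(self, w_plus, cond):
        """Return one residual vector per latent layer."""
        out = [[0.0] * len(layer) for layer in w_plus]
        for (start, stop), weight in zip(self.level_slices, self.weights):
            for i in range(start, stop):
                # Assumption: condition each layer on its own latent vector
                # concatenated with the text (or image) embedding.
                out[i] = apply_linear(weight, w_plus[i] + cond)
        return out


def edit_latent(w_plus, residuals, strength=1.0):
    """Edited latent code = original latent code + strength * residual."""
    return [[w + strength * r for w, r in zip(layer, res)]
            for layer, res in zip(w_plus, residuals)]


# Toy sizes: 6 latent layers of 8 dimensions, and a 4-dimensional condition.
rng = random.Random(1)
w_plus = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(6)]
text_embedding = [rng.gauss(0.0, 1.0) for _ in range(4)]  # stand-in for a CLIP embedding

mapper = MultiLevelResidualMapper(8, 4, [(0, 2), (2, 4), (4, 6)])
delta = mapper.residuals(w_plus, text_embedding)
edited = edit_latent(w_plus, delta, strength=0.5)
```

In a full system, the edited latent code would be passed to the frozen, pre-trained generator to synthesize the image, and only the mapper would be trained. Because the mapper predicts a residual rather than a new latent code, a `strength` of 0 in this sketch leaves the input unchanged.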

Related:

CLIPVG: Text-Guided Image Manipulation Using Differentiable Vector Graphics

TextCLIP: Text-Guided Face Image Generation And Manipulation Without Adversarial Training

DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation

DF-CLIP: Towards Disentangled and Fine-grained Image Editing from Text

GALIP: Generative Adversarial CLIPs for Text-to-Image Synthesis

Text-Guided Vector Graphics Customization

CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP

VQGAN-CLIP: Open Domain Image Generation and Editing with Natural Language Guidance

Towards Counterfactual Image Manipulation via CLIP

DeltaSpace: A Semantic-aligned Feature Space for Flexible Text-guided Image Editing

E4C: Enhance Editability for Text-Based Image Editing by Harnessing Efficient CLIP Guidance

Where You Edit is What You Get: Text-guided Image Editing with Region-Based Attention

CLIP-PAE: Projection-Augmentation Embedding to Extract Relevant Features for a Disentangled, Interpretable, and Controllable Text-Guided Face Manipulation

Diffusion Feedback Helps CLIP See Better

GrainedCLIP and DiffusionGrainedCLIP: Text-Guided Advanced Models for Fine-Grained Attribute Face Image Processing

CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding

Distilling Knowledge from Text-to-Image Generative Models Improves Visio-Linguistic Reasoning in CLIP

TGIEN: an Interpretable Image Editing Method for IoT Applications Based on Text Guidance

Text-Guided Human Image Manipulation Via Image-Text Shared Space

CgT-GAN: CLIP-guided Text GAN for Image Captioning