Prompt-Based Modality Bridging for Unified Text-to-Face Generation and Manipulation

Yiyang Ma, Haowei Kuang, Huan Yang, Jianlong Fu, Jiaying Liu
DOI: https://doi.org/10.1145/3694974
2024-01-01
Abstract: Text-driven face image generation and manipulation are significant tasks. However, both are challenging because of the gap between the text and image modalities. Current methods are usually designed for only one of the two tasks, which makes it difficult to handle both with a single method and limits their application in real scenarios. To address both tasks in one framework, we propose a Unified Prompt-based Cross-Modal Framework (UPCM-Frame) that bridges the gap between the text modality and the image modality with CLIP and StyleGAN, two large-scale pre-trained models. The proposed framework consists of two main modules: a Text Embedding-to-Image Embedding projection module based on a special prompt embedding pair, and a projection module that maps Image Embeddings to semantically aligned StyleGAN Embeddings, which can be used in both image generation and manipulation. Because it builds on large-scale pre-trained models, the framework can handle complicated descriptions and generate high-quality results. To demonstrate the effectiveness of the proposed method on the two tasks, we evaluate its results both quantitatively and qualitatively.
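The two-module data flow described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the internals of either module, so everything below is an assumption made for illustration: the 512-dimensional sizes, the offset-based use of the prompt embedding pair, the two-layer mapper, and all function and class names are hypothetical, and random vectors stand in for real CLIP embeddings and StyleGAN latents. It is not the authors' implementation.

```python
import numpy as np

EMBED_DIM = 512   # assumed size of the CLIP text/image embeddings
LATENT_DIM = 512  # assumed size of the StyleGAN embedding


def l2_normalize(v):
    """Scale a vector to unit length along its last axis."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)


def text_to_image_embedding(text_emb, prompt_text_emb, prompt_image_emb):
    """Hypothetical module 1: shift a text embedding by the offset between
    the two halves of a prompt embedding pair, then renormalize."""
    offset = prompt_image_emb - prompt_text_emb
    return l2_normalize(text_emb + offset)


class ImageToLatentMapper:
    """Hypothetical module 2: an untrained two-layer MLP standing in for
    the image-embedding-to-StyleGAN-embedding projection."""

    def __init__(self, rng, hidden_dim=1024):
        self.w1 = rng.standard_normal((EMBED_DIM, hidden_dim)) * 0.02
        self.w2 = rng.standard_normal((hidden_dim, LATENT_DIM)) * 0.02

    def __call__(self, image_emb):
        hidden = np.maximum(image_emb @ self.w1, 0.0)  # ReLU
        return hidden @ self.w2


def generate_latent(text_emb, prompt_pair, mapper):
    """Generation path: text embedding -> image embedding -> latent."""
    return mapper(text_to_image_embedding(text_emb, *prompt_pair))


def manipulate_latent(source_latent, text_emb, neutral_text_emb,
                      prompt_pair, mapper, strength=1.0):
    """Manipulation path (assumed): move an existing latent along the
    direction between a neutral and a target description."""
    target = generate_latent(text_emb, prompt_pair, mapper)
    neutral = generate_latent(neutral_text_emb, prompt_pair, mapper)
    return source_latent + strength * (target - neutral)


# Random stand-ins for real embeddings and latents.
rng = np.random.default_rng(0)
prompt_pair = (l2_normalize(rng.standard_normal(EMBED_DIM)),
               l2_normalize(rng.standard_normal(EMBED_DIM)))
text_emb = l2_normalize(rng.standard_normal(EMBED_DIM))
neutral_emb = l2_normalize(rng.standard_normal(EMBED_DIM))
mapper = ImageToLatentMapper(rng)

image_emb = text_to_image_embedding(text_emb, *prompt_pair)
new_latent = generate_latent(text_emb, prompt_pair, mapper)
source_latent = rng.standard_normal(LATENT_DIM)
edited_latent = manipulate_latent(source_latent, text_emb, neutral_emb,
                                  prompt_pair, mapper, strength=0.5)
```

In this sketch the same latent space serves both tasks: generation feeds the mapped latent directly to a generator, while manipulation adds a text-derived direction to the latent of an existing image. That mirrors the abstract's claim that the StyleGAN Embeddings are usable for both, though the actual mechanism in the paper may differ.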