Towards Open-World Text-Guided Face Image Generation and Manipulation

Weihao Xia, Yujiu Yang, Jing-Hao Xue, Baoyuan Wu
DOI: https://doi.org/10.48550/arXiv.2104.08910
2021-04-19
Abstract: The existing text-guided image synthesis methods can only produce limited-quality results with at most $256^2$ resolution, and the textual instructions are constrained to a small corpus. In this work, we propose a unified framework for both face image generation and manipulation that produces diverse and high-quality images with an unprecedented resolution at 1024 from multimodal inputs. More importantly, our method supports open-world scenarios, including both image and text, without any re-training, fine-tuning, or post-processing. To be specific, we propose a brand new paradigm of text-guided image generation and manipulation based on the superior characteristics of a pretrained GAN model. Our proposed paradigm includes two novel strategies. The first strategy is to train a text encoder to obtain latent codes that align with the hierarchical semantics of the aforementioned pretrained GAN model. The second strategy is to directly optimize the latent codes in the latent space of the pretrained GAN model with guidance from a pretrained language model. The latent codes can be randomly sampled from a prior distribution or inverted from a given image, which provides inherent support for both image generation and manipulation from multi-modal inputs, such as sketches or semantic labels, with textual guidance. To facilitate text-guided multi-modal synthesis, we propose the Multi-Modal CelebA-HQ, a large-scale dataset consisting of real face images and corresponding semantic segmentation maps, sketches, and textual descriptions. Extensive experiments on the introduced dataset demonstrate the superior performance of our proposed method. Code and data are available at https://github.com/weihaox/TediGAN.
Computer Vision and Pattern Recognition, Multimedia
What problem does this paper attempt to address?
This paper addresses two limitations of existing text-guided image synthesis methods: the quality of the generated images and the flexibility of the text instructions. Existing methods produce results of limited quality, with a maximum resolution of 256x256, and their text instructions are restricted to a relatively small corpus. They struggle with high-resolution images, especially at 1024x1024, and their multi-stage training process is time-consuming and complex. They also generalize poorly: methods trained on small datasets usually cannot handle out-of-distribution data, let alone open-world images and texts of varying complexity.

To overcome these problems, the authors propose a unified framework for the generation and manipulation of face images. The framework generates diverse, high-quality images from multi-modal inputs at a resolution of up to 1024x1024. More importantly, it supports open-world scenarios, covering both images and texts, without any retraining, fine-tuning, or post-processing.

The authors propose two novel strategies to achieve this goal:

1. **The first strategy**: Train a text encoder to obtain latent codes that align with the hierarchical semantics of the pre-trained GAN model. This is achieved through three modules:
   - **Image encoder training module**: Train an image encoder to find the latent code of a given image in the W space.
   - **Visual-language similarity module**: Learn to align language representations with visual representations by projecting images and texts into the common W space.
   - **Instance-level optimization module**: Keep the identity unchanged during editing, accurately manipulating the desired attributes while faithfully reconstructing the irrelevant ones.
2. **The second strategy**: Directly optimize the latent code in the latent space of the pre-trained GAN model, using a pre-trained language model for guidance. This approach can create images from open-world images or texts and supports manipulating a region of interest in a given image.

In addition, to facilitate text-guided multi-modal synthesis, the authors introduce Multi-Modal CelebA-HQ, a large-scale dataset that contains real face images and their corresponding semantic segmentation maps, sketches, and text descriptions.

Through these contributions, the paper aims to provide a reliable and efficient method that maintains high quality and diversity when generating and manipulating images, while supporting multi-modal inputs and open-world scenarios.
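The idea behind the second strategy, optimizing a latent code directly under guidance from a text representation, can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not the authors' implementation: `toy_generator` is a small fixed linear map standing in for the pretrained GAN followed by an image embedding, `text_emb` is a hand-written vector standing in for the output of a pretrained language model, and the loss (one minus cosine similarity, plus a penalty that keeps the code near its starting point) is one plausible choice rather than the paper's objective. Gradients are taken by finite differences so that the example needs only the Python standard library.

```python
import math

# Hypothetical stand-in for "pretrained GAN + image embedding": a fixed linear map.
A = [[1.0, 0.5, 0.0, 0.0],
     [0.0, 1.0, 0.5, 0.0],
     [0.0, 0.0, 1.0, 0.5]]

def toy_generator(w):
    """Map a 4-d latent code to a 3-d 'image embedding' (illustrative only)."""
    return [sum(a * x for a, x in zip(row, w)) for row in A]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def loss(w, w0, text_emb, lam):
    """Text-guidance term plus a penalty for drifting away from the start code w0."""
    guidance = 1.0 - cosine(toy_generator(w), text_emb)
    drift = lam * sum((a - b) ** 2 for a, b in zip(w, w0))
    return guidance + drift

def numeric_grad(f, w, eps=1e-5):
    """Central finite-difference gradient of f at w."""
    grad = []
    for i in range(len(w)):
        up, down = list(w), list(w)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2.0 * eps))
    return grad

def optimize_latent(w0, text_emb, lam=0.01, lr=0.1, steps=200):
    """Plain gradient descent on the latent code; the generator stays fixed."""
    w = list(w0)
    objective = lambda z: loss(z, w0, text_emb, lam)
    for _ in range(steps):
        g = numeric_grad(objective, w)
        w = [x - lr * gx for x, gx in zip(w, g)]
    return w

# w0 stands in for a code sampled from a prior or inverted from a given image.
w0 = [1.0, 0.0, 0.0, 0.0]
text_emb = [0.0, 1.0, 0.0]  # hypothetical text embedding
w_opt = optimize_latent(w0, text_emb)

print("similarity before:", cosine(toy_generator(w0), text_emb))
print("similarity after: ", cosine(toy_generator(w_opt), text_emb))
```

Only the latent code is updated, which mirrors the claim that no retraining or fine-tuning of the generator is needed. In a real system the finite-difference gradient would be replaced by automatic differentiation through the actual generator and guidance model.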