MM2Latent: Text-to-facial image generation and editing in GANs with multimodal assistance

Debin Meng,Christos Tzelepis,Ioannis Patras,Georgios Tzimiropoulos
2024-09-17
Abstract: Generating human portraits is a hot topic in the image generation area, e.g. mask-to-face generation and text-to-face generation. However, these unimodal generation methods lack controllability in image generation. Controllability can be enhanced by exploring the advantages and complementarities of various modalities. For instance, we can utilize the advantages of text in controlling diverse attributes and masks in controlling spatial locations. Current state-of-the-art methods in multimodal generation face limitations due to their reliance on extensive hyperparameters, manual operations during the inference stage, substantial computational demands during training and inference, or inability to edit real images. In this paper, we propose a practical framework - MM2Latent - for multimodal image generation and editing. We use StyleGAN2 as our image generator, FaRL for text encoding, and train autoencoders for spatial modalities like mask, sketch and 3DMM. We propose a strategy that involves training a mapping network to map the multimodal input into the w latent space of StyleGAN. The proposed framework 1) eliminates hyperparameters and manual operations in the inference stage, 2) ensures fast inference speeds, and 3) enables the editing of real images. Extensive experiments demonstrate that our method exhibits superior performance in multimodal image generation, surpassing recent GAN- and diffusion-based methods. Also, it proves effective in multimodal image editing and is faster than GAN- and diffusion-based methods. We make the code publicly available at: https://github.com/Open-Debin/MM2Latent
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper aims to improve the controllability and efficiency of multimodal face image generation and editing. The authors argue that unimodal generation methods (such as text-to-face or mask-to-face generation) offer limited control over the generated image. Existing multimodal methods do combine several modalities, but they rely on extensive hyperparameter tuning, require manual operations at the inference stage, have high computational demands during training and inference, or cannot edit real images.

To address these problems, the paper proposes a framework named **MM2Latent**, which targets three improvements:

1. **Eliminating hyperparameter tuning and manual operations**: MM2Latent requires neither at the inference stage, which simplifies its use.
2. **Ensuring fast inference**: MM2Latent runs faster at inference than the GAN- and diffusion-based methods it is compared with.
3. **Supporting real-image editing**: MM2Latent can edit real images, not only synthetic ones.

### Technical Details

MM2Latent combines the following components:

- **StyleGAN2** is the image generator. Its semantically disentangled W latent space is used to generate high-quality face images.
- **FaRL** is used for text encoding. FaRL is a joint vision-language model that aligns text and images in a shared feature space.
- **Autoencoders** handle the spatial modalities (masks, sketches, and 3DMM parameters) and extract a feature representation for each of them.
- **MappingNet (Mapping Network)** maps the multimodal inputs into the W latent space of StyleGAN, which is where the modalities are fused.

### Training and Inference

During training, MM2Latent uses FaRL image embeddings as pseudo-text embeddings, which improves the generalization ability of the model. At the inference stage, real text embeddings are used directly for multimodal image generation. For image-editing tasks, images are edited by moving along semantic directions in the W latent space, for example to make a face look older or to add a beard.

### Experimental Results

The paper evaluates MM2Latent through a series of experiments:

- **Multimodal consistency evaluation**: CLIP Score and Mask Accuracy measure how consistent the generated image is with the input text and the input mask, respectively.
- **Image quality evaluation**: The CMMD metric is used to evaluate the quality of the generated images.
- **Comparative experiments**: MM2Latent is compared with existing multimodal generation methods (such as TediGAN, Composal, UniteConquer, and Collaborative Diffusion), and the results show that it outperforms them on multiple metrics.

Overall, the paper proposes an efficient and controllable framework for multimodal face image generation and editing that addresses the listed limitations of existing methods and reports better results than the compared methods.
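
The fusion step described under Technical Details (a mapping network that takes a text embedding and a spatial-modality embedding and outputs a W latent code) can be illustrated with a toy example. This is only a sketch under stated assumptions: the summary does not specify the network architecture, so the concatenation of the two embeddings, the two-layer MLP, the leaky ReLU, the tiny dimensions, and the name `ToyMappingNet` are all hypothetical, and the weights are random rather than trained.

```python
import random


def linear(x, weights, bias):
    """Dense layer: y[j] = sum_i x[i] * weights[i][j] + bias[j]."""
    return [
        sum(xi * row[j] for xi, row in zip(x, weights)) + bias[j]
        for j in range(len(bias))
    ]


def leaky_relu(x, slope=0.2):
    """Element-wise leaky ReLU (the activation is an assumption)."""
    return [v if v > 0 else slope * v for v in x]


def init_layer(rng, n_in, n_out):
    """Random (untrained) weights of shape n_in x n_out and a zero bias."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_out)] for _ in range(n_in)]
    return weights, [0.0] * n_out


class ToyMappingNet:
    """Hypothetical stand-in for MappingNet: (text, spatial) -> W latent."""

    def __init__(self, text_dim, spatial_dim, hidden_dim, w_dim, seed=0):
        rng = random.Random(seed)
        self.text_dim = text_dim
        self.spatial_dim = spatial_dim
        self.layer1 = init_layer(rng, text_dim + spatial_dim, hidden_dim)
        self.layer2 = init_layer(rng, hidden_dim, w_dim)

    def __call__(self, text_emb, spatial_emb):
        if len(text_emb) != self.text_dim or len(spatial_emb) != self.spatial_dim:
            raise ValueError("embedding size mismatch")
        # Fuse the modalities by concatenation (an assumed fusion scheme).
        fused = list(text_emb) + list(spatial_emb)
        hidden = leaky_relu(linear(fused, *self.layer1))
        return linear(hidden, *self.layer2)


# Toy sizes; real text embeddings and W codes are far larger.
net = ToyMappingNet(text_dim=4, spatial_dim=3, hidden_dim=8, w_dim=6)
w = net([0.1, -0.2, 0.3, 0.0], [1.0, 0.5, -0.5])
print(len(w))  # 6
```

In the framework described above, the resulting code would be passed to the StyleGAN2 generator. Because inference is a single forward pass rather than an optimization loop, a design of this kind needs no per-sample tuning, which matches the fast-inference claim.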
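
Editing by navigating semantic directions in the W latent space, as described under Training and Inference, is commonly written as w' = w + α·d, where d is a direction associated with an attribute and α is the edit strength. The sketch below shows that arithmetic on toy vectors. The direction values, the normalization step, and the function names are illustrative assumptions; the summary does not say how the directions are obtained.

```python
import math


def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in vector]


def edit_latent(w, direction, strength):
    """Return w + strength * unit(direction)."""
    if len(w) != len(direction):
        raise ValueError("latent and direction must have the same length")
    unit = normalize(direction)
    return [wi + strength * di for wi, di in zip(w, unit)]


# Toy 4-dimensional latent and a made-up "age" direction.
w = [0.5, -1.0, 2.0, 0.0]
age_direction = [0.0, 3.0, 4.0, 0.0]

older = edit_latent(w, age_direction, strength=2.0)
younger = edit_latent(w, age_direction, strength=-2.0)
print(older)
print(younger)
```

A positive strength moves the code toward the attribute and a negative strength moves it away. Components of the latent outside the direction are left unchanged, which is why a disentangled latent space matters for attribute-specific edits.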
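
The two consistency metrics named under Experimental Results can be illustrated in a few lines. The summary does not give their formulas, so this sketch relies on assumptions: the CLIP Score is taken to be the cosine similarity between a text embedding and an image embedding (some implementations additionally rescale it), and Mask Accuracy is taken to be the fraction of pixels whose label in a mask parsed from the generated image matches the input mask. The embeddings and masks below are toy values, not model outputs.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def mask_accuracy(predicted, target):
    """Fraction of pixels whose class label matches between two masks."""
    flat_pred = [label for row in predicted for label in row]
    flat_target = [label for row in target for label in row]
    if len(flat_pred) != len(flat_target) or not flat_target:
        raise ValueError("masks must be non-empty and the same size")
    matches = sum(p == t for p, t in zip(flat_pred, flat_target))
    return matches / len(flat_target)


# Toy embeddings standing in for text and image features.
text_emb = [1.0, 0.0, 0.0]
image_emb = [0.0, 1.0, 0.0]
print(cosine_similarity(text_emb, text_emb))
print(cosine_similarity(text_emb, image_emb))

# Toy 2x2 semantic masks with integer class labels.
input_mask = [[0, 1], [2, 3]]
parsed_mask = [[0, 1], [2, 2]]
print(mask_accuracy(parsed_mask, input_mask))
```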