Abstract: Generating human portraits is a hot topic in the image generation area, e.g. mask-to-face generation and text-to-face generation. However, these unimodal generation methods lack controllability in image generation. Controllability can be enhanced by exploring the advantages and complementarities of various modalities. For instance, we can utilize the advantages of text in controlling diverse attributes and masks in controlling spatial locations. Current state-of-the-art methods in multimodal generation face limitations due to their reliance on extensive hyperparameters, manual operations during the inference stage, substantial computational demands during training and inference, or inability to edit real images. In this paper, we propose a practical framework, MM2Latent, for multimodal image generation and editing. We use StyleGAN2 as our image generator, FaRL for text encoding, and train autoencoders for spatial modalities like mask, sketch and 3DMM. We propose a strategy that involves training a mapping network to map the multimodal input into the w latent space of StyleGAN. The proposed framework 1) eliminates hyperparameters and manual operations in the inference stage, 2) ensures fast inference speeds, and 3) enables the editing of real images. Extensive experiments demonstrate that our method exhibits superior performance in multimodal image generation, surpassing recent GAN- and diffusion-based methods. Also, it proves effective in multimodal image editing and is faster than GAN- and diffusion-based methods. We make the code publicly available at: https://github.com/Open-Debin/MM2Latent

What problem does this paper attempt to address?

This paper aims to improve the controllability and efficiency of multimodal face image generation and editing. The authors point out that existing unimodal generation methods (such as text-based or mask-based generation) offer limited control over the generation process. Existing multimodal methods combine information from several modalities, but they rely on extensive hyperparameter tuning or manual operations at the inference stage, have high computational requirements during training and inference, or cannot edit real images.

To address these problems, the paper proposes a new framework named **MM2Latent**, which targets three goals:

1. **Eliminating hyperparameter tuning and manual operation**: MM2Latent requires neither hyperparameter tuning nor manual operation at the inference stage, which simplifies its use.
2. **Ensuring fast inference speed**: MM2Latent achieves faster inference than the GAN-based and diffusion-based methods it is compared with.
3. **Supporting real-image editing**: MM2Latent can edit real images, not just synthetic ones.

### Technical Details

To achieve these goals, MM2Latent combines the following components:

- **StyleGAN2** is used as the image generator, taking advantage of its semantically disentangled W latent space to generate high-quality face images.
- **FaRL** is used for text encoding. FaRL is a joint vision-language model that aligns text and images in the same feature space.
- **Autoencoders** are trained for the spatial modalities (such as masks, sketches, and 3DMM) and extract feature representations of these modalities.
- **MappingNet (Mapping Network)** maps the multimodal inputs into the W latent space of StyleGAN, thereby fusing the modalities.

### Training and Inference

During training, MM2Latent uses FaRL image embeddings to generate pseudo-text embeddings, which enhances the generalization ability of the model. At the inference stage, real text embeddings can be used directly for multimodal image generation. For image-editing tasks, images can be edited by navigating semantic directions in the W latent space, for example to make a face look older or to add a beard.

### Experimental Results

The paper verifies the effectiveness of MM2Latent through a series of experiments:

- **Multimodal consistency evaluation**: CLIP Score and Mask Accuracy are used to evaluate how consistent the generated image is with the text and with the mask, respectively.
- **Image quality evaluation**: The CMMD metric is used to evaluate the quality of the generated images.
- **Comparative experiments**: MM2Latent is compared with existing multimodal generation methods (such as TediGAN, Composal, UniteConquer, and Collaborative Diffusion), and the results show that it outperforms them on multiple metrics.

In summary, the paper proposes an efficient and controllable framework for multimodal face image generation and editing that addresses the limitations of existing methods listed above and outperforms the compared methods on the reported metrics.
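Two of the ideas above are simple enough to illustrate with a toy example: editing by moving a latent code along a semantic direction, and scoring mask consistency. The following is a minimal sketch, not the authors' implementation. The function names `edit_latent` and `mask_accuracy`, the 4-dimensional toy latent, the `age_direction` vector, and the definition of mask accuracy as the fraction of pixels with matching class labels are all assumptions made for illustration.

```python
from typing import List, Sequence


def edit_latent(w: Sequence[float], direction: Sequence[float],
                strength: float) -> List[float]:
    """Move a latent code along a semantic direction: w' = w + strength * d / ||d||.

    Hypothetical helper; in a real pipeline `w` would be a StyleGAN W-space code
    and `direction` a learned attribute direction (e.g. age).
    """
    if len(w) != len(direction):
        raise ValueError("w and direction must have the same length")
    norm = sum(d * d for d in direction) ** 0.5
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [wi + strength * di / norm for wi, di in zip(w, direction)]


def mask_accuracy(target: Sequence[Sequence[int]],
                  predicted: Sequence[Sequence[int]]) -> float:
    """Fraction of pixels whose class label matches between two label maps.

    Assumed definition: `target` is the conditioning mask and `predicted` is a
    segmentation of the generated image (produced elsewhere by a face parser).
    """
    if len(target) != len(predicted):
        raise ValueError("masks must have the same number of rows")
    total = 0
    correct = 0
    for row_t, row_p in zip(target, predicted):
        if len(row_t) != len(row_p):
            raise ValueError("mask rows must have the same length")
        for t, p in zip(row_t, row_p):
            total += 1
            correct += int(t == p)
    if total == 0:
        raise ValueError("masks must not be empty")
    return correct / total


if __name__ == "__main__":
    w = [0.5, -1.0, 2.0, 0.0]               # toy latent code
    age_direction = [0.0, 3.0, 4.0, 0.0]    # hypothetical "older" direction
    print(edit_latent(w, age_direction, strength=2.0))

    input_mask = [[0, 1], [2, 2]]           # toy 2x2 label maps
    parsed_mask = [[0, 1], [2, 3]]
    print(mask_accuracy(input_mask, parsed_mask))
```

Normalizing the direction makes `strength` the step length in latent space, so the same value has a comparable effect across directions of different magnitude. Passing a negative `strength` moves the code the opposite way along the same direction.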

MM2Latent: Text-to-facial image generation and editing in GANs with multimodal assistance
