Face Synthesis from Visual Attributes via Sketch using Conditional VAEs and GANs

Xing Di, Vishal M. Patel
DOI: https://doi.org/10.48550/arXiv.1801.00077
2017-12-30
Abstract: Automatic synthesis of faces from visual attributes is an important problem in computer vision and has wide applications in law enforcement and entertainment. With the advent of deep generative convolutional neural networks (CNNs), attempts have been made to synthesize face images from attributes and text descriptions. In this paper, we take a different approach, where we formulate the original problem as a stage-wise learning problem. We first synthesize the facial sketch corresponding to the visual attributes and then we reconstruct the face image based on the synthesized sketch. The proposed Attribute2Sketch2Face framework, which is based on a combination of deep Conditional Variational Autoencoder (CVAE) and Generative Adversarial Networks (GANs), consists of three stages: (1) Synthesis of facial sketch from attributes using a CVAE architecture, (2) Enhancement of coarse sketches to produce sharper sketches using a GAN-based framework, and (3) Synthesis of face from sketch using another GAN-based network. Extensive experiments and comparison with recent methods are performed to verify the effectiveness of the proposed attribute-based three-stage face synthesis method.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the automatic synthesis of facial images from visual attributes. Specifically, it proposes a three-stage framework named Attribute2Sketch2Face (A2S2F), based on a Conditional Variational Autoencoder (CVAE) and Generative Adversarial Networks (GANs), for generating high-quality facial images from given facial attributes. The process is divided into three stages:

1. **Attribute-to-Sketch (A2S)**:
   - Uses the CVAE architecture to generate facial sketches from visual attributes.
   - The goal of this stage is to generate rough facial sketches from texture attributes and noise vectors.
2. **Sketch-to-Sketch (S2S)**:
   - Uses a GAN-based framework to enhance the rough sketches generated in the A2S stage and produce sharper sketches.
   - This stage uses the AUDeNet (Attribute-preserving Dense UNet) generator, which combines the advantages of UNet and DenseNet to improve sketch quality.
3. **Sketch-to-Face (S2F)**:
   - Uses another GAN-based framework to generate facial images from the enhanced sketches.
   - The generator in this stage combines texture and color attributes to generate high-quality facial images.

The technology has application prospects in fields such as law enforcement and entertainment. For example, when no photograph of a suspect is available, a facial image can be generated from a description of the suspect's characteristics to assist in a criminal investigation.
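The data flow through the three stages can be sketched as a simple function composition. This is a minimal illustration only: the names `synthesize_face`, `dummy_a2s`, `dummy_s2s`, and `dummy_s2f` are hypothetical, the images are nested Python lists rather than tensors, and the dummy stages are arithmetic stand-ins for the trained CVAE decoder and GAN generators, which the summary does not specify in detail.

```python
from typing import Callable, List, Sequence

Image = List[List[float]]  # hypothetical stand-in for an image tensor


def synthesize_face(
    texture_attrs: Sequence[float],
    color_attrs: Sequence[float],
    noise: Sequence[float],
    a2s: Callable[[Sequence[float], Sequence[float]], Image],
    s2s: Callable[[Image], Image],
    s2f: Callable[[Image, Sequence[float], Sequence[float]], Image],
) -> Image:
    """Chain the three stages: attributes -> coarse sketch -> sharp sketch -> face."""
    coarse_sketch = a2s(texture_attrs, noise)   # stage 1 (A2S): CVAE decoder
    sharp_sketch = s2s(coarse_sketch)           # stage 2 (S2S): GAN generator
    return s2f(sharp_sketch, texture_attrs, color_attrs)  # stage 3 (S2F)


# Dummy stages, for illustration only. A real system would plug in trained networks.
def dummy_a2s(texture_attrs: Sequence[float], noise: Sequence[float]) -> Image:
    base = sum(texture_attrs) / len(texture_attrs)
    return [[base + n for n in noise] for _ in range(len(noise))]


def dummy_s2s(sketch: Image) -> Image:
    # Clamp values into [0, 1] as a placeholder for sketch enhancement.
    return [[min(1.0, max(0.0, v)) for v in row] for row in sketch]


def dummy_s2f(
    sketch: Image, texture_attrs: Sequence[float], color_attrs: Sequence[float]
) -> Image:
    # Scale by the mean color attribute as a placeholder for colorization.
    tint = sum(color_attrs) / len(color_attrs)
    return [[v * tint for v in row] for row in sketch]


face = synthesize_face(
    texture_attrs=[1.0, 0.0],
    color_attrs=[0.5],
    noise=[0.0, 0.25],
    a2s=dummy_a2s,
    s2s=dummy_s2s,
    s2f=dummy_s2f,
)
```

Passing the stages in as callables keeps the composition separate from any particular network definition; only the first stage consumes the noise vector, and only the last stage consumes the color attributes, matching the description above.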
### Formula Summary

- **Variational Lower Bound of CVAE**:
  \[ L_{\text{CVAE}}(x, y; \theta, \phi) = -\text{KL}(q_\phi(z | x, y) \| p_\theta(z)) + \mathbb{E}_{z \sim q_\phi(z | x, y)}[\log p_\theta(x | y, z)] \]
- **Objective Function of Conditional GAN**:
  \[ L_{\text{cGAN}}(G, D) = \mathbb{E}_{x, y \sim P_{\text{data}}(x, y)}[\log D(x, y)] + \mathbb{E}_{x \sim P_{\text{data}}(x), z \sim p_z(z)}[\log(1 - D(x, G(x, z)))] \]
- **Loss Function of A2S Stage**:
  \[ L_{\text{A2S}} = L_{\text{CVAE}}(s, a; \theta, \phi) - \lambda \, \text{KL}(q_\beta(z | n, a) \| p_\theta(z)) \]
- **Loss Function of S2S Stage**:
  \[ L = L_{\text{A}} + \lambda_1 L_1 + \lambda_2 L_{\text{perp}} \]
  where:
  - \( L_{\text{A}} \) is the adversarial loss
  - \( L_1 \) is the loss based on the L1-norm
  - \( L_{\text{perp}} \) is the perceptual loss
- **Perceptual Loss**:
  \[ L_{\text{perp}} = \| V(s_g) - V(s) \|_1 \]
  where \( V \) represents the feature representation of a certain layer of the pre-trained VGG-16 network.

With these components, the paper addresses the generation of facial images from visual attributes and reports experiments on multiple datasets to verify the effectiveness of the method.
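The loss terms above can be illustrated numerically with a few lines of plain Python. This is a sketch under stated assumptions, not the authors' implementation: it assumes the approximate posterior is a diagonal Gaussian and the prior \( p_\theta(z) \) is a standard normal (a common choice that the summary does not confirm), so the KL term has a closed form. The function names, the `feature_fn` stand-in for the VGG-16 feature map \( V \), and the weights passed to `s2s_loss` are all hypothetical.

```python
import math
from typing import Callable, Sequence


def kl_to_standard_normal(mu: Sequence[float], log_var: Sequence[float]) -> float:
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ).

    Assumption: diagonal Gaussian posterior and standard normal prior.
    """
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def cvae_lower_bound(
    mu: Sequence[float], log_var: Sequence[float], recon_log_lik: float
) -> float:
    """Single-sample estimate of L_CVAE = -KL + E[log p(x | y, z)]."""
    return -kl_to_standard_normal(mu, log_var) + recon_log_lik


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of absolute differences, i.e. the L1-norm of (a - b)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def perceptual_loss(
    feature_fn: Callable[[Sequence[float]], Sequence[float]],
    generated: Sequence[float],
    target: Sequence[float],
) -> float:
    """L_perp = || V(s_g) - V(s) ||_1, with feature_fn standing in for V."""
    return l1_distance(feature_fn(generated), feature_fn(target))


def s2s_loss(
    adversarial: float, l1: float, perceptual: float, lambda_1: float, lambda_2: float
) -> float:
    """L = L_A + lambda_1 * L_1 + lambda_2 * L_perp."""
    return adversarial + lambda_1 * l1 + lambda_2 * perceptual


# Toy usage with made-up numbers; doubling stands in for a feature extractor.
def double(v: Sequence[float]) -> list:
    return [2.0 * x for x in v]


generated, target = [1.0, 2.0], [2.0, 0.0]
total = s2s_loss(
    adversarial=1.0,
    l1=l1_distance(generated, target),
    perceptual=perceptual_loss(double, generated, target),
    lambda_1=10.0,
    lambda_2=0.5,
)
```

When the posterior equals the prior (zero mean, unit variance) the KL term vanishes, so the lower bound reduces to the reconstruction log-likelihood alone.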