Text-to-face Synthesis Based on Facial Landmarks Prediction
Kun Wang, Lei Chen, Biwei Cao, Bo Liu, Jiuxin Cao
DOI: https://doi.org/10.1007/s00138-024-01624-1
IF: 2.983
2024-01-01
Machine Vision and Applications
Abstract: The human face, being one of the most prominent physical features, plays a crucial role in appearance description and recognition. Consequently, text-to-face synthesis has garnered increasing interest in the research community, with applications in criminal investigation, image editing, and more. Compared to text-to-image synthesis, generating facial images from text requires more specialized knowledge due to the subjectivity and diversity of facial descriptions, which involve more fine-grained appearance features. In this paper, we propose a text-to-face synthesis model based on Facial Landmarks Prediction (FLP-GAN). Specifically, we design two foundational submodules to facilitate the generation task. First, a co-attention mechanism is employed to pretrain the image and text encoders to extract features related to facial information. Second, a facial landmarks prediction model is proposed to generate face segment maps based on descriptive text, providing facial semantic prior knowledge for the subsequent face synthesis process. Conditioned on the semantic features obtained from the submodules, we construct the text-to-face synthesis model, which incorporates a memory network and a segment fuse layer to highlight important text information and refine the features. Additionally, a multi-stage refinement process is designed to generate high-resolution face images. Experimental results on the Face2Text dataset demonstrate that our FLP-GAN model outperforms the state-of-the-art methods in both qualitative and quantitative evaluations. Specifically, our model achieved a 22.7
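
The abstract does not specify how the co-attention between the text and image encoders is computed. As a rough illustration only, the sketch below shows one generic form of co-attention: each word feature attends over image-region features, and each region feature attends over word features, using dot-product scores and a softmax. The function names (`attend`, `co_attention`), the dot-product scoring, and the toy feature vectors are all assumptions made for this example and are not taken from the FLP-GAN paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend(queries, keys):
    """For each query, return the softmax-weighted average of the key vectors.

    Scores are plain dot products (an assumption of this sketch).
    """
    dim = len(keys[0])
    contexts = []
    for q in queries:
        weights = softmax([dot(q, k) for k in keys])
        context = [sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)]
        contexts.append(context)
    return contexts


def co_attention(word_feats, region_feats):
    """Hypothetical two-way attention between word and image-region features.

    Returns (region context per word, word context per region).
    """
    region_ctx_per_word = attend(word_feats, region_feats)
    word_ctx_per_region = attend(region_feats, word_feats)
    return region_ctx_per_word, word_ctx_per_region


# Toy data: two word vectors and three region vectors of dimension 2.
words = [[1.0, 0.0], [0.0, 1.0]]
regions = [[10.0, 0.0], [0.0, 10.0], [0.0, 0.0]]
region_ctx, word_ctx = co_attention(words, regions)
```

In a pretraining setup of this kind, the attended contexts would typically feed a matching loss that pulls paired text and image features together; that loss is omitted here because the abstract gives no details about it.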