An improved StyleGAN-based TextToFace model with Local-Global information Fusion

Qi Guo, Xiaodong Gu
DOI: https://doi.org/10.1016/j.eswa.2024.123698
IF: 8.5
2024-04-15
Expert Systems with Applications
Abstract: TextToFace is an essential and challenging task in computer vision that aims to synthesize realistic face images from a given text description. Most current generative networks tailored for face synthesis rely on global-level semantic embeddings based on sentence representations. This approach tends to overlook fine-grained details in the text, often resulting in a loss of the facial specifics it describes and reduced diversity in the generated images. To address this limitation, we introduce an improved StyleGAN-based TextToFace model with Local-Global information Fusion (LGiF), designed to exploit fine-grained semantic information to refine facial feature generation. The BERT encoder is used to embed textual information into the latent space of StyleGAN, facilitating the automated learning of prominent facial attributes. We further design an Attention-based Semantic Mapping Network that both enriches facial diversity and enhances the fidelity of the synthesized faces. A concurrent Similarity-based Classification Network determines the global cross-modal similarity, ensuring consistent identity representation. StyleGAN serves as the face generator to synthesize high-quality faces. Comprehensive experiments on the Multi-Modal CelebA-HQ dataset show that our method outperforms the baseline. LGiF also competes favorably with contemporary state-of-the-art techniques, achieving strong results on both the FSS and MS-SSIM metrics.
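The abstract does not say how word-level (local) and sentence-level (global) text features are combined, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes scaled dot-product attention over word embeddings with the sentence embedding as the query, followed by concatenation and a linear projection into a latent code. All names (`attention_pool`, `fuse_local_global`, `projection`), the toy dimensions, and the random stand-in embeddings are hypothetical; a real system would take its embeddings from BERT and feed the result to a StyleGAN generator.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(word_embs, sentence_emb):
    """Weight each word embedding by its scaled dot-product score
    against the sentence embedding (assumed attention form)."""
    dim = len(sentence_emb)
    scale = math.sqrt(dim)
    weights = softmax([dot(w, sentence_emb) / scale for w in word_embs])
    pooled = [sum(wt * w[i] for wt, w in zip(weights, word_embs))
              for i in range(dim)]
    return pooled, weights

def fuse_local_global(word_embs, sentence_emb, projection):
    """Concatenate the pooled local vector with the global vector,
    then apply a linear projection (hypothetical fusion step)."""
    local, _ = attention_pool(word_embs, sentence_emb)
    fused = local + sentence_emb  # list concatenation, length 2 * dim
    return [dot(row, fused) for row in projection]

def cosine_similarity(a, b):
    """Cosine similarity, a common choice for cross-modal matching."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy data: random vectors stand in for BERT outputs.
rng = random.Random(0)
dim, latent_dim, n_words = 8, 4, 5
word_embs = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_words)]
sentence_emb = [sum(w[i] for w in word_embs) / n_words for i in range(dim)]
projection = [[rng.gauss(0, 0.1) for _ in range(2 * dim)]
              for _ in range(latent_dim)]

latent = fuse_local_global(word_embs, sentence_emb, projection)
_, weights = attention_pool(word_embs, sentence_emb)
```

Here `weights` shows which words dominate the local vector, and `cosine_similarity` illustrates the kind of score a similarity-based matching step could compute between a text embedding and an image embedding of the same dimension.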
computer science, artificial intelligence; engineering, electrical & electronic; operations research & management science
What problem does this paper attempt to address?