Generating Distinctive Facial Images from Natural Language Descriptions Via Spatial Map Fusion

Qi Guo, Xiaodong Gu
DOI: https://doi.org/10.1007/978-3-031-44192-9_7
2023-01-01
Abstract: Due to the abstract nature of language, creating accurate visual representations of faces from textual descriptions is a complex task. To overcome this challenge, we propose a novel approach called the Spatial-Text Semantic Fusion GAN (STSF-GAN), which leverages multiple descriptions to generate distinctive facial features. Our method includes a new module, the Spatial Map Merge module, which predicts masks that serve as spatial conditions for refining image feature maps based on textual semantics. Additionally, we introduce an attention mechanism, the Local Semantic Attention module, which uses the potential distribution of each word in the description to compute local attention. Our experiments on the Multi-Modal CelebA-HQ and CelebAText-HQ datasets demonstrate the effectiveness of the proposed approach.
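The abstract does not give the equations of the Spatial Map Merge module, so the following is only a minimal sketch of the general idea of using a predicted mask as a spatial condition for refining a feature map. It assumes that the mask is a sigmoid over per-pixel logits and that the text semantics enter as a per-channel scale and shift; the function name `refine_with_mask` and the parameters `scale` and `shift` are hypothetical and are not taken from the paper.

```python
import math


def sigmoid(x):
    """Map a real-valued logit to the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def refine_with_mask(features, mask_logits, scale, shift):
    """Blend text-modulated features into a feature map under a soft mask.

    features:    C channels, each an H x W nested list of floats.
    mask_logits: H x W nested list; sigmoid gives the soft spatial mask.
    scale/shift: length-C lists, assumed (hypothetically) to be derived
                 from a text embedding.
    """
    refined = []
    for c, channel in enumerate(features):
        out_channel = []
        for i, row in enumerate(channel):
            out_row = []
            for j, value in enumerate(row):
                m = sigmoid(mask_logits[i][j])
                # Text-conditioned version of this feature value.
                modulated = scale[c] * value + shift[c]
                # Where the mask is near 1, use the modulated value;
                # where it is near 0, keep the original feature.
                out_row.append(m * modulated + (1.0 - m) * value)
            out_channel.append(out_row)
        refined.append(out_channel)
    return refined
```

In this sketch, positions where the mask is close to zero pass through unchanged, so the text-driven modulation is confined to the regions the mask selects.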