Self-Modulated Feature Fusion GAN for Text-to-Image Synthesis

Wenhong Cai, Jinhai Xiang, Bin Hu
DOI: https://doi.org/10.1109/ICEET60227.2023.10526149
2023-01-01
Abstract: Text-to-image synthesis is a key task in cross-modal generation. It aims to generate natural, realistic images conditioned on a text description. The main challenge is to integrate text information into the image synthesis process efficiently while keeping the image semantically consistent with the text. Existing methods based on generative adversarial networks (GANs) use stacked network structures (2–3 groups of generators and discriminators) to generate high-resolution images in stages, adding text information at different stages of generation. Most of these methods suffer from issues such as interdependence between generators and an unwarranted proliferation of network parameters. To address these limitations, we propose a generative adversarial network architecture with a single generator and a single discriminator. The Adaptive Semantic Image feature Fusion module in the generator effectively compensates for the lack of fine-grained information caused by using a single generator and discriminator. In addition, to stabilize training and enhance the semantic consistency between the text and the synthesized image, local spectral normalization is used in the discriminator, and a contrastive loss is added to strengthen the discriminator's supervision of the generator. Extensive experiments on the CUB and COCO datasets demonstrate that the proposed model outperforms existing models.
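The abstract does not specify the form of the contrastive loss, so the following is only a minimal sketch of a generic InfoNCE-style image-to-text contrastive loss, not the authors' formulation. The function names (`cosine`, `contrastive_loss`), the temperature value, and the toy embeddings are all assumptions made for illustration. It shows the general idea: an image embedding is pulled toward its own caption's embedding and pushed away from the other captions in the batch.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors (hypothetical helper)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(image_embs, text_embs, temperature=0.1):
    """Mean image-to-text InfoNCE loss over a batch of paired embeddings.

    image_embs[i] and text_embs[i] are assumed to form the matched pair;
    every other text in the batch acts as a negative for image i.
    The temperature of 0.1 is an illustrative choice, not a value from the paper.
    """
    n = len(image_embs)
    total = 0.0
    for i in range(n):
        logits = [cosine(image_embs[i], text_embs[j]) / temperature
                  for j in range(n)]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the matched text.
        total += log_denom - logits[i]
    return total / n


# Toy 2-D embeddings: each image is close to its own caption.
images = [[1.0, 0.0], [0.0, 1.0]]
matched = [[0.9, 0.1], [0.1, 0.9]]
swapped = [matched[1], matched[0]]  # captions deliberately mismatched

loss_matched = contrastive_loss(images, matched)
loss_swapped = contrastive_loss(images, swapped)
```

With these toy inputs, the loss for correctly paired embeddings is lower than for the swapped pairing, which is the signal such a term would feed back during training.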