SAM-GAN: Self-Attention supporting Multi-stage Generative Adversarial Networks for text-to-image synthesis

Dunlu Peng, Wuchen Yang, Cong Liu, Shuairui Lü
DOI: https://doi.org/10.1016/j.neunet.2021.01.023
IF: 7.8
2021-06-01
Neural Networks
Abstract: Synthesizing photo-realistic images from text descriptions is a challenging task in computer vision. Although generative adversarial networks have made significant breakthroughs in this task, they still struggle to generate high-quality, visually realistic images that are consistent with the semantics of the text. Existing text-to-image methods generally work in two steps: they first generate an initial image with a rough outline and color, and then gradually produce a high-resolution image from that initial image. One drawback of these methods is that, if the quality of the initial image is low, it is hard to generate a satisfactory high-resolution image. In this paper, we propose SAM-GAN, Self-Attention supporting Multi-stage Generative Adversarial Networks, for text-to-image synthesis. With the self-attention mechanism, the model can establish multi-level dependencies within the image and fuse the sentence- and word-level visual-semantic vectors to improve the quality of the generated image. Furthermore, a multi-stage perceptual loss is introduced to enhance the semantic similarity between the synthesized image and the real image, thus enhancing the visual-semantic consistency between text and images. To increase the diversity of the generated images, a mode seeking regularization term is integrated into the model. The results of extensive experiments and ablation studies, conducted on the Caltech-UCSD Birds and Microsoft Common Objects in Context datasets, show that our model is superior to competitive models in text-to-image synthesis.
computer science, artificial intelligence, neurosciences
What problem does this paper attempt to address?
The paper addresses the problem of generating realistic images from textual descriptions. Specifically, the authors propose a new Generative Adversarial Network (GAN) model, SAM-GAN (Self-Attention supporting Multi-stage Generative Adversarial Networks), to improve the quality of text-to-image synthesis. The main objectives are:

1. **Improving initial image quality**: A self-attention mechanism in the first stage of the GAN captures long-range dependencies, which improves the quality of the initial image.
2. **Enhancing semantic consistency**: A multi-stage perceptual loss progressively increases the similarity between generated and real images, not only at the pixel level but also at the level of higher-level semantic features.
3. **Increasing diversity**: To prevent mode collapse and keep the generated images diverse, a mode-seeking regularization term makes fuller use of the noise vectors so that different noise inputs yield different images.

Extensive experiments and ablation studies on the Caltech-UCSD Birds 200 (CUB) and Microsoft Common Objects in Context (COCO) datasets demonstrate that the proposed model outperforms existing methods in the text-to-image synthesis task.
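
The mode-seeking idea in objective 3 can be illustrated with a small sketch. The text here says only that such a term is integrated into the model, so the details below are assumptions: the term is written in the commonly used ratio form (distance between two generated images divided by the distance between the noise vectors that produced them), both distances are mean absolute differences, and `toy_generator` is a hypothetical stand-in for a real text-conditioned generator. It is not the authors' implementation.

```python
import math


def l1_distance(a, b):
    """Mean absolute difference between two equally sized flat vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def mode_seeking_loss(img1, img2, z1, z2, eps=1e-5):
    """Loss to minimize: inverse of the image-distance / noise-distance ratio.

    A generator that maps different noise vectors to nearly identical
    images (mode collapse) has a small ratio and therefore a large loss.
    The ratio form and the L1 distances are assumptions of this sketch.
    """
    ratio = l1_distance(img1, img2) / (l1_distance(z1, z2) + eps)
    return 1.0 / (ratio + eps)


def toy_generator(z, gain):
    """Hypothetical stand-in for G(text, z); `gain` controls how strongly
    the output depends on the noise vector."""
    return [math.tanh(gain * v) for v in z]


# Two fixed noise vectors for the same (implicit) text condition.
z1 = [0.5, -1.0, 0.25, 2.0]
z2 = [-0.5, 1.0, 0.75, -2.0]

# A generator that almost ignores the noise versus one that uses it.
collapsed = mode_seeking_loss(
    toy_generator(z1, 0.01), toy_generator(z2, 0.01), z1, z2
)
diverse = mode_seeking_loss(
    toy_generator(z1, 1.0), toy_generator(z2, 1.0), z1, z2
)

print("loss when noise is nearly ignored:", collapsed)
print("loss when noise is used:", diverse)
```

In this toy setting the generator that nearly ignores its noise input receives the larger penalty. In training, a term like this would be added to the generator objective with a weighting coefficient, and that coefficient is not given in the text above.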