SAM-GAN: Self-Attention supporting Multi-stage Generative Adversarial Networks for text-to-image synthesis

Dunlu Peng, Wuchen Yang, Cong Liu, Shuairui Lü

DOI: https://doi.org/10.1016/j.neunet.2021.01.023

IF: 7.8

2021-06-01

Neural Networks

Abstract: Synthesizing photo-realistic images from text descriptions is a challenging task in computer vision. Although generative adversarial networks have made significant breakthroughs in this task, they still struggle to generate high-quality, visually realistic images that are consistent with the semantics of the text. Existing text-to-image methods generally work in two steps: they first generate an initial image with a rough outline and color, and then gradually refine it into a high-resolution image. One drawback of these methods is that, if the quality of the initial image is low, it is hard to generate a satisfactory high-resolution image. In this paper, we propose SAM-GAN, Self-Attention supporting Multi-stage Generative Adversarial Networks, for text-to-image synthesis. With the self-attention mechanism, the model can establish the multi-level dependence of the image and fuse the sentence- and word-level visual-semantic vectors to improve the quality of the generated image. Furthermore, a multi-stage perceptual loss is introduced to enhance the semantic similarity between the synthesized image and the real image, thus enhancing the visual-semantic consistency between text and images. To promote diversity in the generated images, a mode seeking regularization term is integrated into the model. The results of extensive experiments and ablation studies, conducted on the Caltech-UCSD Birds and Microsoft Common Objects in Context datasets, show that our model is superior to competitive models in text-to-image synthesis.
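To make the self-attention idea in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes identity query/key/value projections (a real layer would learn these) and a hypothetical residual scale `gamma`. The function names `softmax` and `self_attention` are illustrative only.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(features, gamma=1.0):
    """Apply dot-product self-attention over N feature positions.

    features: list of N vectors (each a list of C floats), e.g. the
              flattened spatial positions of an image feature map.
    gamma:    hypothetical residual scale (assumption, not from the paper).
    Returns (output, weights), where output[i] = features[i] + gamma * attended[i].
    """
    n = len(features)
    channels = len(features[0])
    # Assumption: queries, keys, and values are the raw features themselves.
    scores = [[sum(a * b for a, b in zip(q, k)) for k in features]
              for q in features]
    # Each row of weights says how much position i attends to every position j.
    weights = [softmax(row) for row in scores]
    output = []
    for i in range(n):
        attended = [sum(weights[i][j] * features[j][c] for j in range(n))
                    for c in range(channels)]
        output.append([f + gamma * a for f, a in zip(features[i], attended)])
    return output, weights


# Usage: three positions with two channels each.
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = self_attention(feats, gamma=0.5)
```

Because every position attends to every other position, the output at one location can depend on distant locations, which is the long-range dependence that convolution alone captures poorly.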

computer science, artificial intelligence, neurosciences

What problem does this paper attempt to address?

The paper addresses the problem of generating realistic images from textual descriptions. Specifically, the authors propose a new Generative Adversarial Network (GAN) model, SAM-GAN (Self-Attention supporting Multi-stage Generative Adversarial Networks), to improve the quality of text-to-image synthesis. The main objectives are:

1. **Improving initial image quality**: A self-attention mechanism in the first stage of the GAN captures long-range dependencies, which improves the quality of the initial image.
2. **Enhancing semantic consistency**: A multi-stage perceptual loss progressively optimizes the similarity between generated images and real images, not only at the pixel level but also in higher-level semantic features.
3. **Increasing diversity**: To prevent mode collapse and keep the generated images diverse, a mode-seeking regularization term is integrated into the model so that different noise vectors yield different images.

Extensive experiments and ablation studies on the Caltech-UCSD Birds 200 (CUB) and Microsoft Common Objects in Context (COCO) datasets show that the proposed model outperforms the compared methods on the text-to-image synthesis task.
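The diversity objective can be illustrated with a small sketch of a mode-seeking regularization term. It follows the general form common in the mode-seeking GAN literature: the ratio of the image distance to the noise distance is maximized, implemented here as minimizing its reciprocal. The summary does not state SAM-GAN's exact distance metrics or loss weighting, so the mean L1 distance, the `eps` constant, and the function names are assumptions.

```python
def l1_mean(a, b):
    """Mean absolute difference between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def mode_seeking_loss(img1, img2, z1, z2, eps=1e-5):
    """Penalty that is small when different noise yields different images.

    img1, img2: flattened images generated from the same text with
                noise vectors z1 and z2 respectively.
    eps:        small constant (assumption) that keeps the loss finite
                when the two images are identical.
    """
    noise_dist = l1_mean(z1, z2)
    if noise_dist == 0.0:
        raise ValueError("z1 and z2 must differ")
    # Ratio of how far apart the images are to how far apart the noises are.
    ratio = l1_mean(img1, img2) / noise_dist
    # Minimizing the reciprocal pushes the generator to increase the ratio.
    return 1.0 / (ratio + eps)


# Usage: a collapsed generator ignores the noise, a diverse one does not.
z_a, z_b = [0.0, 0.0], [1.0, 1.0]
collapsed = mode_seeking_loss([0.2, 0.2], [0.2, 0.2], z_a, z_b)
diverse = mode_seeking_loss([0.0, 0.0], [1.0, 1.0], z_a, z_b)
```

In this toy case the collapsed generator receives a much larger penalty than the diverse one, which is the pressure that discourages mode collapse.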

SIMGAN: Photo-Realistic Semantic Image Manipulation Using Generative Adversarial Networks

Specific Diverse Text-to-Image Synthesis Via Exemplar Guidance

Diversified text-to-image generation via deep mutual information estimation

DM-GAN: Dynamic Memory Generative Adversarial Networks for Text-to-Image Synthesis

Multi-Sentence Auxiliary Adversarial Networks for Fine-Grained Text-to-Image Synthesis

SAW-GAN: Multi-granularity Text Fusion Generative Adversarial Networks for text-to-image generation

DMF-GAN: Deep Multimodal Fusion Generative Adversarial Networks for Text-to-Image Synthesis

DGattGAN: Cooperative Up-Sampling Based Dual Generator Attentional GAN on Text-to-Image Synthesis

Word self-update contrastive adversarial networks for text-to-image synthesis

DF-GAN: Deep Fusion Generative Adversarial Networks for Text-to-Image Synthesis

Perceptual Pyramid Adversarial Networks for Text-to-Image Synthesis

Customizable GAN: Customizable Image Synthesis Based on Adversarial Learning

Object-driven Text-to-Image Synthesis via Adversarial Training

Dual Attention GANs for Semantic Image Synthesis

Adaptive Forgetting, Drafting and Comprehensive Guiding: Text-to-Image Synthesis with Hierarchical Generative Adversarial Networks

Exploring Sparse MoE in GANs for Text-conditioned Image Synthesis

R-GAN: Exploring Human-like Way for Reasonable Text-to-Image Synthesis via Generative Adversarial Networks

Vision-Language Matching for Text-to-Image Synthesis via Generative Adversarial Networks

Masked cross-attention and multi-head channel attention guiding single-stage generative adversarial networks for text-to-image generation

DSE-GAN: Dynamic Semantic Evolution Generative Adversarial Network for Text-to-Image Generation