Cross-modal Feature Alignment Based Hybrid Attentional Generative Adversarial Networks for Text-to-image Synthesis

Qingrong Cheng, Xiaodong Gu
DOI: https://doi.org/10.1016/j.dsp.2020.102866
IF: 2.92
2020-01-01
Digital Signal Processing
Abstract: With the development of generative models, image synthesis has become a research hotspot. This paper presents a novel Cross-modal Feature Alignment based Hybrid Attentional Generative Adversarial Network (CFA-HAGAN) for text-to-image synthesis. It consists of two main steps: text-image encoding and text-to-image synthesis. Text-image encoding learns a Cross-modal Feature Alignment Model (CFAM), which adopts a fine-grained attentional network to learn aligned features of the original modalities. The feature alignment space serves as the transitional space in the whole process. Then, the Hybrid Attentional Generative Adversarial Network (HAGAN) learns the inverse mapping from the encoded text feature to the original image. Specifically, the hybrid attention block consists of a text-image cross-modal attention mechanism and an image self-attention mechanism. Cross-modal attention makes the synthesized image fine-grained by adding word-level information as additional supervision. Self-attention addresses the long-range dependency problem among image sub-region features when synthesizing images from hidden features. Although GANs perform well on many tasks, they are notoriously difficult to train. With spectral normalization, the discriminators satisfy a 1-Lipschitz constraint, which makes their training more stable than that of the original GANs. In quantitative and qualitative comparisons with many state-of-the-art methods, the experimental results show that the proposed method achieves better performance in both evaluation metrics and visual quality. In addition, the experimental section presents attention visualization, an ablation study, and a generalization ability analysis to show the effectiveness of the proposed method.
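The abstract does not describe how spectral normalization is implemented in the discriminators. As an illustration only, the following minimal sketch assumes the common power-iteration approach: estimate the largest singular value of a weight matrix and divide the weights by it, so that the resulting linear map is 1-Lipschitz in the L2 norm. The function names (`spectral_norm`, `spectrally_normalize`) and the toy matrix are hypothetical and are not taken from the paper.

```python
import math
import random


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def transpose(W):
    """Return the transpose of W as a list of rows."""
    return [list(col) for col in zip(*W)]


def l2_normalize(v, eps=1e-12):
    """Scale v to unit L2 norm (eps guards against division by zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / (norm + eps) for x in v]


def spectral_norm(W, n_iters=50, seed=0):
    """Estimate the largest singular value of W by power iteration."""
    rng = random.Random(seed)
    Wt = transpose(W)
    # Random starting vector in the output space of W.
    u = l2_normalize([rng.gauss(0.0, 1.0) for _ in range(len(W))])
    for _ in range(n_iters):
        v = l2_normalize(matvec(Wt, u))  # approximate right singular vector
        u = l2_normalize(matvec(W, v))   # approximate left singular vector
    # sigma is approximated by u^T W v.
    return sum(ui * x for ui, x in zip(u, matvec(W, v)))


def spectrally_normalize(W, n_iters=50):
    """Divide W by its estimated spectral norm (assumed SN scheme)."""
    sigma = spectral_norm(W, n_iters)
    return [[w / sigma for w in row] for row in W]


# Toy weight matrix (hypothetical); its singular values are 3 and 1.
W = [[3.0, 0.0], [0.0, 1.0]]
sigma_before = spectral_norm(W)
W_sn = spectrally_normalize(W)
sigma_after = spectral_norm(W_sn)
```

In this toy case the normalized matrix has a largest singular value of approximately 1, which is the per-layer property that the 1-Lipschitz constraint relies on. Deep learning frameworks typically run one power-iteration step per training update rather than many steps at once.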