DE-GAN: Text-to-image Synthesis with Dual and Efficient Fusion Model

Bin Jiang, Weiyuan Zeng, Chao Yang, Renjun Wang, Bolin Zhang
DOI: https://doi.org/10.1007/s11042-023-16377-8
IF: 2.577
2023-01-01
Multimedia Tools and Applications
Abstract: Generating diverse and plausible images conditioned on given captions is an attractive but challenging task. While many existing studies have presented impressive results, text-to-image synthesis still suffers from two problems. (1) Noise is injected only at the very beginning of generation, which hurts the diversity of the final results. (2) Most previous models rely on non-local-like spatial attention mechanisms to introduce fine-grained word-level information into the generation process, which makes them too storage-consuming for mobile and embedded applications. In this paper, we propose a novel Dual and Efficient Fusion Generative Adversarial Network (DE-GAN) to address these issues. To balance the diversity and fidelity of generated images, DE-GAN uses Dual Injection Blocks to inject noise and text embeddings into the model simultaneously, multiple times during the generation process. In addition, an efficient condition channel attention module is designed in DE-GAN to capture the correlations between the text and image modalities and to guide the network in refining image features with as little storage overhead as possible, enabling the model to adapt to resource-constrained applications. Comprehensive experiments on two benchmark datasets demonstrate that DE-GAN efficiently generates more diverse and photo-realistic images than previous methods.
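The abstract does not give the internals of the condition channel attention module, so the following is only a minimal sketch of the general idea of text-conditioned channel attention, not the authors' implementation. It assumes a squeeze-and-excitation style gate: each channel is averaged to one scalar, the pooled vector is concatenated with a sentence embedding, and a single linear layer with a sigmoid produces one weight per channel. The function name `condition_channel_attention`, the single-layer gate, and all shapes are hypothetical choices made for illustration.

```python
import math
import random


def condition_channel_attention(features, text_emb, weights, bias):
    """Hypothetical text-conditioned channel gate (illustrative only).

    features: list of C channels, each an H x W list of lists of floats.
    text_emb: list of D floats (assumed sentence embedding).
    weights:  C x (C + D) list of lists (assumed single linear layer).
    bias:     list of C floats.
    Returns (scaled_features, gates).
    """
    # Squeeze: global average pooling yields one scalar per channel.
    pooled = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in features
    ]
    # Fuse image statistics with the text condition.
    joint = pooled + list(text_emb)
    # Excite: linear layer followed by a sigmoid gives a gate in (0, 1).
    gates = []
    for w_row, b in zip(weights, bias):
        z = sum(w * x for w, x in zip(w_row, joint)) + b
        gates.append(1.0 / (1.0 + math.exp(-z)))
    # Rescale every spatial position of a channel by that channel's gate.
    scaled = [
        [[g * v for v in row] for row in channel]
        for channel, g in zip(features, gates)
    ]
    return scaled, gates


# Small demo with random inputs (C=4 channels, 8x8 map, D=6 text dims).
random.seed(0)
C, H, W, D = 4, 8, 8, 6
feats = [[[random.random() for _ in range(W)] for _ in range(H)] for _ in range(C)]
text = [random.uniform(-1, 1) for _ in range(D)]
w = [[random.uniform(-0.1, 0.1) for _ in range(C + D)] for _ in range(C)]
b = [0.0] * C
out, gates = condition_channel_attention(feats, text, w, b)
print("number of gate values stored:", len(gates))
```

In this sketch the only attention state is one gate per channel, so its size depends on the channel count C and not on the spatial resolution. This is one plausible reading of why a channel-wise design could need less storage than a spatial attention map computed between every image position and every word.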