ISF-GAN: Imagine, Select, and Fuse with GPT-Based Text Enrichment for Text-to-Image Synthesis
Yefei Sheng,Ming Tao,Jie Wang,Bing-Kun Bao
DOI: https://doi.org/10.1145/3650033
2024-03-28
Abstract:Text-to-Image synthesis aims to generate an accurate and semantically consistent image from a given text description. However, it is difficult for existing generative methods to generate semantically complete images from a single piece of text. Some works try to expand the input text to multiple captions via retrieving similar descriptions of the input text from the training set, but still fail to fill in missing image semantics. In this paper, we propose a GAN-based approach to Imagine, Select, and Fuse for Text-to-Image synthesis, named ISF-GAN. The proposed ISF-GAN contains Imagine Stage and Select and Fuse Stage to solve the above problems. First, the Imagine Stage proposes a text completion and enrichment module. This module guides a GPT-based model to enrich the text expression beyond the original dataset. Second, the Select and Fuse Stage selects qualified text descriptions, and then introduces a cross-modal attentional mechanism to interact these different sentences with the image features at different scales. In short, our proposed model enriches the input text information for completing missing semantics and introduces a cross-modal attentional mechanism to maximize the utilization of enriched text information to generate semantically consistent images. Experimental results on CUB, Oxford-102, and CelebA-HQ datasets prove the effectiveness and superiority of the proposed network. Code is available at https://github.com/Feilingg/ISF-GAN.
computer science, information systems, theory & methods, software engineering