TF-GAN: Text Feature Fusion GAN for Text-to-Image Generation

Xiaoyan Jiang, Zhijun Fang, Jize Chen, Hamido Fujita
DOI: https://doi.org/10.2139/ssrn.4280052
2022-01-01
SSRN Electronic Journal
Abstract: Generating high-resolution, realistic images from text descriptions is a challenging topic in computer vision. Most existing text-to-image generation methods follow the multi-stage generative adversarial network (GAN) framework, which can produce relatively high-resolution images. However, the image quality at each stage relies heavily on the images generated in the previous stages. Moreover, state-of-the-art text-to-image generators do not guarantee semantic consistency between the text description and the generated image. To solve these problems, we propose a novel architecture called Text Feature Fusion GAN (TF-GAN), which emphasizes both local word features and the global sentence feature. The sentence fusion attention mechanism (SFAttn) extracts keywords from the text description to optimize image features in the early stages and to provide fine-grained details for the later-stage images. The conditional fusion block (CFBlock) constrains the generated images at the global semantic level for deep information fusion. Multiple fusions in CFBlock make the network more non-linear, so the model can fuse the sentence feature and the image features more deeply and improve the semantic consistency of the generated images. Extensive experiments and comparisons with other state-of-the-art methods on the Caltech-UCSD Birds 200 dataset and the Microsoft Common Objects in Context dataset show that our generated images are more photo-realistic and closer to the text descriptions.