Text-to-Image Generation with Multiscale Semantic Context-Aware Generative Adversarial Networks

Pei Dong, Lei Wu, Lei Meng, Xiangxu Meng
DOI: https://doi.org/10.1007/978-981-97-5615-5_16
2024-01-01
Abstract: Synthesizing complex images from textual descriptions presents a significant challenge. Relying on massive training data and large model sizes, recent diffusion, autoregressive, and GAN models have made significant progress in synthesizing photo-realistic images. However, the high compute budget and hardware requirements that come with such data and model sizes limit how flexibly these models can be deployed. This paper introduces Multiscale Semantic Context-Aware Generative Adversarial Networks (MSCA-GAN), which achieves strong text alignment with limited data while balancing generation quality and efficiency. The proposed MSCA-GAN incorporates new modules for textual semantic injection, delivery, and validation. Specifically, the Semantic Adaptive Affine Fusion (SAAF) module dynamically adjusts the expression weights of textual semantic information to align with the feature generation process, from global structure to fine detail. Furthermore, the CrossBlock Context Aware Encoding (CCAE) module explicitly establishes semantic context across different synthesis blocks during local feature delivery. Finally, MSCA-GAN introduces an additional CLIP guidance term to verify the semantic consistency of local features at various scales. MSCA-GAN is pre-trained on the CC3M and CC12M datasets, which contain only limited data. Extensive experiments confirm that MSCA-GAN performs competitively in terms of image quality and generation efficiency, both quantitatively and qualitatively.
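The abstract does not give the internals of SAAF, so the following is only a minimal sketch of the general idea it builds on: text-conditioned affine modulation, where a sentence embedding predicts a per-channel scale and shift that are applied to image features. The function names, the plain linear projections, and all dimensions are assumptions made for illustration; this is not the authors' module and omits whatever adaptive weighting SAAF adds.

```python
import random


def linear(weights, bias, vec):
    """Dense layer: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def text_conditioned_affine(features, text_emb, w_gamma, b_gamma, w_beta, b_beta):
    """Scale and shift each feature channel with parameters predicted from text.

    features: list of C channels, each a flat list of spatial activations.
    text_emb: sentence embedding of length D.
    All parameter names are hypothetical, chosen for this sketch.
    """
    gamma = linear(w_gamma, b_gamma, text_emb)  # per-channel scale, length C
    beta = linear(w_beta, b_beta, text_emb)     # per-channel shift, length C
    return [[g * x + b for x in channel]
            for channel, g, b in zip(features, gamma, beta)]


def demo():
    """Run the modulation on random toy inputs (sizes are illustrative)."""
    rng = random.Random(0)
    channels, text_dim, spatial = 3, 4, 6
    features = [[rng.gauss(0, 1) for _ in range(spatial)] for _ in range(channels)]
    text_emb = [rng.gauss(0, 1) for _ in range(text_dim)]
    w_gamma = [[rng.gauss(0, 0.1) for _ in range(text_dim)] for _ in range(channels)]
    w_beta = [[rng.gauss(0, 0.1) for _ in range(text_dim)] for _ in range(channels)]
    # Scale bias of 1 and shift bias of 0: a common starting point, assumed here.
    return text_conditioned_affine(
        features, text_emb, w_gamma, [1.0] * channels, w_beta, [0.0] * channels
    )
```

In a multiscale generator, a layer of this kind would typically be applied at each synthesis block so that the same text embedding conditions features at every resolution.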