Specific Diverse Text-to-Image Synthesis Via Exemplar Guidance

Ailin Li, Lei Zhao, Zhiwen Zuo, Zhizhong Wang, Wei Xing, Dongming Lu
DOI: https://doi.org/10.1109/mmul.2024.3421243
IF: 3.4911
2024-01-01
IEEE Multimedia
Abstract: This paper investigates an open research task in text-to-image synthesis: generating specific, diverse images guided by exemplars. Various conditional Generative Adversarial Networks (cGANs) have been developed to generate images conditioned on text, with added noise providing random diversity. In this paper, we aim to achieve such synthesis with guided diversity. Given a text description and an exemplar, the synthetic image should meet two requirements: 1) be realistic and closely aligned with the text description; 2) adopt the unique style elements of the exemplar that are not explicitly described in the text, so that the diversity is guided rather than random. Hence, the model should be able to align image and text features while learning specific image styles from exemplars. To this end, we design a novel end-to-end neural architecture that leverages context-aware cross-attention alignment and adversarial learning, along with a specific style retention loss, to optimize the generator for text-matching and specific diverse image synthesis. Experimental results on the CUB, Oxford-102, and CelebA datasets demonstrate that our method can synthesize specific diverse images under the guidance of various exemplars while preserving realism and semantic consistency.
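The abstract mentions cross-attention alignment between image and text features but gives no implementation details. As a rough illustration only, the sketch below shows generic scaled dot-product cross-attention, in which each image-region feature attends over word features to obtain a text context vector. It is not the paper's context-aware variant; the function names `softmax` and `cross_attention`, the toy feature vectors, and the scaling choice are all assumptions made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(regions, words):
    """Generic scaled dot-product cross-attention (hypothetical helper).

    regions: list of image-region feature vectors (the queries).
    words:   list of word feature vectors (used as both keys and values).
    Returns one text context vector per region: the attention-weighted
    average of the word vectors.
    """
    dim = len(words[0])
    scale = math.sqrt(dim)
    contexts = []
    for region in regions:
        # Similarity of this region to every word, scaled by sqrt(dim).
        scores = [sum(r * w for r, w in zip(region, word)) / scale
                  for word in words]
        weights = softmax(scores)
        # Weighted average of word vectors, one output value per dimension.
        context = [sum(wt * word[k] for wt, word in zip(weights, words))
                   for k in range(dim)]
        contexts.append(context)
    return contexts


if __name__ == "__main__":
    # Toy features, invented for illustration only.
    toy_regions = [[1.0, 0.0], [0.0, 1.0]]
    toy_words = [[1.0, 0.0], [0.0, 1.0]]
    for ctx in cross_attention(toy_regions, toy_words):
        print(ctx)
```

In a real text-to-image model, the region and word features would come from learned image and text encoders, and the resulting context vectors would feed into an alignment or matching loss; none of that is shown here.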