AtHom: Two Divergent Attentions Stimulated by Homomorphic Training in Text-to-Image Synthesis

Zhenbo Shi, Zhi Chen, Zhenbo Xu, Wei Yang, Liusheng Huang
DOI: https://doi.org/10.1145/3503161.3548159
2022-01-01
Abstract: Image generation from text is a challenging and ill-posed task. Images generated by previous methods usually have low semantic consistency with the input text, and the achieved resolution is limited. To generate semantically consistent high-resolution images, we propose a novel method named AtHom, in which two attention modules are developed to extract relationships from both the independent modalities and the unified modality. The first is a novel Independent Modality Attention Module (IAM), which identifies semantically important areas in generated images and extracts the informative context in texts. The second is a new module named Unified Semantic Space Attention Module (UAM), which captures the relationships between the extracted text context and the essential areas in generated images. In particular, to bring the semantic features of texts and images closer in a unified semantic space, AtHom incorporates a homomorphic training mode that uses an extra discriminator to distinguish between the two modalities. Extensive experiments show that AtHom surpasses previous methods by large margins.
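
The abstract does not give the loss used by the extra discriminator, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a standard binary cross-entropy adversarial setup: a hypothetical linear discriminator (parameters `w`, `b`) tries to tell text features from image features, while the feature encoders are trained against the flipped labels so that the two modalities become hard to distinguish. The function names, the linear discriminator, and the label convention (text = 1, image = 0) are all assumptions made for illustration.

```python
import numpy as np


def sigmoid(z):
    """Logistic function mapping scores to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def bce(probs, labels, eps=1e-12):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    probs = np.clip(probs, eps, 1.0 - eps)
    return float(-np.mean(labels * np.log(probs)
                          + (1.0 - labels) * np.log(1.0 - probs)))


def modality_losses(text_feats, image_feats, w, b):
    """Return (discriminator_loss, encoder_loss) for one batch.

    text_feats, image_feats: arrays of shape (n, d) assumed to live in a
    shared semantic space. w (shape (d,)) and b (scalar) parameterise a
    hypothetical linear modality discriminator.
    """
    feats = np.concatenate([text_feats, image_feats], axis=0)
    # Assumed label convention: 1 = text feature, 0 = image feature.
    labels = np.concatenate([np.ones(len(text_feats)),
                             np.zeros(len(image_feats))])
    probs = sigmoid(feats @ w + b)
    d_loss = bce(probs, labels)        # discriminator: separate the modalities
    e_loss = bce(probs, 1.0 - labels)  # encoders: fool the discriminator
    return d_loss, e_loss
```

Under these assumptions, a discriminator that cannot separate the modalities (for example, all-zero weights) yields the same loss for both sides, whereas clearly separated features give a small discriminator loss and a large encoder loss, which is the signal that would push the two feature distributions together.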