RATLIP: Generative Adversarial CLIP Text-to-Image Synthesis Based on Recurrent Affine Transformations

Chengde Lin, Xijun Lu, Guangxi Chen

2024-05-14

Abstract: Synthesizing high-quality photorealistic images with textual descriptions as a condition is very challenging. Generative Adversarial Networks (GANs), the classical model for this task, frequently suffer from low consistency between image and text descriptions and insufficient richness in synthesized images. Recently, conditional affine transformations (CAT), such as conditional batch normalization and instance normalization, have been applied to different layers of GANs to control content synthesis in images. CAT is a multi-layer perceptron that independently predicts data based on batch statistics between neighboring layers, with global textual information unavailable to other layers. To address this issue, we first model CAT and a recurrent neural network (RAT) to ensure that different layers can access global information. We then introduce shuffle attention between RAT to mitigate the characteristic of information forgetting in recurrent neural networks. Moreover, both our generator and discriminator utilize the powerful pre-trained model CLIP, which has been extensively employed for establishing associations between text and images through the learning of multimodal representations in latent space. The discriminator utilizes CLIP's ability to comprehend complex scenes to accurately assess the quality of the generated images. Extensive experiments have been conducted on the CUB, Oxford, and CelebA-tiny datasets to demonstrate the superiority of the proposed model over current state-of-the-art models. The code is

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses three problems in text-to-image synthesis:

1. **Improving image quality and consistency**: Existing GAN-based methods often produce images that match the text description poorly and lack richness, which makes high-quality, realistic synthesis difficult.
2. **Mitigating information forgetting**: Methods based on recurrent neural networks (RNNs) tend to forget information when processing long sequences. The paper proposes a Recurrent Affine Transformation module (RAT Block), which combines an LSTM with shuffle attention to alleviate this problem.
3. **Enhancing text fusion**: By making global text information available to every layer, the paper keeps feature fusion across layers consistent and semantically relevant, which improves the overall text fusion.

To address these problems, the authors propose the RATLIP model, which uses the pre-trained CLIP model in both the generator and the discriminator, and they report superior performance on multiple datasets.
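The recurrent affine idea described above can be illustrated with a small sketch. This is not the authors' implementation: it assumes a plain tanh recurrent cell in place of the LSTM, omits shuffle attention and CLIP entirely, and uses hypothetical names (`RecurrentAffine`, `init_linear`, `linear`) and toy dimensions. It only shows how a hidden state shared across layers can drive each layer's channel-wise scale and shift from the same text vector.

```python
import math
import random


def init_linear(rng, n_out, n_in, bias=0.0, scale=0.1):
    """Return (weights, biases) for a small dense layer with random weights."""
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [bias] * n_out


def linear(params, x):
    """Apply a dense layer: one dot product plus bias per output unit."""
    weights, biases = params
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


class RecurrentAffine:
    """Toy recurrent affine transformation (hypothetical simplification).

    One recurrent cell is shared by all layers, so every layer's scale and
    shift depend on the same text vector and on what earlier layers did.
    """

    def __init__(self, text_dim, hidden_dim, channels, num_layers, seed=0):
        rng = random.Random(seed)
        self.w_text = init_linear(rng, hidden_dim, text_dim)
        self.w_hidden = init_linear(rng, hidden_dim, hidden_dim)
        # Per-layer heads; the gamma bias of 1.0 is an assumption that keeps
        # the initial scale close to the identity.
        self.gamma_heads = [init_linear(rng, channels, hidden_dim, bias=1.0)
                            for _ in range(num_layers)]
        self.beta_heads = [init_linear(rng, channels, hidden_dim)
                           for _ in range(num_layers)]

    def apply(self, layer, features, text, hidden):
        # Update the shared hidden state from the text vector and previous state.
        pre = zip(linear(self.w_text, text), linear(self.w_hidden, hidden))
        hidden = [math.tanh(a + b) for a, b in pre]
        # Predict one scale and one shift per channel for this layer.
        gamma = linear(self.gamma_heads[layer], hidden)
        beta = linear(self.beta_heads[layer], hidden)
        out = [[g * v + b for v in channel]
               for channel, g, b in zip(features, gamma, beta)]
        return out, hidden


# Usage: three layers condition the same two-channel feature map in turn.
rat = RecurrentAffine(text_dim=4, hidden_dim=3, channels=2, num_layers=3)
text = [0.5, -0.2, 0.1, 0.9]          # stand-in for a sentence embedding
hidden = [0.0, 0.0, 0.0]
features = [[1.0, 2.0], [3.0, 4.0]]   # two channels, two values each
for layer in range(3):
    features, hidden = rat.apply(layer, features, text, hidden)
```

Carrying `hidden` from one call to the next is the point of the sketch: with independent per-layer predictors, each layer would see the text but not what the other layers had already done with it.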

Learn, Imagine and Create: Text-to-Image Generation from Prior Knowledge.

Recurrent Affine Transformation for Text-to-image Synthesis

RAT-Cycle-GAN for text to image

GALIP: Generative Adversarial CLIPs for Text-to-Image Synthesis

CgT-GAN: CLIP-guided Text GAN for Image Captioning

Text to Photo-Realistic Image Synthesis Via Chained Deep Recurrent Generative Adversarial Network.

Perceptual Pyramid Adversarial Networks for Text-to-Image Synthesis.

KT-GAN: Knowledge-Transfer Generative Adversarial Network for Text-to-Image Synthesis

CT-GAN: A conditional Generative Adversarial Network of transformer architecture for text-to-image

OptGAN: Optimizing and Interpreting the Latent Space of the Conditional Text-to-Image GANs

GACnet-Text-to-Image Synthesis With Generative Models Using Attention Mechanisms With Contrastive Learning

Language-vision Matching for Text-to-image Synthesis with Context-Aware GAN

Dualattn-GAN: Text to Image Synthesis with Dual Attentional Generative Adversarial Network.

R-GAN: Exploring Human-like Way for Reasonable Text-to-Image Synthesis via Generative Adversarial Networks

Hybrid Attention Driven Text-To-Image Synthesis Via Generative Adversarial Networks

DTGAN: Dual Attention Generative Adversarial Networks for Text-to-Image Generation

Learning to Draw Text in Natural Images with Conditional Adversarial Networks

SS-GANs: Text-to-Image via Stage by Stage Generative Adversarial Networks

Controllable Text-to-Image Generation with Enhanced Text Encoder and Edge-Preserving Embedding