RATLIP: Generative Adversarial CLIP Text-to-Image Synthesis Based on Recurrent Affine Transformations

Chengde Lin, Xijun Lu, Guangxi Chen
2024-05-14
Abstract: Synthesizing high-quality photorealistic images conditioned on textual descriptions is very challenging. Generative Adversarial Networks (GANs), the classical model for this task, frequently suffer from low consistency between images and text descriptions and from insufficient richness in the synthesized images. Recently, conditional affine transformations (CAT), such as conditional batch normalization and instance normalization, have been applied to different layers of GANs to control content synthesis in images. CAT is a multi-layer perceptron that independently predicts data based on batch statistics between neighboring layers, so global textual information is unavailable to other layers. To address this issue, we first combine CAT with a recurrent neural network to form a recurrent affine transformation (RAT), which ensures that different layers can access global information. We then introduce shuffle attention between RAT blocks to mitigate the information forgetting that is characteristic of recurrent neural networks. Moreover, both our generator and discriminator utilize the powerful pre-trained model CLIP, which has been extensively employed for establishing associations between text and images through the learning of multimodal representations in latent space. The discriminator utilizes CLIP's ability to comprehend complex scenes to accurately assess the quality of the generated images. Extensive experiments have been conducted on the CUB, Oxford, and CelebA-tiny datasets to demonstrate the superiority of the proposed model over current state-of-the-art models. The code is
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper aims to address several key issues in text-to-image synthesis:

1. **Improving Image Quality and Consistency**: Existing GAN-based methods often produce images that are poorly aligned with the text description and that lack richness, which makes high-quality, realistic synthesis difficult.
2. **Improving Long-term Memory and Reducing Information Forgetting**: Methods based on recurrent neural networks (RNNs) tend to forget information when processing long sequences. The paper proposes a new Recurrent Affine Transformation module (RAT Block), which combines LSTM and shuffle attention mechanisms to alleviate this problem.
3. **Enhancing Text Fusion**: By making global text information available to every layer, the paper aims to keep feature fusion across layers consistent and semantically relevant, thereby improving the overall text fusion effect.

To address these issues, the authors propose the RATLIP model, which leverages the pre-trained CLIP model to enhance the capabilities of the generator and discriminator, and they validate its performance experimentally on the CUB, Oxford, and CelebA-tiny datasets.
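
The idea of sharing text conditioning across layers through a recurrent state can be illustrated with a small sketch. This is not the authors' implementation: it assumes a plain tanh recurrence in place of the LSTM, omits shuffle attention and CLIP entirely, reuses one set of weights at every layer, and uses hypothetical names throughout (`RecurrentAffine`, `step`, `init_linear`). It shows only how a hidden state carried from layer to layer can predict the channel-wise scale and shift applied to each feature map.

```python
import math
import random


def linear(weights, bias, x):
    """Dense layer: `weights` is a list of rows, `x` is a flat vector."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def init_linear(rng, n_out, n_in, scale=0.1):
    """Small random weights and zero bias (illustrative initialization)."""
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


class RecurrentAffine:
    """Toy recurrent affine conditioning (hypothetical, not the paper's RAT block)."""

    def __init__(self, text_dim, hidden_dim, channels, seed=0):
        rng = random.Random(seed)
        self.w_in = init_linear(rng, hidden_dim, text_dim)
        self.w_rec = init_linear(rng, hidden_dim, hidden_dim)
        self.w_gamma = init_linear(rng, channels, hidden_dim)
        self.w_beta = init_linear(rng, channels, hidden_dim)

    def step(self, features, text, hidden):
        # Assumption: a plain tanh recurrence stands in for the LSTM cell.
        from_text = linear(*self.w_in, text)
        from_state = linear(*self.w_rec, hidden)
        hidden = [math.tanh(a + b) for a, b in zip(from_text, from_state)]
        # Predict one scale offset and one shift per channel from the state.
        gamma = linear(*self.w_gamma, hidden)
        beta = linear(*self.w_beta, hidden)
        # Channel-wise affine transformation: (1 + gamma) * x + beta.
        out = [[(1.0 + g) * v + b for v in channel]
               for channel, g, b in zip(features, gamma, beta)]
        return out, hidden


# Usage: thread the same hidden state through three generator "layers".
text = [0.5, -1.0, 0.25]               # stand-in for a text embedding
features = [[1.0, 2.0], [3.0, 4.0]]    # 2 channels, 2 spatial values each
block = RecurrentAffine(text_dim=3, hidden_dim=4, channels=2)
hidden = [0.0] * 4
for _ in range(3):
    features, hidden = block.step(features, text, hidden)
```

Because the hidden state is passed from one layer to the next, each layer's scale and shift depend on the conditioning applied earlier, whereas independent per-layer conditioning would compute them from the text alone.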