Abstract: Transformer becomes prevalent in computer vision, especially for high-level vision tasks. However, adopting Transformer in the generative adversarial network (GAN) framework is still an open yet challenging problem. In this paper, we conduct a comprehensive empirical study to investigate the properties of Transformer in GAN for high-fidelity image synthesis. Our analysis highlights and reaffirms the importance of feature locality in image generation, although the merits of the locality are well known in the classification task. Perhaps more interestingly, we find the residual connections in self-attention layers harmful for learning Transformer-based discriminators and conditional generators. We carefully examine the influence and propose effective ways to mitigate the negative impacts. Our study leads to a new alternative design of Transformers in GAN, a convolutional neural network (CNN)-free generator termed as STrans-G, which achieves competitive results in both unconditional and conditional image generations. The Transformer-based discriminator, STrans-D, also significantly reduces its gap against the CNN-based discriminators.

What problem does this paper attempt to address?

The paper's main goal is to apply the Transformer architecture to generative adversarial networks (GANs) for high-fidelity image generation. It focuses on three aspects:

1. **The importance of feature locality**: Although the benefits of feature locality are well known in classification, the paper shows that locality matters for image generation as well. It finds that the global self-attention widely used in existing Transformer-based GANs harms synthesis quality and is too computationally expensive for high-resolution image generation. The paper therefore explores several ways to increase locality and finds the Swin layer to be the most effective building block, since it provides a local inductive bias.

2. **Residual connections in the discriminator**: Each sub-layer in a Transformer (the self-attention layer and the point-wise fully connected layer) is wrapped in a residual connection. Through a detailed norm ratio analysis, the paper finds that these residual connections tend to dominate the information flow, so the self-attention and fully connected sub-layers in the discriminator are inadvertently bypassed. This hurts training quality and convergence speed. As an alternative, the paper proposes a skip-projection layer that better preserves the information flow in the residual block.

3. **Conditional normalization in conditional generation**: For Transformer-based conditional GANs, the conventional way of injecting class-conditional information is not effective. The main reason is that much of the information flow in the Transformer generator passes through the residual connections. If the conditional information is injected in the main branch, it is largely ignored and contributes little to the final output. The paper proposes using a conditional normalization layer in the backbone, which helps preserve the conditional information throughout the Transformer generator.

With these changes, the paper narrows the performance gap between Transformer-based GANs and contemporary CNN-based GANs, especially in conditional generation, a setting that previous research has explored less. The design choices can be implemented without complex architectural modifications, and the proposed model, STransGAN, performs at or close to the state of the art on multiple datasets.
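The residual-connection and conditional-normalization points above can be illustrated with a small NumPy sketch. It is not the paper's implementation: the function names (`norm_ratio`, `skip_projection_block`, `conditional_layer_norm`), the ratio definition (skip-path norm divided by sub-layer output norm), the fixed scaled-identity matrix standing in for a learned skip projection, and all sizes are assumptions chosen only to make the idea concrete.

```python
import numpy as np

def norm_ratio(skip, branch):
    """Mean per-token ratio ||skip|| / ||branch|| (assumed definition).
    Values far above 1 mean the skip path dominates the block output."""
    skip_norm = np.linalg.norm(skip, axis=-1)
    branch_norm = np.linalg.norm(branch, axis=-1)
    return float(np.mean(skip_norm / (branch_norm + 1e-12)))

def residual_block(x, sublayer):
    """Standard residual block: identity skip plus sub-layer output."""
    branch = sublayer(x)
    return x + branch, norm_ratio(x, branch)

def skip_projection_block(x, sublayer, w_skip):
    """Hypothetical skip-projection variant: the skip path goes through a
    linear map (learned in practice) instead of the identity."""
    skip = x @ w_skip
    branch = sublayer(x)
    return skip + branch, norm_ratio(skip, branch)

def conditional_layer_norm(h, class_id, gamma_table, beta_table, eps=1e-5):
    """Layer norm on the trunk with per-class scale and shift, so class
    information is applied to everything the trunk carries forward."""
    mean = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    h_hat = (h - mean) / np.sqrt(var + eps)
    return gamma_table[class_id] * h_hat + beta_table[class_id]

rng = np.random.default_rng(0)
tokens, dim, num_classes = 16, 32, 4           # illustrative sizes
x = rng.normal(size=(tokens, dim)) * 10.0      # large-magnitude features
w_branch = rng.normal(size=(dim, dim)) * 0.01  # weak sub-layer weights

def sublayer(h):
    # Stand-in for a self-attention or MLP sub-layer.
    return h @ w_branch

w_skip = np.eye(dim) * 0.05  # stand-in for a learned skip projection

out_id, ratio_id = residual_block(x, sublayer)
out_sp, ratio_sp = skip_projection_block(x, sublayer, w_skip)

gamma_table = 1.0 + 0.1 * rng.normal(size=(num_classes, dim))
beta_table = rng.normal(size=(num_classes, dim))
out_c0 = conditional_layer_norm(out_sp, 0, gamma_table, beta_table)
out_c1 = conditional_layer_norm(out_sp, 1, gamma_table, beta_table)

print("identity skip ratio:", ratio_id)
print("projected skip ratio:", ratio_sp)
```

In this toy setup the identity skip carries far more norm than the sub-layer output, while the projected skip brings the two paths to a comparable scale. The conditional normalization is applied to the block output rather than inside the branch, which is one way to read "in the backbone"; the paper's exact placement may differ.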

The Nuts and Bolts of Adopting Transformer in GANs

TransGAN: Two Pure Transformers Can Make One Strong GAN, and That Can Scale Up

TcGAN: Semantic-Aware and Structure-Preserved GANs with Individual Vision Transformer for Fast Arbitrary One-Shot Image Generation

Combining Transformer Generators with Convolutional Discriminators

Transformer-based Generative Adversarial Networks in Computer Vision: A Comprehensive Survey

CT-GAN: A conditional Generative Adversarial Network of transformer architecture for text-to-image

ViTGAN: Training GANs with Vision Transformers

Style Transformer for Image Inversion and Editing

Rethinking low-light enhancement via Transformer-GAN

Mechanisms of Generative Image-to-Image Translation Networks

Understanding the Difficulty of Training Transformers

Transformation GAN for Unsupervised Image Synthesis and Representation Learning

Attention-GAN for Object Transfiguration in Wild Images

SRTransGAN: Image Super-Resolution using Transformer based Generative Adversarial Network

The theoretical research of generative adversarial networks: an overview

Improved Transformer for High-Resolution GANs

Generalized Probabilistic Attention Mechanism in Transformers

Unlocking the Power of GANs in Non-Autoregressive Text Generation

Rob-GAN: Generator, Discriminator, and Adversarial Attacker

Time-series Transformer Generative Adversarial Networks

Recent Advances of Generative Adversarial Networks in Computer Vision