The Nuts and Bolts of Adopting Transformer in GANs

Rui Xu, Xiangyu Xu, Kai Chen, Bolei Zhou, Chen Change Loy
2023-06-13
Abstract: Transformer becomes prevalent in computer vision, especially for high-level vision tasks. However, adopting Transformer in the generative adversarial network (GAN) framework is still an open yet challenging problem. In this paper, we conduct a comprehensive empirical study to investigate the properties of Transformer in GAN for high-fidelity image synthesis. Our analysis highlights and reaffirms the importance of feature locality in image generation, although the merits of the locality are well known in the classification task. Perhaps more interestingly, we find the residual connections in self-attention layers harmful for learning Transformer-based discriminators and conditional generators. We carefully examine the influence and propose effective ways to mitigate the negative impacts. Our study leads to a new alternative design of Transformers in GAN, a convolutional neural network (CNN)-free generator termed as STrans-G, which achieves competitive results in both unconditional and conditional image generations. The Transformer-based discriminator, STrans-D, also significantly reduces its gap against the CNN-based discriminators.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The main problem this paper addresses is how to apply the Transformer architecture to Generative Adversarial Networks (GANs) to achieve high-fidelity image generation. Specifically, the paper focuses on the following aspects:

1. **The importance of feature locality**: Although the advantages of feature locality in classification tasks are well known, the paper emphasizes that maintaining locality also matters in image generation. It finds that the global self-attention operation widely used in existing Transformer-based GANs harms synthesis performance and is too computationally expensive for high-resolution image generation. The paper therefore explores several methods to increase locality and finds that the Swin layer is the most effective building block, since it provides a local inductive bias.
2. **The problem of residual connections in the discriminator**: Each sub-layer in the Transformer (such as the self-attention layer and the point-wise fully-connected layer) is wrapped in a residual connection. Through a detailed norm ratio analysis, the paper finds that these residual connections tend to dominate the information flow, so the self-attention and fully-connected sub-layers in the discriminator are inadvertently bypassed, which hurts training quality and convergence speed. As an alternative, the paper proposes a skip-projection layer that better preserves the information flow in the residual block.
3. **The application of conditional normalization in conditional generation**: For Transformer-based conditional GANs, the traditional method of injecting conditional category information is not effective. The main reason is that a large share of the information flow in the Transformer generator passes through the residual connection. If the conditional information is injected in the main branch, it is largely ignored and contributes little to the final output. The paper proposes using a conditional normalization layer in the backbone, which helps preserve the conditional information throughout the Transformer generator.

Through this research, the paper reduces the performance gap between Transformer-based GANs and contemporary CNN-based GANs, especially in the conditional generation setting, which has been less explored in previous research. The design choices in the paper can be implemented without complex architectural modifications, and the proposed model, STransGAN, performs at or close to the current state-of-the-art level on multiple datasets.
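
The norm ratio analysis mentioned in point 2 can be illustrated with a small toy sketch. It is not the paper's implementation: the ratio definition (norm of the skip path divided by norm of the sub-layer output), the helper names `norm_ratio`, `residual_block`, and `toy_sublayer`, and the hand-picked projection weights are all assumptions made for illustration. In a real model the sub-layer would be self-attention or an MLP, and any skip projection would be learned rather than fixed.

```python
import math


def l2_norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def matvec(weights, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]


def norm_ratio(skip, branch):
    """Assumed metric: ||skip|| / ||branch||.

    Values far above 1 suggest the skip path dominates the block output.
    """
    return l2_norm(skip) / l2_norm(branch)


def residual_block(x, sublayer, skip_proj=None):
    """Compute skip(x) + sublayer(x) and report the norm ratio.

    skip_proj is a hypothetical projection matrix applied on the skip path;
    when it is None, the skip path is the plain identity.
    """
    skip = x if skip_proj is None else matvec(skip_proj, x)
    branch = sublayer(x)
    out = [s + b for s, b in zip(skip, branch)]
    return out, norm_ratio(skip, branch)


def toy_sublayer(v):
    """Stand-in for attention/MLP: a sub-layer whose output is small."""
    return [0.1 * x for x in v]


x = [1.0, -2.0, 3.0, 0.5]

# Identity skip: the skip path carries far more signal than the sub-layer.
_, ratio_identity = residual_block(x, toy_sublayer)

# Hand-picked projection (0.1 * identity) that rescales the skip path,
# chosen only to show how a projection can rebalance the two paths.
proj = [[0.1 if i == j else 0.0 for j in range(len(x))] for i in range(len(x))]
_, ratio_projected = residual_block(x, toy_sublayer, skip_proj=proj)

print("identity skip ratio:", ratio_identity)
print("projected skip ratio:", ratio_projected)
```

In this toy setup the identity skip is about ten times larger in norm than the sub-layer output, while the rescaled skip path brings the two to roughly equal magnitude. This mirrors the qualitative argument in the summary, not any number reported in the paper.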