ViTGAN: Training GANs with Vision Transformers

Kwonjoon Lee, Huiwen Chang, Lu Jiang, Han Zhang, Zhuowen Tu, Ce Liu
2024-05-29
Abstract: Recently, Vision Transformers (ViTs) have shown competitive performance on image recognition while requiring less vision-specific inductive biases. In this paper, we investigate if such performance can be extended to image generation. To this end, we integrate the ViT architecture into generative adversarial networks (GANs). For ViT discriminators, we observe that existing regularization methods for GANs interact poorly with self-attention, causing serious instability during training. To resolve this issue, we introduce several novel regularization techniques for training GANs with ViTs. For ViT generators, we examine architectural choices for latent and pixel mapping layers to facilitate convergence. Empirically, our approach, named ViTGAN, achieves comparable performance to the leading CNN-based GAN models on three datasets: CIFAR-10, CelebA, and LSUN bedroom.
Computer Vision and Pattern Recognition, Machine Learning, Image and Video Processing
What problem does this paper attempt to address?
The main problem this paper addresses is: **how to use Vision Transformers (ViTs) for image generation without convolution or pooling, and how to train the resulting Generative Adversarial Networks (GANs) to reach a quality comparable to GANs based on Convolutional Neural Networks (CNNs)**. Specifically, the paper explores the following aspects:

1. **Whether ViTs can be used for image generation**: The authors integrate the ViT architecture into the GAN framework to examine how well ViTs perform at image generation.
2. **Stability when ViTs are used as discriminators**: The authors find that existing GAN regularization methods interact poorly with the self-attention mechanism, causing serious instability during training. To address this, they introduce several new regularization techniques that stabilize training.
3. **Architecture design when ViTs are used as generators**: The authors study architectural choices for the latent mapping layer and the pixel mapping layer to help the model converge.
4. **Performance verification**: Experiments on the CIFAR-10, CelebA, and LSUN bedroom datasets verify that the proposed method, named ViTGAN, achieves performance comparable to the leading CNN-based GAN models.

### Main contributions

- **Proposing the ViTGAN model**: The model combines the ViT architecture with the GAN framework and achieves performance comparable to CNN-based GAN models on image generation tasks.
- **Addressing the stability problem of ViTs in GANs**: New regularization techniques and an improved spectral normalization method address the instability of ViTs in GAN training.
- **Optimizing the architectures of the generator and the discriminator**: Adjusting both architectures improves training efficiency and generation quality.

### Experimental results

- **Quantitative evaluation**: On the CIFAR-10, CelebA, and LSUN bedroom datasets, ViTGAN's Fréchet Inception Distance (FID) and Inception Score (IS) reach a level comparable to the leading CNN-based GAN models.
- **Qualitative comparison**: The quality and diversity of the generated images are comparable to those of advanced models such as StyleGAN2, and in some respects very strong.

### Conclusion

This paper demonstrates the potential of ViTs for image generation and overcomes the stability problem of ViTs in GAN training through a series of technical improvements. These results provide an important reference for future Transformer-based image generation research.
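
To make the spectral normalization mentioned under the main contributions more concrete, here is a minimal sketch of the standard technique: estimate a weight matrix's largest singular value by power iteration, then divide the weights by it. The summary does not spell out the paper's improved variant, so the optional `scale` argument is an assumption, standing in for a rescaling factor such as the spectral norm at initialization. The function names are hypothetical and not taken from the authors' code.

```python
import math
import random


def matvec(w, v):
    """Multiply matrix w (a list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in w]


def transpose(w):
    """Return the transpose of matrix w as a list of rows."""
    return [list(col) for col in zip(*w)]


def norm(v):
    """Euclidean norm of vector v."""
    return math.sqrt(sum(x * x for x in v))


def spectral_norm(w, n_iters=100, seed=0):
    """Estimate the largest singular value of w by power iteration on w^T w."""
    rng = random.Random(seed)
    wt = transpose(w)
    v = [rng.gauss(0.0, 1.0) for _ in range(len(w[0]))]
    for _ in range(n_iters):
        v = matvec(wt, matvec(w, v))
        n = norm(v)
        if n == 0.0:  # w maps v to zero, e.g. the all-zero matrix
            return 0.0
        v = [x / n for x in v]
    # v is now (approximately) the top right-singular vector, with unit norm.
    return norm(matvec(w, v))


def spectral_normalize(w, scale=1.0, n_iters=100):
    """Divide w by its spectral norm, then multiply by `scale`.

    scale=1.0 gives standard spectral normalization. Any other value is
    an illustrative assumption, not a setting confirmed by the summary.
    """
    sigma = spectral_norm(w, n_iters)
    if sigma == 0.0:
        raise ValueError("cannot normalize a matrix with zero spectral norm")
    return [[scale * x / sigma for x in row] for row in w]
```

For the diagonal matrix `[[3.0, 0.0], [0.0, 1.0]]`, `spectral_norm` returns approximately 3, and the output of `spectral_normalize` has a spectral norm of approximately 1. Deep learning frameworks normally run a single power-iteration step per training update and reuse the vector between steps. The fixed 100 iterations here are only to keep the sketch self-contained.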