Abstract: Recently, Vision Transformers (ViTs) have shown competitive performance on image recognition while requiring less vision-specific inductive biases. In this paper, we investigate if such performance can be extended to image generation. To this end, we integrate the ViT architecture into generative adversarial networks (GANs). For ViT discriminators, we observe that existing regularization methods for GANs interact poorly with self-attention, causing serious instability during training. To resolve this issue, we introduce several novel regularization techniques for training GANs with ViTs. For ViT generators, we examine architectural choices for latent and pixel mapping layers to facilitate convergence. Empirically, our approach, named ViTGAN, achieves comparable performance to the leading CNN-based GAN models on three datasets: CIFAR-10, CelebA, and LSUN bedroom.

What problem does this paper attempt to address?

The main problem this paper addresses is: **how to use Vision Transformers (ViTs) for image generation without convolution or pooling, and how to train the resulting Generative Adversarial Networks (GANs) to reach a quality comparable to that of GANs based on Convolutional Neural Networks (CNNs)**. Specifically, the paper explores the following aspects:

1. **Whether ViTs can be used for image generation**: The authors integrate the ViT architecture into the GAN framework to study how ViTs perform on image generation.
2. **Stability when ViTs are used as discriminators**: The authors find that existing GAN regularization methods interact poorly with the self-attention mechanism, causing serious instability during training. They introduce several new regularization techniques to stabilize training.
3. **Architecture design when ViTs are used as generators**: The authors study architectural choices for the latent mapping layer and the pixel mapping layer to help the model converge.
4. **Performance verification**: Experiments on the CIFAR-10, CelebA, and LSUN bedroom datasets verify that the proposed method, named ViTGAN, achieves performance comparable to that of the leading CNN-based GAN models.

### Main contributions

- **Proposing the ViTGAN model**: The model combines the ViT architecture with the GAN framework and achieves performance comparable to that of CNN-based GAN models on image generation.
- **Addressing the instability of ViTs in GAN training**: New regularization techniques and an improved spectral normalization method stabilize GAN training with ViTs.
- **Optimizing the generator and discriminator architectures**: Adjusting the architectures of the generator and the discriminator improves training efficiency and generation quality.

### Experimental results

- **Quantitative evaluation**: On the CIFAR-10, CelebA, and LSUN bedroom datasets, the FID and IS scores of ViTGAN are comparable to those of the leading CNN-based GAN models.
- **Qualitative comparison**: The quality and diversity of the generated images are comparable to those of advanced models such as StyleGAN2, and in some respects compare favorably.

### Conclusion

The paper demonstrates the potential of ViTs for image generation and overcomes the instability of ViTs in GAN training through a series of technical improvements. These results provide a reference point for future Transformer-based image generation research.
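The summary above mentions spectral normalization as one ingredient in stabilizing the discriminator. As a rough illustration only, the sketch below shows standard spectral normalization of a single weight matrix using power iteration in NumPy. It does not reproduce the paper's improved variant or its other regularizers, and the function names `spectral_norm_estimate` and `spectrally_normalize`, the matrix shape, and the iteration count are all assumptions chosen for the example.

```python
import numpy as np


def spectral_norm_estimate(weight, n_iters=100, seed=0):
    """Estimate the largest singular value of a 2-D matrix by power iteration."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(weight.shape[0])
    u /= np.linalg.norm(u)
    for _ in range(n_iters):
        # Alternate between right and left singular-vector estimates.
        v = weight.T @ u
        v /= np.linalg.norm(v) + 1e-12
        u = weight @ v
        u /= np.linalg.norm(u) + 1e-12
    # u^T W v approximates the top singular value once u and v have converged.
    return float(u @ weight @ v)


def spectrally_normalize(weight, n_iters=100):
    """Rescale a matrix so its largest singular value is approximately 1."""
    sigma = spectral_norm_estimate(weight, n_iters)
    return weight / sigma


# Hypothetical weight of one linear layer (shape chosen only for illustration).
rng = np.random.default_rng(42)
w = rng.standard_normal((64, 32))
w_sn = spectrally_normalize(w)

print("largest singular value before:", np.linalg.svd(w, compute_uv=False)[0])
print("largest singular value after: ", np.linalg.svd(w_sn, compute_uv=False)[0])
```

Bounding the largest singular value of each linear map limits how much that layer can amplify its input, which is the usual motivation for applying this normalization to GAN discriminators. Deep learning frameworks typically run one power-iteration step per training update rather than many steps at once, as done here for clarity.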

ViTGAN: Training GANs with Vision Transformers
