Abstract: This paper proposes SwinStyleformer, the first inversion network with a pure Transformer structure, which compensates for the shortcomings of CNN-based inversion frameworks by handling long-range dependencies and learning the global structure of objects. Experiments show, however, that an inversion network with a Transformer backbone fails to invert images successfully. This failure stems from differences between CNNs and Transformers: compared with convolution, self-attention weights favor image structure and neglect image details; Transformers lack multi-scale properties; and the distribution of the latent codes extracted by the Transformer differs from that of the StyleGAN style vectors. To address these differences, we employ the Swin Transformer with a smaller window size as the backbone of SwinStyleformer to enhance the local detail of the inverted image. We also design a Transformer block based on learnable queries. Compared with the self-attention Transformer block, it provides greater adaptability and flexibility, enabling the model to update the attention weights according to specific tasks, so the inversion focus is not limited to image structure. To further introduce multi-scale properties, we design multi-scale connections in the extraction of feature maps, which give the model a comprehensive understanding of the image and avoid the loss of detail caused by global modeling. Moreover, we propose an inversion discriminator and a distribution alignment loss to minimize the distribution differences. With these designs, SwinStyleformer solves the Transformer's inversion failure and demonstrates state-of-the-art (SOTA) performance in image inversion and several related vision tasks.
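To make the idea of a Transformer block based on learnable queries more concrete, here is a minimal NumPy sketch of cross-attention in which the queries are free parameters rather than projections of the image tokens. It is an illustration under stated assumptions, not the authors' implementation: the function name `learnable_query_attention`, the single-head layout, the toy dimensions, and the choice of 18 queries (one per style vector in a typical StyleGAN W+ code) are all assumptions made for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def learnable_query_attention(tokens, queries, w_k, w_v):
    """Single-head cross-attention with learned queries (hypothetical helper).

    tokens:  (n_tokens, dim)  image tokens from the backbone
    queries: (n_queries, dim) trainable parameters, independent of the input
    w_k/w_v: (dim, dim)       key and value projection matrices
    Returns (out, attn) with shapes (n_queries, dim) and (n_queries, n_tokens).
    """
    keys = tokens @ w_k
    values = tokens @ w_v
    # Each query scores every image token; scaling follows standard attention.
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    attn = softmax(scores, axis=-1)
    return attn @ values, attn

# Toy setup: all sizes below are illustrative assumptions.
rng = np.random.default_rng(0)
dim, n_tokens, n_queries = 32, 64, 18
tokens = rng.normal(size=(n_tokens, dim))
queries = rng.normal(size=(n_queries, dim)) * 0.02  # would be trained in practice
w_k = rng.normal(size=(dim, dim)) / np.sqrt(dim)
w_v = rng.normal(size=(dim, dim)) / np.sqrt(dim)

codes, attn = learnable_query_attention(tokens, queries, w_k, w_v)
print(codes.shape, attn.shape)
```

Because the queries are parameters, training can move each one toward whatever image regions are useful for the task, whereas in self-attention the queries are always derived from the image tokens themselves.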

What problem does this paper attempt to address?

The problem that this paper attempts to solve is: **in the image inversion task, how to overcome the limitations of the Transformer structure compared to the CNN structure in order to achieve high-quality image inversion**. Specifically, existing CNN-based image inversion algorithms have some inherent drawbacks, such as difficulty in capturing long-distance dependencies and learning the global structure of objects, which affect the quality of the inverted image. Although the Transformer has advantages in dealing with long-distance dependencies and global structures, difficulties arise when it is applied directly to image inversion, mainly in the following three aspects:

1. **Self-attention weights are biased towards the image structure and ignore local details**: Compared with convolution, the self-attention mechanism of the Transformer is more inclined to focus on the overall structure of the image and ignores local details.
2. **Lack of multi-scale characteristics**: CNNs have a powerful multi-scale design and can capture information at different granularities, thus balancing the global structure and local details. The Transformer is deficient in this regard.
3. **Differences in latent code distributions**: The distribution of the latent codes extracted by the Transformer differs from that of the StyleGAN style vectors.

To solve these problems, the authors propose **SwinStyleformer**, the first image inversion network with a pure Transformer structure. By introducing the Swin Transformer with a small window size as the backbone network and a Transformer module based on learnable queries, SwinStyleformer enhances the ability to model local details. In addition, multi-scale connections, a distribution alignment loss, and an inversion discriminator are designed to further improve the quality and robustness of image inversion.

In summary, this paper addresses the deficiencies of the Transformer structure in the image inversion task and demonstrates the strong performance of SwinStyleformer in image inversion and related vision tasks.
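The effect of the window size can be illustrated with a short NumPy sketch of non-overlapping window partitioning, the step that restricts attention to local patches in Swin-style backbones. This is a simplified illustration, not the paper's code: the helper names `window_partition` and `attention_pairs`, the 8×8 toy feature map, and the window sizes are assumptions chosen for the example, and shifted windows are omitted.

```python
import numpy as np

def window_partition(feature_map, window_size):
    """Split an (H, W, C) feature map into non-overlapping square windows.

    Returns an array of shape (num_windows, window_size * window_size, C);
    attention is then computed independently inside each window.
    """
    h, w, c = feature_map.shape
    if h % window_size or w % window_size:
        raise ValueError("H and W must be divisible by window_size")
    x = feature_map.reshape(h // window_size, window_size,
                            w // window_size, window_size, c)
    # Bring the two window-grid axes together, then flatten each window.
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, window_size * window_size, c)

def attention_pairs(h, w, window_size):
    """Count the query-key pairs scored when attention is limited to windows."""
    num_windows = (h // window_size) * (w // window_size)
    tokens_per_window = window_size * window_size
    return num_windows * tokens_per_window ** 2

# Toy 8x8 single-channel feature map whose values are the flat pixel indices.
fmap = np.arange(8 * 8, dtype=np.float32).reshape(8, 8, 1)
small = window_partition(fmap, 2)  # 16 windows of 4 tokens each
large = window_partition(fmap, 4)  # 4 windows of 16 tokens each

print(small.shape, large.shape)
print(small[0, :, 0])  # tokens of the top-left 2x2 window
print(attention_pairs(8, 8, 2), attention_pairs(8, 8, 4), attention_pairs(8, 8, 8))
```

With a smaller window, each token attends only to its immediate neighbors, which biases the model toward local detail. Setting the window to the full map size recovers global attention over all 64 tokens.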

SwinStyleformer is a favorable choice for image inversion

Style Transformer for Image Inversion and Editing

SwinIR: Image Restoration Using Swin Transformer

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

Improved deep learning image classification algorithm based on Swin Transformer V2

Swin transformer and ResNet based deep networks for low-light image enhancement

SwinNet: Swin Transformer drives edge-aware RGB-D and RGB-T salient object detection

Degenerate Swin to Win: Plain Window-based Transformer without Sophisticated Operations

CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows

SwinVI: 3D Swin Transformer Model with U-net for Video Inpainting

SwinSUNet: Pure Transformer Network for Remote Sensing Image Change Detection

Swin-GAN: Generative Adversarial Network Based on Shifted Windows Transformer Architecture for Image Generation

SwinFuse: A Residual Swin Transformer Fusion Network for Infrared and Visible Images

SwinHCST: a deep learning network architecture for scene classification of remote sensing images based on improved CNN and Transformer

Swin-Free: Achieving Better Cross-Window Attention and Efficiency with Size-varying Window

Cas-VSwin transformer: A variant swin transformer for surface-defect detection

Residual Swin Transformer Channel Attention Network for Image Demosaicing

Conv2Former: A Simple Transformer-Style ConvNet for Visual Recognition