Abstract: Vision Transformer (ViT) is widely used in computer vision. A ViT pipeline has four main steps, the "four secrets": patch division, token selection, position encoding addition, and attention calculation. Existing research on Transformers in computer vision focuses mainly on these four steps, so the questions "how to divide patches?", "how to select tokens?", "how to add position encoding?", and "how to calculate attention?" are crucial to improving ViT performance. So far, however, most review articles summarize the field from the application perspective, and none comprehensively summarizes these four steps from the technical perspective, which to some degree restricts the further development of ViT. To address this gap, this paper summarizes 4 major mechanisms and 5 applications of ViT. The main contributions are as follows. First, the basic principle and model structure of ViT are elaborated. Second, for "how to divide patches?", 5 key techniques of the patch division mechanism are summarized: from single-size to multi-size division, from fixed-number to adaptive-number division, from non-overlapping to overlapping division, from semantic segmentation division to semantic aggregation division, and from original-image division to feature-map division. Third, for "how to select tokens?", 3 key techniques of the token selection mechanism are summarized: score-based token selection, merge-based token selection, and token selection based on convolution and pooling. Fourth, for "how to add position encoding?", 5 key techniques of the position encoding mechanism are summarized: absolute position encoding, relative position encoding, conditional position encoding, locally-enhanced position encoding, and zero-padding position encoding. Fifth, for "how to calculate attention?", 18 attention mechanisms are summarized along a timeline. Sixth, models that combine Transformer with U-Net, GAN, YOLO, ResNet, and DenseNet are discussed in the context of medical image processing. Finally, around the four questions posed in this paper, we discuss future development directions for frontier technologies such as the patch division, token selection, position encoding, and attention mechanisms, which play an important role in the further development of ViT.
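To make the four steps named in the abstract concrete, the following is a minimal NumPy sketch, not the paper's implementation. It picks the simplest variant of each step: single-size non-overlapping patch division, score-based token selection, a learned-style absolute position table, and single-head scaled dot-product attention. All function names, the L2-norm score, the random weights, and the sizes (32x32 image, 8x8 patches, 16-dimensional tokens) are assumptions chosen for illustration.

```python
import numpy as np

def divide_patches(image, patch):
    """Split an (H, W, C) image into non-overlapping, single-size patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch, c) -> (gh, gw, patch, patch, c) -> (N, patch*patch*c)
    x = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * c)

def select_tokens(tokens, keep):
    """Score-based selection: keep the `keep` highest-scoring tokens.
    The L2 norm is a stand-in score assumed for this sketch."""
    scores = np.linalg.norm(tokens, axis=1)
    idx = np.sort(np.argsort(scores)[-keep:])  # keep original spatial order
    return tokens[idx], idx

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention."""
    q, k, v = x @ wq, x @ wk, x @ wv
    logits = q @ k.T / np.sqrt(k.shape[-1])
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32, 3))  # toy input, assumed size
dim = 16                                   # token width, assumed

# Step 1: patch division, then a linear embedding of each flattened patch.
patches = divide_patches(image, patch=8)
w_embed = rng.standard_normal((patches.shape[1], dim)) * 0.02
tokens = patches @ w_embed

# Step 2: token selection (keep half of the tokens).
kept, idx = select_tokens(tokens, keep=8)

# Step 3: add absolute position encoding, indexed by each token's original position.
pos_table = rng.standard_normal((patches.shape[0], dim)) * 0.02
kept = kept + pos_table[idx]

# Step 4: attention calculation over the remaining tokens.
wq, wk, wv = (rng.standard_normal((dim, dim)) * 0.1 for _ in range(3))
out, weights = self_attention(kept, wq, wk, wv)
```

The position table is indexed with `idx` so that each surviving token keeps the encoding of its original location. This is only one possible ordering of the steps. Many models add position information before any selection happens.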

Peeling Back the Layers: Interpreting the Storytelling of ViT

SLViT: Scale-Wise Language-Guided Vision Transformer for Referring Image Segmentation

EL-VIT: Probing Vision Transformer with Interactive Visualization

VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers

Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models

Vision transformer: To discover the "four secrets" of image patches

How Does Attention Work in Vision Transformers? A Visual Analytics Attempt

Interpreting and Controlling Vision Foundation Models via Text Explanations

Decomposing and Interpreting Image Representations via Text in ViTs Beyond CLIP

DeepViT: Towards Deeper Vision Transformer

Beyond the Doors of Perception: Vision Transformers Represent Relations Between Objects

ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias

Peeling the Onion: Hierarchical Reduction of Data Redundancy for Efficient Vision Transformer Training

Retina Vision Transformer (RetinaViT): Introducing Scaled Patches into Vision Transformers

Vision Transformer: Vit and its Derivatives

Super Vision Transformer

Hierarchical Vision and Language Transformer for Efficient Visual Dialog

AttentionViz: A Global View of Transformer Attention

SimViT: Exploring a Simple Vision Transformer with sliding windows

Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning