Abstract: The vision transformer (ViT) extends the success of transformer models from sequential data to images. The model decomposes an image into many smaller patches and arranges them into a sequence. Multi-head self-attention is then applied to the sequence to learn the attention between patches. Despite many successful interpretations of transformers on sequential data, little effort has been devoted to the interpretation of ViTs, and many questions remain unanswered. For example, among the numerous attention heads, which ones are more important? How strongly do individual patches attend to their spatial neighbors in different heads? What attention patterns have individual heads learned? In this work, we answer these questions through a visual analytics approach. Specifically, we first identify which heads are more important in ViTs by introducing multiple pruning-based metrics. Second, we profile the spatial distribution of attention strengths between patches inside individual heads, as well as the trend of attention strengths across attention layers. Third, using an autoencoder-based learning solution, we summarize all possible attention patterns that individual heads could learn. By examining the attention strengths and patterns of the important heads, we explain why they are important. Through concrete case studies with experienced deep learning experts on multiple ViTs, we validate the effectiveness of our solution, which deepens the understanding of ViTs in terms of head importance, head attention strength, and head attention pattern.
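To make the patch-sequence and per-head attention terminology concrete, the following is a minimal sketch, not the implementation studied in this work. It splits a toy grayscale image into patches and computes one attention matrix per head, where row i describes how strongly patch i attends to every patch. The function names (`patchify`, `head_attention`, `random_matrix`), the toy sizes, and the random projection weights are all illustrative assumptions. A real ViT also applies a learned patch embedding, position embeddings, and a class token, which are omitted here.

```python
import math
import random


def patchify(image, patch):
    """Split an H x W image (list of rows) into a row-major sequence of flattened patches."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    seq = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            seq.append([image[top + i][left + j]
                        for i in range(patch) for j in range(patch)])
    return seq


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(mat, vec):
    """Multiply a (d_out x d_in) matrix by a length-d_in vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in mat]


def head_attention(tokens, w_q, w_k):
    """Attention matrix of one head: row i is the attention of token i over all tokens."""
    queries = [matvec(w_q, t) for t in tokens]
    keys = [matvec(w_k, t) for t in tokens]
    scale = math.sqrt(len(w_q))  # square root of the head dimension
    return [softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
            for q in queries]


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned projection weights."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
image = [[rng.random() for _ in range(8)] for _ in range(8)]  # toy 8 x 8 image
tokens = patchify(image, patch=4)                             # 4 patches, 16 values each

num_heads, head_dim = 2, 3  # assumed toy sizes
heads = [
    head_attention(tokens,
                   random_matrix(head_dim, 16, rng),
                   random_matrix(head_dim, 16, rng))
    for _ in range(num_heads)
]
```

Each entry of `heads` is a 4 x 4 matrix whose rows sum to 1. Per-head matrices of this kind are the objects whose strengths and patterns the analysis above profiles.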

The Encoding Method of Position Embeddings in Vision Transformer

What Do Position Embeddings Learn? An Empirical Study of Pre-Trained Language Model Positional Encoding

Rethinking Position Embedding Methods in the Transformer Architecture

Conditional Positional Encodings for Vision Transformers

A bio-inspired positional embedding network for transformer-based models

Positional Label for Self-Supervised Vision Transformer

Understanding Gaussian Attention Bias of Vision Transformers Using Effective Receptive Fields

Vision Transformer: Vit and its Derivatives

ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias

How Does Attention Work in Vision Transformers? A Visual Analytics Attempt

Do Vision Transformers See Like Convolutional Neural Networks?

Rethinking and Improving Relative Position Encoding for Vision Transformer

On the Surprising Effectiveness of Attention Transfer for Vision Transformers

Vision Big Bird: Random Sparsification for Full Attention

Convolutional Embedding Makes Hierarchical Vision Transformer Stronger

$E(2)$-Equivariant Vision Transformer

A Simple and Effective Positional Encoding for Transformers

Improve Transformer Models with Better Relative Position Embeddings

Analyzing Vision Transformers for Image Classification in Class Embedding Space

Transformer with token attention and attribute prediction for image captioning

RegionViT: Regional-to-Local Attention for Vision Transformers