Abstract: Recently, a considerable number of studies in computer vision involve deep neural architectures called vision transformers. Visual processing in these models incorporates computational models that are claimed to implement attention mechanisms. Despite an increasing body of work that attempts to understand the role of attention mechanisms in vision transformers, their effect is largely unknown. Here, we asked whether the attention mechanisms in vision transformers exhibit effects similar to those known in human visual attention. To answer this question, we revisited the attention formulation in these models and found that, despite the name, these models computationally perform a special class of relaxation labeling with similarity grouping effects. Additionally, whereas modern experimental findings reveal that human visual attention involves both feed-forward and feedback mechanisms, the purely feed-forward architecture of vision transformers suggests that attention in these models will not have the same effects as those known in humans. To quantify these observations, we evaluated grouping performance in a family of vision transformers. Our results suggest that self-attention modules group figures in the stimuli based on similarity in visual features such as color. Also, in a singleton detection experiment as an instance of saliency detection, we studied whether these models exhibit effects similar to those of the feed-forward visual salience mechanisms utilized in human visual attention. We found that, generally, the transformer-based attention modules assign more salience either to distractors or to the ground. Together, our study suggests that the attention mechanisms in vision transformers perform similarity grouping and not attention.

What problem does this paper attempt to address?

This paper explores whether the self-attention mechanism in Vision Transformers actually functions like human visual attention. Specifically, the authors raise the following questions:

1. **Does the self-attention mechanism in Vision Transformers work like human visual attention?**
   - Although many studies claim that Vision Transformers implement attention through the self-attention mechanism, it is still unclear whether the effects of these models match those of human visual attention.
2. **What operations does the self-attention module actually perform in Vision Transformers?**
   - The authors re-examine the attention formulation in these models and find that it performs a special class of relaxation labeling with similarity grouping effects, rather than attention as it is understood in human vision.
3. **How does the feed-forward architecture of Vision Transformers affect their processing?**
   - Vision Transformers are purely feed-forward, whereas human visual attention involves both feed-forward and feedback mechanisms, so attention in these models is not expected to have the same effects as attention in humans.

### Main findings of the paper

- **The self-attention module performs similarity-based grouping**: Through experiments on a family of Vision Transformer models, the authors find that the self-attention module mainly groups image regions by similarity in visual features (such as color), rather than selectively focusing on specific regions as human visual attention does.
- **The feed-forward architecture limits the attention mechanism**: Because Vision Transformers are purely feed-forward, they can only implement bottom-up mechanisms and cannot combine them with top-down feedback as the human visual system does. Even against feed-forward salience mechanisms, the models fall short: in the singleton detection experiment, the attention modules generally assign more salience to the distractors or to the ground than to the singleton.
- **The experimental results support these conclusions**: The authors quantify model behavior with purpose-built experiments (similarity grouping and singleton detection), and the results indicate that the self-attention module performs similarity grouping rather than attention.

### Formula explanation

In Vision Transformers, self-attention is computed as follows:

\[ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \]

where:

- \( Q \), \( K \) and \( V \) are the query, key and value matrices, respectively;
- \( d_k \) is the dimension of the key/query vectors.

The formula updates the representation of each token in two steps: it computes the similarity (dot product) between that token's query and every key, then takes a weighted sum of the value vectors using the softmax-normalized similarities as weights. Because tokens with similar features receive large weights for one another, their updated representations are pulled together, which produces the similarity-based grouping effect.

### Summary

By analyzing the self-attention formulation in Vision Transformers and testing it experimentally, this paper argues that the "attention mechanism" in these models is closer to a similarity-based grouping process than to human visual attention. This finding helps clarify both the working principles and the limitations of Vision Transformers.
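To make the grouping effect of the scaled dot-product formula concrete, here is a minimal pure-Python sketch. It is not the paper's experimental setup: the four toy tokens, their 2-D "color" features, and the use of identity projections (so that Q = K = V) are all assumptions chosen for illustration, whereas a real Vision Transformer uses learned projection matrices and multiple heads.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Q, K, V are lists of token vectors (lists of floats).
    Returns (outputs, weights); weights[i][j] is how much token i
    draws from token j when its representation is updated.
    """
    d_k = len(K[0])
    weights = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights.append(softmax(scores))
    # Each output is a weighted sum of the value vectors.
    outputs = [
        [sum(w * v[c] for w, v in zip(row, V)) for c in range(len(V[0]))]
        for row in weights
    ]
    return outputs, weights

# Hypothetical toy input: tokens 0 and 1 are "red", tokens 2 and 3 are "green".
tokens = [[4.0, 0.0], [4.0, 0.0], [0.0, 4.0], [0.0, 4.0]]

# Assumption: identity projections, so queries, keys and values are the tokens.
outputs, weights = self_attention(tokens, tokens, tokens)

for i, row in enumerate(weights):
    print(i, [round(w, 3) for w in row])
```

In this toy case each token places nearly all of its weight on the tokens that share its feature (close to 0.5 each on the two same-color tokens) and almost none on the others, so the update mixes similar tokens together rather than selecting a single salient one.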

Self-attention in Vision Transformers Performs Perceptual Grouping, Not Attention

How Does Attention Work in Vision Transformers? A Visual Analytics Attempt

Emergence of Human-Like Attention in Self-Supervised Vision Transformers: an eye-tracking study

Constituent Attention for Vision Transformers

Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet

Dissecting Query-Key Interaction in Vision Transformers

You Only Need Less Attention at Each Stage in Vision Transformers

Self-Attention in Transformer Networks Explains Monkeys' Gaze Pattern in Pac-Man Game

Masked Attention as a Mechanism for Improving Interpretability of Vision Transformers

Enhancing Efficiency in Vision Transformer Networks: Design Techniques and Insights

Multi-manifold Attention for Vision Transformers

Local-to-Global Self-Attention in Vision Transformers

Vision Transformers with Hierarchical Attention

FAM: Improving columnar vision transformer with feature attention mechanism

Fixating on Attention: Integrating Human Eye Tracking into Vision Transformers

Armour: Generalizable Compact Self-Attention for Vision Transformers

Rethinking Attention Mechanisms in Vision Transformers with Graph Structures

AttentionViz: A Global View of Transformer Attention

On the Surprising Effectiveness of Attention Transfer for Vision Transformers

Unifying Top-down and Bottom-up Scanpath Prediction Using Transformers

Which Transformer to Favor: A Comparative Analysis of Efficiency in Vision Transformers