Self-attention in Vision Transformers Performs Perceptual Grouping, Not Attention

Paria Mehrani, John K. Tsotsos
2023-03-03
Abstract: Recently, a considerable number of studies in computer vision involves deep neural architectures called vision transformers. Visual processing in these models incorporates computational models that are claimed to implement attention mechanisms. Despite an increasing body of work that attempts to understand the role of attention mechanisms in vision transformers, their effect is largely unknown. Here, we asked if the attention mechanisms in vision transformers exhibit similar effects as those known in human visual attention. To answer this question, we revisited the attention formulation in these models and found that despite the name, computationally, these models perform a special class of relaxation labeling with similarity grouping effects. Additionally, whereas modern experimental findings reveal that human visual attention involves both feed-forward and feedback mechanisms, the purely feed-forward architecture of vision transformers suggests that attention in these models will not have the same effects as those known in humans. To quantify these observations, we evaluated grouping performance in a family of vision transformers. Our results suggest that self-attention modules group figures in the stimuli based on similarity in visual features such as color. Also, in a singleton detection experiment as an instance of saliency detection, we studied if these models exhibit similar effects as those of feed-forward visual salience mechanisms utilized in human visual attention. We found that generally, the transformer-based attention modules assign more salience either to distractors or the ground. Together, our study suggests that the attention mechanisms in vision transformers perform similarity grouping and not attention.
Computer Vision and Pattern Recognition
### What problem does this paper attempt to address?
This paper asks whether the self-attention mechanism in Vision Transformers actually behaves like human visual attention. Specifically, the authors raise the following questions:

1. **Does the self-attention mechanism in Vision Transformers work like human visual attention?**
   - Although many studies claim that Vision Transformers implement attention through self-attention, it remains unclear whether the effects of these mechanisms match those of human visual attention.
2. **What operation does the self-attention module actually perform in Vision Transformers?**
   - The authors re-examine the attention formulation in these models and find that, computationally, it performs a special class of relaxation labeling with similarity grouping effects rather than attention.
3. **How does the feed-forward architecture of Vision Transformers affect their processing?**
   - Vision Transformers are purely feed-forward, whereas human visual attention involves both feed-forward and feedback mechanisms, so attention in these models is not expected to have the same effects as in humans.

### Main findings of the paper

- **The self-attention module performs similarity-based grouping**: In experiments on a family of Vision Transformers, the authors find that self-attention modules group figures in the stimuli by similarity in visual features (such as color), rather than selectively focusing on specific regions as human visual attention does.
- **The feed-forward architecture limits the attention mechanism**: Because Vision Transformers are purely feed-forward, they can implement at most bottom-up attention and cannot incorporate the top-down feedback found in the human visual system. In the singleton detection experiment, the attention modules generally assigned more salience to the distractors or the ground than to the singleton.
- **The experimental results support these conclusions**: The authors quantify the behavior of Vision Transformers with purpose-built experiments (a similarity grouping experiment and a singleton detection experiment), and the results indicate that the self-attention module performs similarity grouping rather than attention.

### Formula explanation

In Vision Transformers, self-attention is computed as:

\[
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
\]

where:

- \( Q \), \( K \), and \( V \) are the query, key, and value matrices, respectively;
- \( d_k \) is the dimension of the key and query vectors.

The formula updates each token's representation by computing the dot-product similarity between its query and every key, normalizing these similarities with a softmax, and taking the correspondingly weighted sum of the value vectors. Because tokens whose queries and keys are similar receive larger weights from one another, each token's updated representation is dominated by the tokens most similar to it, which produces the similarity-based grouping effect.

### Summary

Through analysis of the self-attention formulation and experimental evaluation of Vision Transformers, this paper argues that the "attention mechanism" in these models is closer to a similarity-based grouping process than to human visual attention. This finding helps clarify the working principles and limitations of Vision Transformers.
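As an illustration of the formula in the Formula explanation section, the following is a minimal NumPy sketch of scaled dot-product self-attention on a toy input. It is not the authors' code or experimental setup. The four two-dimensional tokens are hypothetical, and the sketch assumes identity projections (so \( Q = K = V \) equal the raw tokens), whereas real Vision Transformers use learned projections and multiple heads.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max before exponentiating for numerical stability.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def self_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V and return (output, weights)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # pairwise query-key similarities
    weights = softmax(scores, axis=-1)   # each row is a distribution over tokens
    return weights @ V, weights

# Hypothetical toy input: tokens 0 and 1 share one feature, tokens 2 and 3 another.
tokens = np.array([
    [4.0, 0.0],
    [4.0, 0.2],
    [0.0, 4.0],
    [0.2, 4.0],
])

# Assumption: identity projections, so queries, keys, and values are the tokens.
output, weights = self_attention(tokens, tokens, tokens)

print(np.round(weights, 3))
print(np.round(output, 3))
```

In this toy setting each token places most of its weight on itself and on the token with similar features, so the updated representations of similar tokens move toward each other. This matches the grouping interpretation described in the summary, under the stated assumptions only.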