Abstract: Transformers, renowned for their self-attention mechanism, have achieved state-of-the-art performance across various tasks in natural language processing, computer vision, time-series modeling, etc. However, one of the challenges with deep Transformer models is the oversmoothing problem, where representations across layers converge to indistinguishable values, leading to significant performance degradation. We interpret the original self-attention as a simple graph filter and redesign it from a graph signal processing (GSP) perspective. We propose a graph-filter-based self-attention (GFSA) to learn a general yet effective one, whose complexity, however, is slightly larger than that of the original self-attention mechanism. We demonstrate that GFSA improves the performance of Transformers in various fields, including computer vision, natural language processing, graph-level tasks, speech recognition, and code classification.

What problem does this paper attempt to address?

### The problem the paper attempts to solve

This paper attempts to solve the **oversmoothing problem** in Transformer models. In deep Transformers, representations across layers converge to nearly indistinguishable values, which leads to significant performance degradation. By redesigning the self-attention mechanism from the perspective of graph signal processing (GSP), the paper proposes a graph-filter-based self-attention (GFSA) to mitigate this problem.

### Detailed Explanation

1. **Background and Motivation**:
   - **The Success and Limitations of Transformers**: Thanks to its self-attention mechanism, the Transformer has achieved state-of-the-art performance in natural language processing, computer vision, time-series modeling, etc. However, deep Transformer models suffer from oversmoothing: representations across layers gradually become similar, resulting in performance degradation.
   - **The Graph Filter Perspective on Self-Attention**: The paper interprets the original self-attention as a simple graph filter and redesigns it from the perspective of graph signal processing.
2. **Proposed Method**:
   - **Graph-Filter-based Self-Attention (GFSA)**: GFSA consists of an identity term and two matrix polynomial terms, namely \(\bar{A}\) and \(\bar{A}^K\). Here, \(\bar{A}\) is the learned attention matrix, and \(K\) is a hyperparameter.
   - **Approximation of the High-Order Term**: To reduce the computational cost, the paper uses a first-order Taylor expansion to approximate \(\bar{A}^K\), that is:
     \[ \bar{A}^K \approx \bar{A} + (K - 1)(\bar{A}^2 - \bar{A}) \]
3. **Experimental Results**:
   - **Multi-task Verification**: The paper verifies the effectiveness of GFSA in multiple fields, including natural language understanding, image classification, graph-level tasks, speech recognition, and code classification.
   - **Performance Improvement**: GFSA improves performance on different Transformer backbone models. For example, in the image classification task, GFSA improves the Top-1 accuracy of DeiT-S by 1.63%; in the natural language understanding task, GFSA improves the performance of RoBERTa on the CoLA dataset by 3.77%.

### Conclusion

By redesigning the self-attention mechanism from the perspective of graph signal processing, the paper proposes the graph-filter-based self-attention (GFSA), which mitigates the oversmoothing problem in Transformer models and achieves performance improvements on multiple tasks.
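The filter described under Proposed Method can be illustrated with a small NumPy sketch. This is not the authors' implementation: the coefficient names `w0`, `w1`, `wK`, the helper names, and the single-head random setup are assumptions made for illustration. The summary only states that GFSA combines an identity term, \(\bar{A}\), and \(\bar{A}^K\), with \(\bar{A}^K\) approximated as \(\bar{A} + (K - 1)(\bar{A}^2 - \bar{A})\).

```python
import numpy as np


def softmax_attention(X, Wq, Wk):
    """Row-stochastic attention matrix softmax(Q K^T / sqrt(d))."""
    queries, keys = X @ Wq, X @ Wk
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


def gfsa_filter(A, K, w0, w1, wK):
    """Identity term + A term + approximated A^K term.

    The coefficients w0, w1, wK are hypothetical names; here they are
    plain floats rather than learned parameters.
    """
    n = A.shape[0]
    # Approximation quoted in the summary: A^K ~ A + (K - 1)(A^2 - A).
    # It needs only one extra matrix product, whatever the value of K.
    A_K_approx = A + (K - 1) * (A @ A - A)
    return w0 * np.eye(n) + w1 * A + wK * A_K_approx


# Toy single-head example: 4 tokens with 8-dimensional features.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))

A = softmax_attention(X, Wq, Wk)
H = gfsa_filter(A, K=3, w0=0.5, w1=1.0, wK=-0.2)

# The filter takes the place of A in the usual product A @ V.
out = H @ (X @ Wv)
```

Two properties of the sketch are easy to check. For `K = 2` the approximation is exact, since \(\bar{A} + (\bar{A}^2 - \bar{A}) = \bar{A}^2\). Because every row of `A` sums to 1, every row of the filter sums to `w0 + w1 + wK`. The single extra product `A @ A` is consistent with the abstract's remark that GFSA's complexity is slightly larger than that of the original self-attention.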

Graph Convolutions Enrich the Self-Attention in Transformers!

SignGT: Signed Attention-based Graph Transformer for Graph Representation Learning

Hybrid Focal and Full-Range Attention Based Graph Transformers

Full-Attention Driven Graph Contrastive Learning: with Effective Mutual Information Insight

Self-Attention in Colors: Another Take on Encoding Graph Structure in Transformers

SGFormer: Simplifying and Empowering Transformers for Large-Graph Representations

Graph-Aware Transformer: Is Attention All Graphs Need?

What Improves the Generalization of Graph Transformers? A Theoretical Dive into the Self-attention and Positional Encoding

Demystify Transformers & Convolutions in Modern Image Deep Networks

Adaptive Multi-Neighborhood Attention based Transformer for Graph Representation Learning

Sparse Graph Transformer with Contrastive Learning

Graph Transformers for Large Graphs

GTC: GNN-Transformer Co-contrastive Learning for Self-supervised Heterogeneous Graph Representation

Graph Transformer: Learning Better Representations for Graph Neural Networks.

Graph Transformers: A Survey

Transformers as Graph-to-Graph Models

Hierarchical Graph Transformer with Adaptive Node Sampling

GvT: A Graph-based Vision Transformer with Talking-Heads Utilizing Sparsity, Trained from Scratch on Small Datasets

Centered Self-Attention Layers

Rethinking Attention Mechanisms in Vision Transformers with Graph Structures