Jeongwhan Choi, Hyowon Wi, Jayoung Kim, Yehjin Shin, Kookjin Lee, Nathaniel Trask, Noseong Park
Abstract: Transformers, renowned for their self-attention mechanism, have achieved state-of-the-art performance across various tasks in natural language processing, computer vision, time-series modeling, etc. However, one of the challenges with deep Transformer models is the oversmoothing problem, where representations across layers converge to indistinguishable values, leading to significant performance degradation. We interpret the original self-attention as a simple graph filter and redesign it from a graph signal processing (GSP) perspective. We propose a graph-filter-based self-attention (GFSA) to learn a general yet effective one, whose complexity, however, is slightly larger than that of the original self-attention mechanism. We demonstrate that GFSA improves the performance of Transformers in various fields, including computer vision, natural language processing, graph-level tasks, speech recognition, and code classification.
What problem does this paper attempt to address?
### The problem the paper attempts to solve
This paper targets the **oversmoothing problem** in Transformer models: as a Transformer gets deeper, the representations across layers converge to indistinguishable values, which significantly degrades performance. To mitigate this, the paper redesigns the self-attention mechanism from the perspective of graph signal processing (GSP) and proposes a graph-filter-based self-attention (GFSA).
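The convergence described above can be illustrated with a toy sketch. This is not the paper's experiment: it assumes a single fixed attention matrix that is applied repeatedly, with no value projections, residual connections, or feed-forward layers, and every name in it (`softmax_rows`, `token_spread`, `spread_per_layer`) is made up for this illustration. Because a softmax attention matrix is row-stochastic with strictly positive entries, each application replaces every token by a weighted average of all tokens, so the gap between tokens cannot grow.

```python
import math
import random


def softmax_rows(scores):
    """Row-wise softmax, giving a row-stochastic attention matrix."""
    out = []
    for row in scores:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def token_spread(x):
    """Largest gap between any two tokens within a single feature column."""
    return max(max(col) - min(col) for col in zip(*x))


def spread_per_layer(num_tokens=6, dim=4, layers=8, seed=0):
    """Apply one fixed attention matrix repeatedly and record the spread."""
    rng = random.Random(seed)
    scores = [[rng.gauss(0, 1) for _ in range(num_tokens)]
              for _ in range(num_tokens)]
    attn = softmax_rows(scores)  # toy assumption: same matrix at every layer
    x = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_tokens)]
    spreads = [token_spread(x)]
    for _ in range(layers):
        x = matmul(attn, x)  # each token becomes an average of all tokens
        spreads.append(token_spread(x))
    return spreads


if __name__ == "__main__":
    for depth, spread in enumerate(spread_per_layer()):
        print(depth, spread)
```

Running the script prints one spread value per depth; the values never increase from one layer to the next, which is the low-pass behaviour that the graph-filter view attributes to plain self-attention.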
### Detailed Explanation
1. **Background and Motivation**:
- **The Success and Limitations of Transformers**: Thanks to its self-attention mechanism, the Transformer has achieved state-of-the-art performance in natural language processing, computer vision, time-series modeling, etc. However, deep Transformer models suffer from oversmoothing: representations across layers converge to indistinguishable values, which degrades performance.
- **The Graph Filter Perspective of the Self-Attention Mechanism**: The paper interprets the original self-attention as a simple graph filter and redesigns it from the perspective of graph signal processing.
2. **Proposed Method**:
- **Graph-Filter-based Self-Attention (GFSA)**: GFSA consists of an identity term and two matrix polynomial terms, namely \(\bar{A}\) and \(\bar{A}^K\). Here, \(\bar{A}\) is the learned attention matrix, and \(K\) is a hyperparameter.
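- **Approximation of High-Order Terms**: To reduce the computational cost, the paper approximates \(\bar{A}^K\) with a first-order Taylor expansion. Reading \(\bar{A}^x\) as a function of the exponent \(x\), expanding around \(x = 1\), and replacing the derivative there with the difference \(\bar{A}^2-\bar{A}\) gives: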
\[
\bar{A}^K \approx \bar{A}+(K - 1)(\bar{A}^2-\bar{A})
\]
3. **Experimental Results**:
- **Multi-task Verification**: The paper verifies the effectiveness of GFSA in multiple fields, including natural language understanding, image classification, graph-level tasks, speech recognition, and code classification.
- **Performance Improvement**: GFSA improves performance on different Transformer backbone models. For example, in the image classification task, GFSA improves the Top-1 accuracy of DeiT-S by 1.63%; in the natural language understanding task, GFSA improves the performance of RoBERTa on the CoLA dataset by 3.77%.
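The filter described under "Proposed Method" can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the coefficient names `w0`, `w1`, and `wk` are placeholders introduced here for the weights on the identity, \(\bar{A}\), and \(\bar{A}^K\) terms, the attention matrix is a hand-written row-stochastic example, and plain Python lists stand in for tensors.

```python
def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def approx_power(attn, k):
    """Approximate attn**k as attn + (k - 1) * (attn @ attn - attn)."""
    sq = matmul(attn, attn)
    return [[a + (k - 1) * (s - a) for a, s in zip(row_a, row_s)]
            for row_a, row_s in zip(attn, sq)]


def gfsa_filter(attn, w0, w1, wk, k):
    """Hypothetical filter: w0 * I + w1 * attn + wk * approx(attn**k)."""
    high = approx_power(attn, k)
    n = len(attn)
    return [[(w0 if i == j else 0.0) + w1 * attn[i][j] + wk * high[i][j]
             for j in range(n)]
            for i in range(n)]


# Hand-written row-stochastic matrix standing in for a softmax attention map.
attn = [[0.5, 0.3, 0.2],
        [0.1, 0.6, 0.3],
        [0.2, 0.2, 0.6]]

# Illustrative coefficient values only; they are not taken from the paper.
filt = gfsa_filter(attn, w0=0.5, w1=1.0, wk=-0.3, k=3)
```

Two properties are easy to check by hand. For \(K = 2\) the approximation reduces to \(\bar{A}^2\) exactly, and for any \(K\) it needs only one extra matrix product rather than \(K - 1\), which fits the summary's point that the approximation is there to reduce computational cost.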
### Conclusion
By redesigning the self-attention mechanism from the perspective of graph signal processing, the paper proposes the graph-filter-based self-attention (GFSA), which mitigates the oversmoothing problem in Transformer models and improves performance on multiple tasks, at the cost of slightly higher complexity than the original self-attention.