Are More Layers Beneficial to Graph Transformers?

Haiteng Zhao, Shuming Ma, Dongdong Zhang, Zhi-Hong Deng, Furu Wei
2023-03-01
Abstract: Despite that going deep has proven successful in many neural architectures, the existing graph transformers are relatively shallow. In this work, we explore whether more layers are beneficial to graph transformers, and find that current graph transformers suffer from the bottleneck of improving performance by increasing depth. Our further analysis reveals the reason is that deep graph transformers are limited by the vanishing capacity of global attention, restricting the graph transformer from focusing on the critical substructure and obtaining expressive features. To this end, we propose a novel graph transformer model named DeepGraph that explicitly employs substructure tokens in the encoded representation, and applies local attention on related nodes to obtain substructure based attention encoding. Our model enhances the ability of the global attention to focus on substructures and promotes the expressiveness of the representations, addressing the limitation of self-attention as the graph transformer deepens. Experiments show that our method unblocks the depth limitation of graph transformers and results in state-of-the-art performance across various graph benchmarks with deeper models.
Machine Learning
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper examines why Graph Transformers stop improving as layers are added and proposes a model that removes this bottleneck.

#### Main Research Content:

1. **Performance Bottleneck of Graph Transformers**: Existing Graph Transformers are typically shallow (fewer than 12 layers). Adding layers yields limited improvement; specifically, performance decreases rather than increases beyond 12 layers.
2. **Theoretical Analysis**: The authors find that deep Graph Transformers are constrained by the diminishing capacity of global attention as depth grows, which prevents the model from focusing on key substructures and obtaining expressive features.
3. **Solution**: The authors propose a new model, DeepGraph, which adds explicit substructure tokens to the encoded representation and applies local attention over the nodes related to each substructure. This strengthens the global attention's focus on substructures and improves the expressiveness of the representation.

#### Contributions:

1. Revealed the bottleneck that Graph Transformers encounter as the number of layers increases, with theoretical and empirical analysis showing that attention capacity decays with depth.
2. Proposed a substructure-based local attention mechanism that strengthens the focus on substructure features and the expressiveness of deep Graph Transformers.
3. Showed experimentally that the method removes the depth limitation of Graph Transformers and achieves state-of-the-art performance across various graph benchmarks.

In short, the paper identifies why additional depth stops helping Graph Transformers and offers a substructure-based remedy.
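
The idea of combining substructure tokens with local attention can be illustrated with a small masking sketch. This is not the authors' implementation: the token layout (node tokens first, then one token per substructure), the rule that node tokens attend to everything, and the names `build_attention_mask` and `masked_attention` are all assumptions made for illustration, and the attention omits learned projections.

```python
import numpy as np

def build_attention_mask(num_nodes, substructures):
    """Boolean mask of shape (T, T), with T = num_nodes + len(substructures).

    mask[i, j] is True when token i may attend to token j.
    Assumed layout (hypothetical): node tokens come first, followed by
    one token per substructure.
    """
    total = num_nodes + len(substructures)
    mask = np.zeros((total, total), dtype=bool)
    # Assumption: node tokens keep global attention over all tokens.
    mask[:num_nodes, :] = True
    # Each substructure token attends locally: to itself and its member nodes.
    for k, members in enumerate(substructures):
        row = num_nodes + k
        mask[row, row] = True
        mask[row, list(members)] = True
    return mask

def masked_attention(x, mask):
    """Single-head dot-product attention without learned projections."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    # Disallowed pairs get -inf so their softmax weight is exactly zero.
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights

# Toy graph: 4 nodes and two hypothetical substructures (node index sets).
mask = build_attention_mask(4, [[0, 1, 2], [2, 3]])
rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 8))  # 4 node tokens + 2 substructure tokens
output, weights = masked_attention(tokens, mask)
```

In this sketch each substructure token aggregates only from its own nodes, so its representation stays tied to that substructure regardless of how many other tokens the sequence contains.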