Abstract: Although going deep has proven successful in many neural architectures, existing graph transformers are relatively shallow. In this work, we explore whether more layers are beneficial to graph transformers, and find that current graph transformers hit a bottleneck when trying to improve performance by increasing depth. Our further analysis reveals that deep graph transformers are limited by the vanishing capacity of global attention, which restricts the graph transformer from focusing on critical substructures and obtaining expressive features. To this end, we propose a novel graph transformer model named DeepGraph that explicitly employs substructure tokens in the encoded representation, and applies local attention on related nodes to obtain substructure-based attention encoding. Our model enhances the ability of the global attention to focus on substructures and promotes the expressiveness of the representations, addressing the limitation of self-attention as the graph transformer deepens. Experiments show that our method unblocks the depth limitation of graph transformers and results in state-of-the-art performance across various graph benchmarks with deeper models.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper explores the performance bottleneck that Graph Transformers hit as the number of layers increases, and proposes a new solution.

#### Main Research Content:

1. **Performance Bottleneck of Graph Transformers**: Existing Graph Transformers typically have few layers (fewer than 12). When the number of layers is increased, the performance improvement is limited. Specifically, performance decreases rather than increases beyond 12 layers.
2. **Theoretical Analysis**: The authors further find that deep Graph Transformers are constrained by the weakened capacity of the global attention mechanism, which leaves them unable to focus effectively on key substructures and obtain expressive features.
3. **Solution**: A new model named DeepGraph is proposed, which strengthens the focus on substructures by introducing a local attention mechanism. The model explicitly embeds substructures into the encoded representation and applies local attention on the relevant nodes, thereby enhancing the global attention's focus on substructures and improving the expressiveness of the representation.

#### Contributions:

1. Revealed the bottleneck that Graph Transformers encounter when the number of layers increases, and analyzed it theoretically and empirically from the perspective of attention capacity decaying with depth.
2. Proposed a substructure-based local attention mechanism that significantly enhances the focus on substructure features and the expressiveness of deep Graph Transformers.
3. Showed experimentally that the method breaks the depth limitation of Graph Transformers and achieves the best performance on various graph benchmarks.

Through this research, the paper aims to address the limited performance gains of Graph Transformers as depth increases and provides an effective solution.
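The substructure-based local attention described above can be illustrated with a small masking sketch. This is not the authors' implementation: the function names `build_local_mask` and `masked_attention`, the choice to let node tokens attend globally, and the omission of learned projections are all assumptions made for illustration. The only idea taken from the text is that each substructure token is appended to the sequence and restricted to attend to its related nodes.

```python
import numpy as np


def build_local_mask(num_nodes, substructures):
    """Boolean mask over the sequence [node tokens | substructure tokens].

    mask[i, j] is True when token i may attend to token j.
    Assumption (illustrative): node tokens attend to every token, while each
    substructure token attends only to itself and to its member nodes.
    """
    total = num_nodes + len(substructures)
    mask = np.zeros((total, total), dtype=bool)
    mask[:num_nodes, :] = True  # global attention for node tokens
    for s, members in enumerate(substructures):
        row = num_nodes + s
        mask[row, row] = True            # token sees itself
        mask[row, list(members)] = True  # local attention on related nodes
    return mask


def masked_attention(x, mask):
    """Single-head scaled dot-product self-attention without learned weights."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)  # block disallowed pairs
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights


# Hypothetical toy graph: 4 nodes and 2 substructures given as node-index lists.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 8))  # 4 node tokens + 2 substructure tokens
mask = build_local_mask(4, [[0, 1, 2], [2, 3]])
encoded, attn = masked_attention(tokens, mask)
```

In this sketch, the attention row of each substructure token places zero weight on nodes outside its substructure, which is one simple way to force a token to summarize a specific local pattern. How DeepGraph samples substructures and encodes them is not specified in this summary and is left out here.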

Are More Layers Beneficial to Graph Transformers?
