Stronger Graph Transformer with Regularized Attention Scores

Eugene Ku
2024-03-22
Abstract: Graph Neural Networks are notorious for their memory consumption. A recent Transformer-based GNN called Graph Transformer (GT) has been shown to obtain superior performance when long-range dependencies exist. However, combining graph data with the Transformer architecture leads to a combinatorially worse memory issue. We propose a novel version of an "edge regularization technique" that alleviates the need for positional encoding and ultimately alleviates GT's out-of-memory issue. It is not clear whether applying edge regularization on top of positional encoding is helpful. However, it seems evident that applying our edge regularization technique stably improves GT's performance compared to GT without positional encoding.
Machine Learning
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper attempts to address two major issues in Graph Neural Networks (GNNs):

1. **Memory Consumption Issue**:
   - Graph Neural Networks (especially Graph Transformers based on Transformer architecture) face severe memory consumption issues when handling large-scale graph data. This is because the Transformer architecture itself has high memory requirements, and this problem becomes more severe when combined with graph data.
2. **Long-Distance Dependency Issue**:
   - Traditional Message Passing Neural Networks (MPNNs) struggle to capture long-distance dependencies in graph data due to oversquashing and oversmoothing issues. These problems limit the ability of MPNNs to build deep GNNs.

### Solutions

To address the above issues, the authors propose the following methods:

1. **Edge Regularization Technique**:
   - By introducing a new "edge regularization" technique, the computation process of Graph Transformers can be optimized without adding extra features or positional encodings, thereby reducing memory consumption.
   - Specifically, this technique caches the attention score matrix in each Graph Transformer layer and introduces an additional loss function during the gradient computation step, which takes the cached attention score matrix and the true adjacency matrix as inputs.
   - To prevent interference with the optimization of the main loss function, the authors truncate the gradient of the new loss function during backpropagation.
2. **Avoiding the Necessity of Positional Encoding**:
   - Traditional Graph Transformers usually require positional encoding to maintain graph structure information, but positional encoding significantly increases memory consumption. The authors' method aims to reduce or even eliminate the need for positional encoding through the edge regularization technique, further reducing memory consumption.
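
To make the edge regularization idea concrete, here is a minimal sketch of a loss that compares a cached attention score matrix with the true adjacency matrix. The summary does not state the exact form of the loss, so the binary cross-entropy used here is an assumption, as are the function names `attention_scores` and `edge_regularization_loss` and the toy 3-node graph. It is an illustration in plain Python, not the authors' implementation.

```python
import math


def attention_scores(q, k):
    """Raw scaled dot-product scores q_i . k_j / sqrt(d) for every node pair."""
    d = len(q[0])
    return [
        [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        for qi in q
    ]


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def edge_regularization_loss(scores, adjacency, eps=1e-12):
    """Mean binary cross-entropy between sigmoid(scores) and the adjacency matrix.

    ASSUMPTION: the choice of binary cross-entropy is illustrative only;
    the source says just that the loss takes both matrices as inputs.
    """
    n = len(scores)
    total = 0.0
    for i in range(n):
        for j in range(n):
            p = sigmoid(scores[i][j])
            a = adjacency[i][j]
            total -= a * math.log(p + eps) + (1 - a) * math.log(1 - p + eps)
    return total / (n * n)


# Hypothetical toy input: a 3-node path graph 0 - 1 - 2.
adjacency = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
# Hypothetical per-node query and key vectors (d = 2).
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# "Cache" the scores as a layer would, then score them against the graph.
cached_scores = attention_scores(q, k)
reg_loss = edge_regularization_loss(cached_scores, adjacency)
print(f"edge regularization loss: {reg_loss:.4f}")
```

The gradient truncation step is not shown because it needs an automatic differentiation framework. There, one would typically block the gradient of this auxiliary loss along the chosen path with a stop-gradient operation, but where exactly the authors cut the gradient is not specified in this summary.
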
### Experimental Validation

The authors conducted experiments on multiple datasets, including Peptides-func, Peptides-struct, and PascalVOC-SP from the Long Range Graph Benchmark. The experimental results show that:

- The edge regularization technique can steadily improve the performance of Graph Transformers without positional encoding.
- Compared to traditional Graph Transformers, this method shows significant performance improvement in certain tasks.
- However, in some cases, positional encoding and edge regularization may interfere with each other, leading to performance degradation.

### Application Research

As a supplementary study, the authors also applied GraphGPS to an event reconstruction task on a dataset containing long-distance dependencies (neutrino detection data from photomultiplier tubes). The experimental results show that GraphGPS performs excellently in this task, further supporting the importance of modeling long-distance dependencies.

### Conclusion

Although it is unclear whether the edge regularization technique can completely replace positional encoding, experimental results indicate that it can steadily improve the performance of Graph Transformers without positional encoding. Future research directions may include exploring more variants of regularization techniques and finding more effective ways to alleviate the memory consumption issue of Graph Transformers.