Taxonomy of Benchmarks in Graph Representation Learning

Renming Liu,Semih Cantürk,Frederik Wenkel,Sarah McGuire,Xinyi Wang,Anna Little,Leslie O'Bray,Michael Perlmutter,Bastian Rieck,Matthew Hirn,Guy Wolf,Ladislav Rampášek
DOI: https://doi.org/10.48550/arXiv.2206.07729
2022-11-30
Abstract: Graph Neural Networks (GNNs) extend the success of neural networks to graph-structured data by accounting for their intrinsic geometry. While extensive research has been done on developing GNN models with superior performance according to a collection of graph representation learning benchmarks, it is currently not well understood what aspects of a given model are probed by them. For example, to what extent do they test the ability of a model to leverage graph structure vs. node features? Here, we develop a principled approach to taxonomize benchmarking datasets according to a $\textit{sensitivity profile}$ that is based on how much GNN performance changes due to a collection of graph perturbations. Our data-driven analysis provides a deeper understanding of which benchmarking data characteristics are leveraged by GNNs. Consequently, our taxonomy can aid in selection and development of adequate graph benchmarks, and better informed evaluation of future GNN methods. Finally, our approach and implementation in $\texttt{GTaxoGym}$ package are extendable to multiple graph prediction task types and future datasets.
Machine Learning
What problem does this paper attempt to address?
The paper addresses the fact that it is not well understood which aspects of a graph neural network (GNN) are actually probed by current graph representation learning (GRL) benchmark datasets. In particular, it is unclear to what extent a given benchmark tests a model's ability to exploit graph structure as opposed to node features. The authors propose a taxonomy of graph datasets based on their sensitivity to specific types of perturbations, with the aim of understanding the characteristics of different datasets and how these characteristics affect GNN performance. The taxonomy helps in selecting and developing appropriate graph benchmark datasets and supports better-informed evaluation of future GNN methods.

### Main contributions of the paper:

1. **Propose a graph dataset taxonomy framework**: The framework can be extended to new datasets and used to evaluate additional graph/task properties.
2. **Provide the first taxonomy of GNN (and GRL) benchmark datasets**: The datasets are drawn from TUDataset, OGB, and other sources.
3. **Provide insights into existing datasets through the taxonomy results**: These insights guide dataset selection for future GNN model benchmarking.

### Method overview:

- **Node feature perturbation**: Setting node features to constants, replacing them with one-hot encodings of node degrees, replacing them with random features, etc.
- **Graph structure perturbation**: Deleting all edges, making the graph fully connected, randomly rewiring edges, fragmenting the graph, etc.
- **Data-driven classification method**: Hierarchical cluster analysis (such as Ward's method) groups datasets by their sensitivity profiles. A dataset's sensitivity profile is built by comparing GNN performance on each perturbed version of the dataset with performance on the original dataset.
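
The method overview above can be illustrated with a minimal sketch, which is not the GTaxoGym implementation. It assumes dense NumPy adjacency matrices and covers only a few of the listed perturbations (random rewiring and fragmentation are omitted). All function names are hypothetical. The log2 ratio used for the sensitivity profile is an assumption about how the performance change is measured, and the profile values being clustered are made-up placeholders, not results from the paper.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def constant_features(num_nodes):
    """Node feature perturbation: every node gets the same constant feature."""
    return np.ones((num_nodes, 1))


def degree_one_hot(adj):
    """Node feature perturbation: one-hot encoding of each node's degree."""
    deg = adj.sum(axis=1).astype(int)
    return np.eye(deg.max() + 1)[deg]


def random_features(num_nodes, dim, rng):
    """Node feature perturbation: Gaussian noise in place of the real features."""
    return rng.standard_normal((num_nodes, dim))


def remove_all_edges(adj):
    """Structure perturbation: delete every edge."""
    return np.zeros_like(adj)


def make_fully_connected(adj):
    """Structure perturbation: connect every pair of distinct nodes."""
    n = adj.shape[0]
    return (np.ones((n, n)) - np.eye(n)).astype(adj.dtype)


def sensitivity_profile(original_score, perturbed_scores):
    """One entry per perturbation; log2 ratio is an assumed choice of measure."""
    return np.log2(np.asarray(perturbed_scores, dtype=float) / original_score)


# Toy path graph 0-1-2-3 with degrees [1, 2, 2, 1].
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]])
deg_feats = degree_one_hot(adj)  # shape (4, 3)

# Placeholder sensitivity profiles (rows: hypothetical datasets,
# columns: hypothetical perturbations). These are NOT values from the paper.
profiles = np.array([[-1.0, -0.9, 0.0],
                     [-0.9, -1.0, -0.1],
                     [0.0, -0.1, -1.0],
                     [-0.1, 0.0, -0.9]])

# Ward linkage on the profiles, then cut the tree into two clusters.
tree = linkage(profiles, method="ward")
clusters = fcluster(tree, t=2, criterion="maxclust")
```

In this toy setup the first two placeholder rows fall into one cluster and the last two into another, which is the kind of grouping the taxonomy relies on: datasets whose performance reacts similarly to the same perturbations end up together.
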
### Results:

- **Classification of inductive task datasets**: According to the sensitivity of datasets to node feature and graph structure perturbations, 24 inductive task datasets are divided into three main categories.
- **Classification of transductive task datasets**: A similar classification is carried out for 25 transductive task datasets, which include citation networks, social networks, etc.

### Main findings:

- **Differences between datasets**: Even for datasets generated in similar fields, there are significant differences in their dependence on node features and graph structures.
- **Limitations of synthetic datasets**: The existing synthetic datasets cannot fully represent the complexity of real-world data. It is recommended to use a combination of real datasets and synthetic datasets when evaluating GNN performance.
- **Selection of representative datasets**: A representative subset covering dataset heterogeneity is proposed to comprehensively evaluate the performance of GNN models.

### Formula examples:

- **Graph Laplacian matrix**: \[ L = D - M \] where \( D \) is the diagonal degree matrix and \( M \) is the adjacency matrix.
- **Symmetrically normalized graph Laplacian matrix**: \[ N = D^{-\frac{1}{2}} L D^{-\frac{1}{2}} = I - D^{-\frac{1}{2}} M D^{-\frac{1}{2}} \]
- **Node features after band-pass filtering**: \[ X_{\text{band}} = \tilde{\Phi} I_{\text{band}} \tilde{\Phi}^T X \] where \( I_{\text{band}} \) is a diagonal matrix whose diagonal elements are 1 or 0, depending on whether the corresponding eigenvalues are within the specified frequency band.

Through these methods and analyses, the paper provides researchers in the field of graph representation learning with a systematic tool for better understanding and selecting benchmark datasets.
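
The formula examples above can be checked numerically with a short NumPy sketch. It assumes that the columns of \( \tilde{\Phi} \) are the eigenvectors of the normalized Laplacian \( N \) and that a band is given as an inclusive eigenvalue interval `[low, high]`; both are assumptions, since the summary does not define them. The function names are hypothetical and not taken from GTaxoGym.

```python
import numpy as np


def laplacian(M):
    """L = D - M for a symmetric adjacency matrix M."""
    return np.diag(M.sum(axis=1)) - M


def normalized_laplacian(M):
    """N = I - D^{-1/2} M D^{-1/2}."""
    deg = M.sum(axis=1)
    # Isolated nodes (degree 0) get a zero scaling factor to avoid dividing by zero.
    inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(np.maximum(deg, 1e-12)), 0.0)
    return np.eye(M.shape[0]) - inv_sqrt[:, None] * M * inv_sqrt[None, :]


def band_pass(X, M, low, high):
    """X_band = Phi I_band Phi^T X, keeping eigenvalues of N inside [low, high]."""
    eigvals, Phi = np.linalg.eigh(normalized_laplacian(M))
    I_band = np.diag(((eigvals >= low) & (eigvals <= high)).astype(float))
    return Phi @ I_band @ Phi.T @ X


# Toy 4-cycle graph; its normalized Laplacian has eigenvalues 0, 1, 1, 2.
M = np.array([[0.0, 1.0, 0.0, 1.0],
              [1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0],
              [1.0, 0.0, 1.0, 0.0]])
X = np.array([[1.0], [2.0], [3.0], [4.0]])

# The low band keeps only the constant eigenvector, so every node
# receives the mean of the feature column.
X_low = band_pass(X, M, -0.1, 0.5)
X_mid = band_pass(X, M, 0.5, 1.5)
X_high = band_pass(X, M, 1.5, 2.1)
```

Because the three bands together cover the whole spectrum without overlapping on any eigenvalue of this graph, `X_low + X_mid + X_high` recovers `X` up to floating-point error. The band edges are padded slightly beyond 0 and 2 so that rounding in the eigenvalue computation does not drop the extreme eigenvalues.
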