Evaluating Graph Neural Networks for Link Prediction: Current Pitfalls and New Benchmarking

Juanhui Li, Harry Shomer, Haitao Mao, Shenglai Zeng, Yao Ma, Neil Shah, Jiliang Tang, Dawei Yin
2023-11-19
Abstract: Link prediction attempts to predict whether an unseen edge exists based on only a portion of edges of a graph. A flurry of methods have been introduced in recent years that attempt to make use of graph neural networks (GNNs) for this task. Furthermore, new and diverse datasets have also been created to better evaluate the effectiveness of these new models. However, multiple pitfalls currently exist that hinder our ability to properly evaluate these new methods. These pitfalls mainly include: (1) Lower than actual performance on multiple baselines, (2) A lack of a unified data split and evaluation metric on some datasets, and (3) An unrealistic evaluation setting that uses easy negative samples. To overcome these challenges, we first conduct a fair comparison across prominent methods and datasets, utilizing the same dataset and hyperparameter search settings. We then create a more practical evaluation setting based on a Heuristic Related Sampling Technique (HeaRT), which samples hard negative samples via multiple heuristics. The new evaluation setting helps promote new challenges and opportunities in link prediction by aligning the evaluation with real-world situations. Our implementation and data are available at https://github.com/Juanhui28/HeaRT
Machine Learning, Social and Information Networks
What problem does this paper attempt to address?
This paper addresses how graph neural networks (GNNs) are evaluated on link prediction tasks. It identifies several major problems in current evaluation practice and proposes a new evaluation setting to address them. The main problems are:

1. **Underestimation of performance**: The reported performance of some models is lower than what they can actually achieve. For example, standard GNNs perform poorly because of improper hyperparameter tuning. With appropriate tuning, these models improve significantly. For some methods (such as Neo-GNN), the improvement can reach 8.5 percentage points.
2. **Lack of a unified evaluation setting**: Different studies use different dataset splits and evaluation metrics, which makes a fair comparison difficult. For example, Cora, Citeseer, and Pubmed are used with different training/validation/test split ratios and different evaluation metrics (such as AUC and MRR) across studies. In addition, some methods include validation edges during testing while others do not, which further complicates comparison.
3. **Unrealistic evaluation setting**: The current setting evaluates against randomly selected negative samples, which makes the task too easy and unlike real use. For example, when recommending friends in a social network, the goal is to recommend friends for a specific user u, not to score pairs of unrelated nodes. In addition, randomly selected negative samples usually have no common neighbors, so they are easy to classify and do not reflect how a model performs in practical applications.
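
The observation that random negatives rarely share neighbors can be checked on a toy graph. The sketch below is only an illustration, not the paper's code: the ring graph, the helper names (`ring_graph`, `random_negatives`, `share_with_common_neighbors`), and the sample size are all assumptions chosen to keep the example small.

```python
import random


def ring_graph(num_nodes):
    """Toy sparse graph (assumed for illustration): node i links to i-1 and i+1."""
    return {
        i: {(i - 1) % num_nodes, (i + 1) % num_nodes}
        for i in range(num_nodes)
    }


def random_negatives(adj, num_samples, seed=0):
    """Sample distinct node pairs at random, keeping only pairs that are not edges."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    negatives = set()
    while len(negatives) < num_samples:
        u, v = rng.sample(nodes, 2)
        if v not in adj[u]:
            # Store each undirected pair in a canonical order.
            negatives.add((min(u, v), max(u, v)))
    return sorted(negatives)


def share_with_common_neighbors(adj, pairs):
    """Fraction of pairs whose endpoints share at least one neighbor."""
    hits = sum(1 for u, v in pairs if adj[u] & adj[v])
    return hits / len(pairs)


adj = ring_graph(200)
negatives = random_negatives(adj, num_samples=100)
print(share_with_common_neighbors(adj, negatives))
```

In this ring, only 200 of the 19,700 non-edges connect nodes with a common neighbor, so a random sample is dominated by pairs that a common-neighbor heuristic can reject trivially.
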

To overcome these problems, the paper proposes the following solutions:

- **Reproducible and fair comparison**: Under the existing evaluation setting, the paper compares different models fairly on multiple common datasets. All models are tuned within the same hyperparameter range and evaluated with multiple evaluation metrics.
- **New evaluation setting (HeaRT)**: The paper proposes a new evaluation setting based on the Heuristic Related Sampling Technique (HeaRT). HeaRT makes the evaluation task more challenging by personalizing negative samples to each positive sample and by selecting harder negatives, which better simulates real-world situations.

Through these improvements, the paper aims to provide a more accurate and reliable evaluation method for link prediction and to promote further development of the field.
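
The idea of personalized hard negatives can be sketched as follows. This is a simplified illustration and not the HeaRT implementation: it assumes that "personalized" means each negative shares one endpoint with the positive edge, and it ranks candidates with a single heuristic (common-neighbor count), whereas the paper combines multiple heuristics that this summary does not specify. The function names and the toy graph are hypothetical.

```python
def hard_negatives(adj, pos_edge, k):
    """Return up to k non-edges that share one endpoint with pos_edge.

    Candidates are ranked by common-neighbor count, an assumed stand-in
    for the multiple heuristics that the paper combines.
    """
    u, v = pos_edge
    scored = []
    for anchor in (u, v):
        for w in adj:
            # Skip the positive edge's own endpoints and existing edges.
            if w in (u, v) or w in adj[anchor]:
                continue
            score = len(adj[anchor] & adj[w])
            scored.append((score, (anchor, w)))
    # Highest heuristic score first; ties broken by node ids for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [pair for _, pair in scored[:k]]


def reciprocal_rank(pos_score, neg_scores):
    """Reciprocal rank of a positive among its own negatives (ties count against it)."""
    rank = 1 + sum(1 for s in neg_scores if s >= pos_score)
    return 1.0 / rank


# Toy training graph; the held-out positive edge (0, 3) is not part of it.
adj = {
    0: {1, 2},
    1: {0, 2, 3},
    2: {0, 1, 3},
    3: {1, 2, 4},
    4: {3, 5},
    5: {4},
}
print(hard_negatives(adj, pos_edge=(0, 3), k=2))
```

Averaging `reciprocal_rank` over all positive edges, each scored against its own negatives, gives an MRR-style metric. The model scores passed to it would come from whichever link predictor is being evaluated.
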