Abstract: Graph generative models are a highly active branch of machine learning. Given the steady development of new models of ever-increasing complexity, it is necessary to provide a principled way to evaluate and compare them. In this paper, we enumerate the desirable criteria for such a comparison metric and provide an overview of the status quo of graph generative model comparison in use today, which predominantly relies on the maximum mean discrepancy (MMD). We perform a systematic evaluation of MMD in the context of graph generative model comparison, highlighting some of the challenges and pitfalls researchers may inadvertently encounter. After conducting a thorough analysis of the behaviour of MMD on synthetically generated perturbed graphs as well as on recently proposed graph generative models, we are able to provide a suitable procedure to mitigate these challenges and pitfalls. We aggregate our findings into a list of practical recommendations for researchers to use when evaluating graph generative models.

What problem does this paper attempt to address?

The main problem this paper addresses is that current evaluation practice for graph generative models is flawed and insufficient. Specifically:

1. **Selection of evaluation methods**: Most current research relies on maximum mean discrepancy (MMD) to compare graph generative models, but the behavior of MMD in this setting has not been studied systematically.
2. **Limitations of MMD**: The behavior of MMD depends heavily on the choice of kernel function and its parameters. Different choices can lead to completely different model rankings, which makes evaluation results unstable and difficult to interpret.
3. **Lack of standard procedures**: Researchers have no unified standard for choosing kernel functions and parameters, so results from different studies are difficult to compare directly.

To address these problems, the authors carry out the following work:

- **Systematic evaluation of MMD**: Using synthetic data and existing graph generative models, the authors analyze in detail how MMD behaves under different conditions, revealing its potential problems and pitfalls.
- **Recommendations for improvement**: Based on the experimental results, the authors provide a series of practical suggestions to help researchers use MMD more soundly when evaluating graph generative models.

### Specific problems and solutions

1. **High sensitivity of MMD to the choice of kernel functions and parameters**:
   - The authors show how different kernel functions (such as EMD, TV, RBF) and parameters (such as σ values) influence MMD results, demonstrating that a poor choice leads to unreasonable model rankings.
   - **Solution**: Use more robust kernel functions (such as the RBF or Laplacian kernel), and determine an appropriate parameter range through experiments.
2. **Lack of an inherent scale in MMD**:
   - The current practice is to report the raw MMD value, which makes it difficult to judge how large the performance differences between models really are.
   - **Solution**: Compute the MMD between the test set and the training set as a reference benchmark that provides a meaningful scale.
3. **Arbitrariness in the choice of kernel functions**:
   - Different studies use different kernel functions and lack clear selection criteria.
   - **Solution**: Use kernel functions that are both computationally efficient and effective (such as the RBF, Laplacian, or linear kernel), and avoid the EMD kernel because of its high computational complexity.
4. **Choice of descriptor functions**:
   - The choice of descriptor function affects the behavior of MMD. Common descriptors include the degree distribution, the clustering coefficient, and the Laplacian spectral histogram.
   - **Solution**: Choose descriptor functions appropriate to the specific application domain and adjust them as needed.

Through these efforts, the authors aim to provide a more principled and reliable method for evaluating graph generative models, thereby supporting further progress in this field.
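The sensitivity to σ and the idea of a train-versus-test reference scale can be illustrated with a small sketch. This is not the authors' code: it assumes a biased (V-statistic) estimate of squared MMD with an RBF kernel, uses normalised degree histograms as the descriptor, and stands in for real models with Erdős–Rényi random graphs. The helper names (`rbf_kernel`, `mmd_squared`, `degree_histogram`, `random_graph`, `describe`) and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np


def rbf_kernel(x, y, sigma):
    """Gaussian (RBF) kernel matrix between the rows of x and the rows of y."""
    sq_dists = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))


def mmd_squared(x, y, sigma):
    """Biased (V-statistic) estimate of squared MMD between samples x and y."""
    k_xx = rbf_kernel(x, x, sigma).mean()
    k_yy = rbf_kernel(y, y, sigma).mean()
    k_xy = rbf_kernel(x, y, sigma).mean()
    return k_xx + k_yy - 2.0 * k_xy


def degree_histogram(adj, n_bins):
    """Normalised degree histogram of a graph given as a dense adjacency matrix."""
    degrees = adj.sum(axis=1).astype(int)
    hist = np.bincount(np.minimum(degrees, n_bins - 1), minlength=n_bins)
    return hist / hist.sum()


def random_graph(n_nodes, p, rng):
    """Undirected Erdos-Renyi graph without self-loops (stand-in for a model)."""
    upper = np.triu(rng.random((n_nodes, n_nodes)) < p, k=1)
    return (upper | upper.T).astype(int)


def describe(p, n_graphs, rng, n_nodes=20):
    """Stack degree-histogram descriptors for n_graphs random graphs."""
    return np.stack(
        [degree_histogram(random_graph(n_nodes, p, rng), n_nodes)
         for _ in range(n_graphs)]
    )


rng = np.random.default_rng(0)
train = describe(0.30, 30, rng)    # "training" graphs
test = describe(0.30, 30, rng)     # held-out graphs from the same distribution
model_a = describe(0.35, 30, rng)  # hypothetical model close to the data
model_b = describe(0.60, 30, rng)  # hypothetical model far from the data

# Sweep sigma: raw MMD values change by orders of magnitude, so each value is
# reported next to the train-vs-test reference computed with the same sigma.
for sigma in (0.01, 0.1, 1.0, 10.0):
    reference = mmd_squared(train, test, sigma)
    mmd_a = mmd_squared(train, model_a, sigma)
    mmd_b = mmd_squared(train, model_b, sigma)
    print(f"sigma={sigma:<5} reference={reference:.4g} "
          f"model_a={mmd_a:.4g} model_b={mmd_b:.4g}")
```

Reporting each model's MMD alongside the reference value for the same σ is what gives the numbers a scale: a model whose MMD is close to the train-versus-test value is roughly indistinguishable from held-out data under that kernel and descriptor, whereas the raw value alone says little.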

Evaluation Metrics for Graph Generative Models: Problems, Pitfalls, and Practical Solutions
