GraphMaker: Can Diffusion Models Generate Large Attributed Graphs?

Mufei Li, Eleonora Kreačić, Vamsi K. Potluru, Pan Li
2024-10-16
Abstract: Large-scale graphs with node attributes are increasingly common in various real-world applications. Creating synthetic, attribute-rich graphs that mirror real-world examples is crucial, especially for sharing graph data for analysis and developing learning models when original data is restricted to be shared. Traditional graph generation methods are limited in their capacity to handle these complex structures. Recent advances in diffusion models have shown potential in generating graph structures without attributes and smaller molecular graphs. However, these models face challenges in generating large attributed graphs due to the complex attribute-structure correlations and the large size of these graphs. This paper introduces a novel diffusion model, GraphMaker, specifically designed for generating large attributed graphs. We explore various combinations of node attribute and graph structure generation processes, finding that an asynchronous approach more effectively captures the intricate attribute-structure correlations. We also address scalability issues through edge mini-batching generation. To demonstrate the practicality of our approach in graph data dissemination, we introduce a new evaluation pipeline. The evaluation demonstrates that synthetic graphs generated by GraphMaker can be used to develop competitive graph machine learning models for the tasks defined over the original graphs without actually accessing these graphs, while many leading graph generation methods fall short in this evaluation.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the challenges of generating large-scale attributed graphs. Specifically:

1. **Complex Attribute-Structure Correlation**: In large-scale attributed graphs, node attributes and graph structure are correlated in complex ways. Existing generation methods handle these structures poorly and fail to capture the subtle relationships between attributes and structure.
2. **Scalability of Generating Large-Scale Graphs**: The number of candidate edges the generation model must handle grows quadratically with the number of nodes, which poses a major scalability challenge.
3. **Quality Evaluation of Generated Graphs**: How to evaluate the quality of generated graphs remains an open problem. Traditional evaluation relies mainly on high-level statistics such as the node degree distribution and the clustering coefficient. Early statistical models already capture these statistics easily, so they are not sufficient for a comprehensive evaluation.

To solve these problems, the paper proposes a new diffusion model, GraphMaker, designed specifically for generating large-scale attributed graphs. It addresses the problems in the following ways:

- **Asynchronous Generation Process**: Node attributes and graph structure are denoised separately rather than simultaneously, which captures the complex correlations between them more effectively.
- **Scalability Strategy**: To make large-scale graph generation tractable, the paper adopts an edge mini-batch generation strategy and designs a new Message Passing Neural Network (MPNN) to encode the data efficiently.
- **Fine-Grained Evaluation Method**: The paper proposes an evaluation protocol based on machine learning models. Models are trained on the generated graphs and tested on the original graphs, and their performance serves as the measure of generation quality. This examines how useful the generated graphs are in practical applications at a finer granularity than summary statistics do.

GraphMaker is evaluated on multiple real-world large-scale graph datasets, and the results show that it is significantly superior to existing methods at generating large-scale graphs with realistic attributes.
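
The quadratic growth in candidate edges and the idea of processing them in mini-batches can be illustrated with a small sketch. It is not GraphMaker's implementation, which the summary above does not describe; the function names `num_candidate_edges` and `edge_minibatches`, the random shuffling, and the batch size are all assumptions made for illustration. The sketch assumes an undirected graph without self-loops, so a graph with N nodes has N(N-1)/2 candidate node pairs.

```python
import numpy as np


def num_candidate_edges(num_nodes: int) -> int:
    """Number of unordered node pairs (undirected graph, no self-loops)."""
    return num_nodes * (num_nodes - 1) // 2


def edge_minibatches(num_nodes: int, batch_size: int, seed: int = 0):
    """Yield (src, dst) index arrays that cover every unordered pair exactly once.

    Hypothetical helper: it materializes all pairs up front for simplicity,
    which is itself O(N^2) memory. A real large-graph pipeline would likely
    enumerate pairs lazily instead.
    """
    rng = np.random.default_rng(seed)
    src, dst = np.triu_indices(num_nodes, k=1)  # all pairs with src < dst
    order = rng.permutation(src.shape[0])       # shuffle the pair order
    for start in range(0, order.shape[0], batch_size):
        idx = order[start:start + batch_size]
        yield src[idx], dst[idx]


# Each batch would be scored by the denoising network on its own,
# so peak memory depends on batch_size rather than on N^2.
for src, dst in edge_minibatches(num_nodes=6, batch_size=4):
    print(list(zip(src.tolist(), dst.tolist())))
```

Under these assumptions, only one batch of node pairs needs to be held by the edge predictor at a time, while the union of all batches still covers the full set of candidate edges.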