Efficient Graph Encoder Embedding for Large Sparse Graphs in Python

Xihan Qin, Cencheng Shen
2024-06-06
Abstract: Graph is a ubiquitous representation of data in various research fields, and graph embedding is a prevalent machine learning technique for capturing key features and generating fixed-sized attributes. However, most state-of-the-art graph embedding methods are computationally and spatially expensive. Recently, the Graph Encoder Embedding (GEE) has been shown as the fastest graph embedding technique and is suitable for a variety of network data applications. As real-world data often involves large and sparse graphs, the huge sparsity usually results in redundant computations and storage. To address this issue, we propose an improved version of GEE, sparse GEE, which optimizes the calculation and storage of zero entries in sparse matrices to enhance the running time further. Our experiments demonstrate that the sparse version achieves significant speedup compared to the original GEE with Python implementation for large sparse graphs, and sparse GEE is capable of processing millions of edges within minutes on a standard laptop.
Machine Learning
What problem does this paper attempt to address?
The paper addresses the computational and storage inefficiency of existing graph embedding methods, including the original Graph Encoder Embedding (GEE), on large sparse graphs. Specifically:

1. **Computational efficiency**: A sparse adjacency matrix consists mostly of zero entries. Methods that do not exploit this sparsity perform redundant computations on those zeros, which slows the algorithm down.
2. **Storage efficiency**: Methods that store the matrix without a sparse format also spend memory on the zero entries, and this cost becomes very large as the graph grows.

To solve these problems, the authors propose an improved version of GEE, **sparse GEE**, which reduces running time and storage by optimizing how sparse matrices are stored and computed. The specific improvements are:

- Using the **Compressed Sparse Row (CSR)** data structure to represent and compute the embedding matrix, which avoids storing and operating on zero entries.
- Using the **Dictionary of Keys (DOK)** data structure while constructing intermediate results, then converting them to CSR format for the subsequent calculations.

With these changes, sparse GEE performs significantly better on large sparse graphs, especially when additional options such as Laplacian normalization are enabled. The experiments show that sparse GEE can embed graphs with millions of edges within a few minutes on an ordinary laptop, and that it achieves an 86-fold speedup over the original GEE on the largest simulated data set.
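The DOK-then-CSR pattern described above can be illustrated with SciPy. The following is only a minimal sketch, not the authors' implementation. It assumes the commonly described GEE formulation in which the embedding is Z = A W, where A is the n x n adjacency matrix and W is an n x K matrix with W[i, k] = 1 / n_k when node i has label k (n_k being the size of class k). The function name `sparse_gee_sketch`, its parameters, and the edge-list layout are hypothetical choices made for this example, and every node is assumed to have a label in 0..K-1.

```python
import numpy as np
from scipy import sparse


def sparse_gee_sketch(edges, labels, n_classes, laplacian=False):
    """Illustrative sparse encoder embedding (hypothetical helper).

    edges:     rows of (source, target, weight); an undirected edge is
               assumed to be listed in both directions.
    labels:    integer class label in 0..n_classes-1 for every node.
    Returns a CSR matrix of shape (n_nodes, n_classes).
    """
    labels = np.asarray(labels, dtype=int)
    n = labels.shape[0]
    edges = np.asarray(edges, dtype=float)
    rows = edges[:, 0].astype(int)
    cols = edges[:, 1].astype(int)
    vals = edges[:, 2]

    # Adjacency matrix in CSR: only the nonzero entries are stored.
    A = sparse.csr_matrix((vals, (rows, cols)), shape=(n, n))

    # Build the weight matrix W entry by entry in DOK format, which is
    # convenient for incremental construction, then convert to CSR.
    counts = np.bincount(labels, minlength=n_classes)
    W = sparse.dok_matrix((n, n_classes), dtype=float)
    for i, k in enumerate(labels):
        W[int(i), int(k)] = 1.0 / counts[k]
    W = W.tocsr()

    if laplacian:
        # Symmetric normalization D^{-1/2} A D^{-1/2}, kept sparse.
        deg = np.asarray(A.sum(axis=1)).ravel()
        inv_sqrt = np.zeros_like(deg)
        nonzero = deg > 0
        inv_sqrt[nonzero] = 1.0 / np.sqrt(deg[nonzero])
        D = sparse.diags(inv_sqrt)
        A = D @ A @ D

    # Sparse-sparse product: zero entries are never touched.
    return (A @ W).tocsr()


# Toy graph: two disconnected edges, 0-1 in class 0 and 2-3 in class 1.
toy_edges = [(0, 1, 1.0), (1, 0, 1.0), (2, 3, 1.0), (3, 2, 1.0)]
toy_labels = [0, 0, 1, 1]
Z = sparse_gee_sketch(toy_edges, toy_labels, n_classes=2)
print(Z.toarray())
```

DOK is used here only for the construction step because it supports cheap single-entry assignment, while CSR is used for the matrix products because it supports fast row-wise arithmetic. For very large graphs, building W directly from coordinate arrays would avoid the Python loop; the loop is kept in this sketch to mirror the two-stage description above.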