GraphAr: An Efficient Storage Scheme for Graph Data in Data Lakes

Xue Li, Weibin Zeng, Zhibin Wang, Diwen Zhu, Jingbo Xu, Wenyuan Yu, Jingren Zhou
2024-09-25
Abstract: Data lakes, increasingly adopted for their ability to store and analyze diverse types of data, commonly use columnar storage formats like Parquet and ORC for handling relational tables. However, these traditional setups fall short when it comes to efficiently managing graph data, particularly those conforming to the Labeled Property Graph (LPG) model. To address this gap, this paper introduces GraphAr, a specialized storage scheme designed to enhance existing data lakes for efficient graph data management. Leveraging the strengths of Parquet, GraphAr captures LPG semantics precisely and facilitates graph-specific operations such as neighbor retrieval and label filtering. Through innovative data organization, encoding, and decoding techniques, GraphAr dramatically improves performance. Our evaluations reveal that GraphAr outperforms conventional Parquet and Acero-based methods, achieving an average speedup of 4452x for neighbor retrieval, 14.8x for label filtering, and 29.5x for end-to-end workloads. These findings highlight GraphAr's potential to extend the utility of data lakes by enabling efficient graph data management.
Databases
What problem does this paper attempt to address?
The paper addresses the inefficiency of managing graph data, especially data conforming to the Labeled Property Graph (LPG) model, in existing data lake architectures. Columnar formats such as Parquet and ORC handle relational tables well but fall short when storing and querying graph data. The shortcomings show up in three areas:

1. **Effective Representation of LPG**: Existing columnar formats cannot capture the semantics and relationships of LPG, which makes graph data hard to represent and query accurately.
2. **Efficient Neighbor Retrieval**: Neighbor retrieval, that is, quickly accessing the vertices and edges adjacent to a given vertex, is a fundamental operation in graph queries. Existing columnar formats provide no native support for it, so it performs poorly.
3. **Optimized Label Filtering**: Label filtering excludes irrelevant data early in a graph query. Existing columnar formats lack support for this operation, so label filtering is inefficient.

To address these challenges, the paper introduces GraphAr, a storage scheme designed to extend existing data lakes with efficient graph data management. GraphAr handles each problem as follows:

- **Effective Representation of LPG**: GraphAr uses Parquet as the underlying storage format and adds standardized YAML files that describe the LPG schema metadata. This combination expresses LPG semantics fully while staying compatible with the data lake ecosystem.
- **Efficient Neighbor Retrieval**: GraphAr organizes edges as sorted tables and applies Parquet's delta encoding, which yields a representation similar to CSR or CSC. It also introduces a decoding algorithm that uses BMI and SIMD instruction sets to further accelerate neighbor retrieval.
- **Optimized Label Filtering**: GraphAr applies run-length encoding (RLE) and introduces an interval-based decoding algorithm, which makes label filtering more efficient.

With these techniques, GraphAr significantly improves the performance of graph data management. The experimental results show average speedups of 4,452x for neighbor retrieval, 14.8x for label filtering, and 29.5x for end-to-end workloads. These results indicate that GraphAr can extend the functionality of data lakes so that they manage graph data more efficiently.
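The two storage ideas summarized above (a sorted edge table with delta encoding that behaves like CSR, and run-length-encoded label columns decoded into intervals) can be illustrated with a minimal pure-Python sketch. This is not GraphAr's implementation: the function names are hypothetical, the delta coding is a plain list of differences rather than Parquet's bit-packed page layout, no BMI or SIMD instructions are involved, and combining two labels by interval intersection is an assumption made here for illustration.

```python
from itertools import groupby


def build_csr(edges, num_vertices):
    """Sort edges by (src, dst) and build a CSR-style offset array.

    After sorting, the neighbors of vertex v are dst[offsets[v]:offsets[v + 1]].
    """
    edges = sorted(edges)
    dst = [d for _, d in edges]
    offsets = [0] * (num_vertices + 1)
    for s, _ in edges:
        offsets[s + 1] += 1          # count the out-degree of each source
    for v in range(num_vertices):
        offsets[v + 1] += offsets[v]  # prefix sum turns counts into offsets
    return offsets, dst


def delta_encode(values):
    """Keep the first value, then store each difference to the previous value."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def neighbors(offsets, dst_deltas, v):
    """Return the neighbors of v from the delta-encoded destination column.

    Simplification: the whole column is decoded; a real reader would decode
    only the pages that cover the requested offset range.
    """
    dst = delta_decode(dst_deltas)
    return dst[offsets[v]:offsets[v + 1]]


def rle_encode(flags):
    """Run-length encode a boolean label column as (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(flags)]


def matching_intervals(runs):
    """Turn RLE runs into half-open vertex-ID intervals where the label is set."""
    intervals, pos = [], 0
    for value, length in runs:
        if value:
            intervals.append((pos, pos + length))
        pos += length
    return intervals


def intersect_intervals(a, b):
    """Intersect two sorted interval lists (vertices that carry both labels)."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        lo, hi = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if lo < hi:
            out.append((lo, hi))
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out
```

For example, with edges `[(0, 1), (0, 2), (1, 2), (2, 0), (2, 3), (0, 3)]` over 4 vertices, `build_csr` returns offsets `[0, 3, 4, 6, 6]`, and `neighbors(offsets, delta_encode(dst), 2)` returns `[0, 3]`. Because the label columns are turned into intervals, a filter touches one entry per run instead of one entry per vertex, which is why sorted, run-friendly layouts help.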