Abstract: We propose a GLL-based context-free path querying algorithm that handles queries in Extended Backus-Naur Form (EBNF) using Recursive State Machines (RSM). Using EBNF allows one to combine traditional regular expressions and mutually recursive patterns in constraints natively. The proposed algorithm solves both the reachability-only and the all-paths problems for the all-pairs and the multiple-sources cases. The evaluation on real-world graphs demonstrates that using RSMs increases the performance of query evaluation. Implemented as a stored procedure for Neo4j, our solution demonstrates better performance than a similar solution for RedisGraph. On regular path queries, the performance of our solution is comparable with that of the native Neo4j solution, and in some cases our solution requires significantly less memory.
What problem does this paper attempt to address?
The paper aims to improve the performance of context-free path querying (CFPQ) in real-world graph databases and to provide an algorithm that handles queries written in Extended Backus-Naur Form (EBNF). Specifically, the authors propose a CFPQ method based on the Generalized LL (GLL) parsing algorithm. The method uses Recursive State Machines (RSM) to evaluate EBNF queries directly, without first converting them to standard Backus-Naur Form (BNF). It improves query evaluation performance, solves both the reachability and the all-paths problems, and applies to both the all-pairs and the multiple-sources cases.
### Main contributions of the paper:
1. **A new GLL-based CFPQ algorithm**: The algorithm handles EBNF queries directly by representing them as RSMs. It solves both the reachability problem and the all-paths problem.
2. **An implementation of the proposed algorithm**: An experimental evaluation shows that using RSMs improves the performance of query evaluation.
3. **Integration with the Neo4j graph database**: The algorithm is provided as a stored procedure that can be called from Cypher to execute CFPQ queries. Cypher itself has not yet been extended to express context-free path patterns, but the implementation already demonstrates good performance.
4. **Performance evaluation**: On multiple real-world graph data sets, the solution is an order of magnitude faster than a similar linear-algebra-based solution for RedisGraph. On regular path queries it is comparable with the native Neo4j solution and in some cases requires significantly less memory.
### Background and motivation:
- **Context-free path querying (CFPQ)**: Allows context-free grammars to be used as constraints on paths in a graph. CFPQ is more expressive than regular path querying (RPQ) and is applied in fields such as bioinformatics, data provenance analysis, and static code analysis.
- **Performance issues**: Existing CFPQ algorithms perform poorly on real-world graph databases, which limits their use. In particular, most existing algorithms solve only the reachability problem and cannot extract all paths that satisfy the constraint.
- **Limitations of matrix representation**: Matrix-based CFPQ algorithms perform well in some cases, but converting a graph to its matrix representation can be time-consuming, so they are not suitable for all scenarios.
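To make the CFPQ reachability problem concrete, here is a minimal Python sketch. It is a naive worklist algorithm over a grammar in binary normal form, not the GLL-based algorithm from the paper, and the function name `cfpq_reachability`, the rule encoding, and the example graph are all assumptions made for illustration. The query asks for vertex pairs connected by a path whose labels form a^n b^n, a pattern that no regular path query can express.

```python
from collections import deque


def cfpq_reachability(edges, terminal_rules, binary_rules, start):
    """Return all (u, v) such that some path u -> v spells a word derivable from `start`.

    edges:          iterable of (u, label, v)
    terminal_rules: dict label -> set of nonterminals A with a rule A -> label
    binary_rules:   dict (B, C) -> set of nonterminals A with a rule A -> B C
    """
    facts = set()   # (nonterminal, u, v): a path u -> v derivable from nonterminal
    work = deque()

    def add(fact):
        if fact not in facts:
            facts.add(fact)
            work.append(fact)

    # Seed the fact set with single edges.
    for u, label, v in edges:
        for nt in terminal_rules.get(label, ()):
            add((nt, u, v))

    # Combine adjacent facts until nothing new can be derived.
    while work:
        x, u, v = work.popleft()
        for y, w, t in list(facts):
            if w == v:  # (x, u, v) followed by (y, v, t)
                for a in binary_rules.get((x, y), ()):
                    add((a, u, t))
            if t == u:  # (y, w, u) followed by (x, u, v)
                for a in binary_rules.get((y, x), ()):
                    add((a, w, v))

    return {(u, v) for nt, u, v in facts if nt == start}


# Hypothetical example: S -> a S b | a b, rewritten in binary form.
edges = [(0, "a", 1), (1, "a", 2), (2, "b", 3), (3, "b", 4)]
terminal_rules = {"a": {"A"}, "b": {"B"}}
binary_rules = {("A", "B"): {"S"}, ("A", "S1"): {"S"}, ("S", "B"): {"S1"}}

pairs = cfpq_reachability(edges, terminal_rules, binary_rules, "S")
print(sorted(pairs))
```

This sketch answers only the reachability question; extracting the paths themselves (the all-paths problem) requires keeping derivation information, which this version discards.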
### Key points of the solution:
- **GLL parsing algorithm**: Generalized LL (GLL) is an efficient parsing algorithm that extends naturally to CFPQ. It can solve not only the reachability problem but also the all-paths problem.
- **Use of RSM**: An RSM is an automaton-like structure that can represent context-free languages: each nonterminal has its own finite automaton whose transitions may refer to other nonterminals. Using RSMs allows complex queries to be processed more efficiently.
- **Direct handling of EBNF**: EBNF is more concise than BNF and better suited to expressing complex constraints. Handling EBNF directly avoids the overhead of converting the query to BNF, which introduces additional nonterminals and rules.
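The relation between EBNF, BNF, and RSMs described above can be illustrated with a small Python sketch. The EBNF rule `S -> a S? b` is encoded as a single RSM box, whereas BNF would need two alternatives (`S -> a S b | a b`). The dictionary layout, the function name `rsm_reachability`, and the naive fixpoint evaluation are assumptions chosen for illustration; this is not the GLL-based algorithm or the data structures used in the paper.

```python
# One box per nonterminal: a finite automaton whose transitions are labeled
# with terminals (edge labels) or nonterminals (calls to other boxes).
# Hypothetical encoding of the EBNF rule  S -> a S? b
rsm = {
    "S": {
        "start": 0,
        "finals": {3},
        "transitions": [
            (0, "a", 1),
            (1, "S", 2),  # optional recursive call
            (1, "b", 3),  # skip the optional part
            (2, "b", 3),
        ],
    }
}


def rsm_reachability(edges, rsm, start_symbol):
    """All (u, v) such that some path u -> v is accepted by the box of start_symbol."""
    vertices = {u for u, _, _ in edges} | {v for _, _, v in edges}
    summaries = set()  # (nonterminal, u, v): known derivable paths
    changed = True
    while changed:  # repeat until no new summary is found
        changed = False
        for nt, box in rsm.items():
            for src in vertices:
                # Explore (graph vertex, box state) pairs reachable from src.
                seen = {(src, box["start"])}
                stack = [(src, box["start"])]
                while stack:
                    vertex, state = stack.pop()
                    if state in box["finals"] and (nt, src, vertex) not in summaries:
                        summaries.add((nt, src, vertex))
                        changed = True
                    for s, symbol, s_next in box["transitions"]:
                        if s != state:
                            continue
                        if symbol in rsm:  # nonterminal: reuse known summaries
                            targets = [v for m, u, v in summaries
                                       if m == symbol and u == vertex]
                        else:              # terminal: follow matching graph edges
                            targets = [v for u, lbl, v in edges
                                       if u == vertex and lbl == symbol]
                        for t in targets:
                            if (t, s_next) not in seen:
                                seen.add((t, s_next))
                                stack.append((t, s_next))
    return {(u, v) for nt, u, v in summaries if nt == start_symbol}


edges = [(0, "a", 1), (1, "a", 2), (2, "b", 3), (3, "b", 4)]
result = rsm_reachability(edges, rsm, "S")
print(sorted(result))
```

The box has four states and no helper nonterminals, while a binary-normal-form version of the same query needs extra rules. This is the kind of overhead that working on the RSM directly avoids.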
### Experimental results:
- **Performance improvement**: On real-world graph data sets, the proposed algorithm outperforms a similar solution for RedisGraph.
- **Memory efficiency**: In some cases, the proposed algorithm requires significantly less memory than the native Neo4j solution.
- **Comparison with native Neo4j**: On regular path queries, performance is comparable with that of Neo4j's native solution, and the stored procedure can be used without extending Cypher.
In conclusion, the paper addresses the performance and functionality gaps of existing methods by proposing a CFPQ algorithm based on GLL and RSMs, providing an effective solution for complex path queries in real-world graph databases.