Abstract: How can we mine frequent path regularities from a graph with edge labels and vertex attributes? The task of association rule mining successfully discovers regular patterns in item sets and substructures. Still, to the best of our knowledge, this concept has not yet been extended to path patterns in large property graphs. In this paper, we introduce the problem of path association rule mining (PARM). Applied to any *reachability path* between two vertices within a large graph, PARM discovers regular ways in which path patterns, identified by vertex attributes and edge labels, co-occur with each other. We develop an efficient and scalable algorithm PIONEER that exploits an anti-monotonicity property to effectively prune the search space. Further, we devise approximation techniques and employ parallelization to achieve scalable path association rule mining. Our experimental study using real-world graph data verifies the significance of path association rules and the efficiency of our solutions.
What problem does this paper attempt to address?
This paper addresses the following problem: how to mine frequent path patterns and their association rules from large property graphs with edge labels and vertex attributes. Specifically, it introduces the Path Association Rule Mining (PARM) problem, which aims to discover co-occurrence rules among path patterns, identified by vertex attributes and edge labels, on the reachability paths between two vertices in a large graph.
### Problem Background
Traditional association rule mining methods have successfully discovered regular patterns in item sets and substructures, but they have not been extended to path patterns in large graphs. Existing graph association rule mining methods have the following limitations:
1. **Vertex Restriction**: They require the set of vertices in the consequent to be a subset of the set of vertices in the antecedent, so they cannot handle association rules that introduce new vertices.
2. **Pattern Restriction**: They consider only specific restricted graph patterns, such as a single edge or a subgraph without attributes, and cannot handle edge labels and vertex attributes simultaneously.
3. **Lack of Reachability Patterns**: They cannot capture reachability patterns between vertices, that is, cases where one vertex is reachable from another through any number of label-constrained directed edges.
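To make the notion of label-constrained reachability in item 3 concrete, here is a minimal Python sketch. It assumes a toy graph stored as `(source, label, target)` triples and a hypothetical helper `reachable_via_label`; it illustrates the concept only and is not the paper's implementation.

```python
from collections import deque

def reachable_via_label(edges, source, label):
    """Return all vertices reachable from `source` by following one or more
    directed edges that all carry `label` (hypothetical helper)."""
    # Keep only edges with the requested label.
    adj = {}
    for u, lbl, v in edges:
        if lbl == label:
            adj.setdefault(u, []).append(v)

    # Plain breadth-first search over the filtered edges.
    seen = set()
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

# Toy data (made up for illustration).
toy_edges = [
    ("a", "follows", "b"),
    ("b", "follows", "c"),
    ("c", "likes", "d"),
    ("d", "follows", "e"),
]
reached = reachable_via_label(toy_edges, "a", "follows")
print(sorted(reached))
```

In this toy graph, `d` and `e` are not reached from `a` because the only way to get there passes through a `likes` edge, which violates the label constraint.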
### Solution
To solve the above problems, the authors propose Path Association Rule Mining (PARM), and its main contributions include:
- **Concept**: A new, concise and elegant concept, the path association rule, is proposed to express the co-occurrence of vertex attribute sets and edge label sequences. It supports metrics such as absolute support, relative support, confidence and lift.
- **Algorithm**: An efficient and scalable algorithm, Pioneer, is developed. It effectively prunes candidate frequent path patterns by using the anti-monotonicity property and supports parallelization to improve scalability.
- **Application**: The effectiveness of path association rule mining in bias checking and knowledge extraction is demonstrated.
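The metrics named in the Concept item can be illustrated with the standard association-rule formulas. The sketch below assumes that each path pattern is represented by the set of vertices that match it and that support is counted over those vertices; this counting convention and the function name `rule_metrics` are assumptions made for illustration, not definitions taken from the paper.

```python
def rule_metrics(antecedent_matches, consequent_matches, num_vertices):
    """Standard association-rule metrics for a rule antecedent => consequent.

    Each argument set holds the vertices matching one path pattern
    (an assumed representation); num_vertices is the graph size.
    """
    both = antecedent_matches & consequent_matches
    abs_support = len(both)                    # vertices matching both patterns
    rel_support = abs_support / num_vertices   # fraction of all vertices
    confidence = (abs_support / len(antecedent_matches)
                  if antecedent_matches else 0.0)
    consequent_rate = len(consequent_matches) / num_vertices
    lift = confidence / consequent_rate if consequent_rate else 0.0
    return {
        "absolute_support": abs_support,
        "relative_support": rel_support,
        "confidence": confidence,
        "lift": lift,
    }

# Toy data (made up): 10 vertices, 4 match the antecedent, 3 the consequent.
metrics = rule_metrics({1, 2, 3, 4}, {3, 4, 5}, num_vertices=10)
print(metrics)
```

A lift above 1 indicates that the two patterns co-occur more often than they would if they were independent.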
### Key Technologies
- **Anti-Monotonicity Property**: The anti-monotonicity of path patterns is used to reduce the number of candidate path patterns.
- **Boundary Pruning**: Path patterns that do not meet the conditions are pruned based on upper-bound estimation.
- **Enhanced Candidate Generation**: More complex path patterns are generated by combining vertical expansion and horizontal combination.
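The following is a simplified, Apriori-style sketch of how anti-monotonicity prunes candidates. It assumes path patterns are plain sequences of edge labels and that the support of a pattern is the number of source vertices with at least one matching path; under that assumption, extending a pattern by one label can never increase its support, so infrequent patterns are never extended. Vertex attributes, upper-bound pruning, and the paper's vertical and horizontal candidate generation are not modeled, and `frequent_label_paths` is a hypothetical name.

```python
def frequent_label_paths(edges, min_support, max_length):
    """Level-wise mining of frequent edge-label sequences (illustrative only)."""
    # out[u][label] = set of targets reachable from u over one `label` edge.
    out = {}
    for u, lbl, v in edges:
        out.setdefault(u, {}).setdefault(lbl, set()).add(v)

    # frontier[pattern][source] = end vertices of paths matching `pattern`.
    frontier = {}
    for u, by_label in out.items():
        for lbl, targets in by_label.items():
            frontier.setdefault((lbl,), {})[u] = set(targets)

    frequent = {}
    length = 1
    while frontier:
        # Anti-monotone pruning: drop patterns below the support threshold.
        survivors = {p: m for p, m in frontier.items() if len(m) >= min_support}
        for pattern, matches in survivors.items():
            frequent[pattern] = len(matches)
        if length == max_length:
            break

        # Extend only the surviving patterns by one more edge label.
        next_frontier = {}
        for pattern, matches in survivors.items():
            for src, ends in matches.items():
                for end in ends:
                    for lbl, targets in out.get(end, {}).items():
                        bucket = next_frontier.setdefault(pattern + (lbl,), {})
                        bucket.setdefault(src, set()).update(targets)
        frontier = next_frontier
        length += 1
    return frequent

# Toy data (made up for illustration).
toy_graph = [
    ("a", "follows", "b"),
    ("b", "follows", "c"),
    ("c", "likes", "d"),
    ("d", "follows", "e"),
    ("b", "likes", "d"),
]
result = frequent_label_paths(toy_graph, min_support=2, max_length=2)
print(result)
```

Here the pattern `("follows", "follows")` matches only the source vertex `a`, so it falls below the threshold and would not be extended further if longer patterns were requested.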
### Experimental Results
Experiments on four real-world large graph datasets verify the effectiveness and efficiency of the proposed algorithm. Compared with the baseline methods, Pioneer is up to 151 times faster, and the approximate scheme is up to 485 times faster at the cost of a small loss of accuracy.
In summary, this paper aims to solve the limitations of existing graph association rule mining methods in handling path patterns, proposes a new path association rule mining framework, and verifies its effectiveness and superiority through efficient algorithm implementation and practical applications.