Graph Pattern Detection and Structural Redundancy Reduction to Compress Named Graphs

Tangina Sultana, Md. Delowar Hossain, Muhammad Umair, Muhammad Numan Khan, Aftab Alam, Young-Koo Lee
DOI: https://doi.org/10.1016/j.ins.2023.119428
IF: 8.1
2023-01-01
Information Sciences
Abstract: The flexible paradigm of the Resource Description Framework (RDF) has accelerated the publication of raw data on the web. As a result, the volume of generated RDF data has grown markedly over the last decade, which has promoted compression as a way to manage and reduce the size of RDF datasets. Universal RDF compressors can detect and remove redundancy at the symbolic, syntactic, or semantic level. However, these compressors rarely exploit the graph patterns and structural regularities found in real-world datasets. HDT (Header-Dictionary-Triple) is an efficient approach for compressing RDF datasets based on their structural properties. However, it cannot handle RDF datasets with named graphs, regularities of the graph structure, or structural redundancies, because HDT treats all triples as residing in the same default graph, whereas the triples of an RDF dataset belong to various named graphs. In this study, we propose a novel approach to address these challenges. We introduce a hybrid TI-GI (Triple Interpreter-Graph Interpreter) to manage RDF datasets with named graphs and use a compact RDF serialization. We also propose RDF-RR (RDF-Redundancy Reducer) and object mapping, which detect and remove structural redundancies by identifying common patterns related to the predicates and objects in RDF datasets. We employ a differential compressor to discover frequent graph patterns in a single pass using a data structure-oriented approach. Evaluation on real-world datasets shows that our proposed approach substantially reduces the size of the experimental RDF datasets, by approximately 30.52%, 24.92%, and 26.96% compared with the existing HDT, HDT-FoQ (HDT-Focused on Querying), and 2Tp (two Triple Predicate based index) approaches, respectively. Moreover, the indexing time of our proposed system is reduced by approximately 17.89%, 13.70%, and 9.32% compared with the HDT, HDT-FoQ, and 2Tp approaches, respectively.
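As a rough illustration of the kind of structural redundancy the abstract describes, the sketch below groups subjects within each named graph by their shared predicate-object sets and counts how many terms would be stored if each shared pattern were written only once. This is not the paper's TI-GI or RDF-RR algorithm: the toy quads, the function names `group_by_pattern` and `stored_terms`, and the term-counting cost model are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical toy quads: (subject, predicate, object, named graph).
QUADS = [
    ("ex:alice", "rdf:type", "ex:Person", "ex:g1"),
    ("ex:alice", "ex:livesIn", "ex:Seoul", "ex:g1"),
    ("ex:bob", "rdf:type", "ex:Person", "ex:g1"),
    ("ex:bob", "ex:livesIn", "ex:Seoul", "ex:g1"),
    ("ex:carol", "rdf:type", "ex:Person", "ex:g2"),
    ("ex:carol", "ex:livesIn", "ex:Busan", "ex:g2"),
]


def group_by_pattern(quads):
    """Map (graph, predicate-object set) to the sorted subjects sharing it."""
    per_subject = defaultdict(set)
    for s, p, o, g in quads:
        per_subject[(g, s)].add((p, o))
    patterns = defaultdict(list)
    for (g, s), po_set in per_subject.items():
        patterns[(g, frozenset(po_set))].append(s)
    return {key: sorted(subjects) for key, subjects in patterns.items()}


def stored_terms(patterns):
    """Assumed cost model: one graph term, two terms per predicate-object
    pair, and one term per subject, for each distinct pattern."""
    return sum(
        1 + 2 * len(po_set) + len(subjects)
        for (_graph, po_set), subjects in patterns.items()
    )


patterns = group_by_pattern(QUADS)
flat_terms = 4 * len(QUADS)          # every quad written out in full
grouped_terms = stored_terms(patterns)
print(flat_terms, grouped_terms)
```

On this toy input, `ex:alice` and `ex:bob` share one pattern in `ex:g1`, so the assumed term count drops from 24 to 13. Keeping the graph name in the grouping key is what preserves named-graph membership, which a triples-only encoding would discard.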