Abstract: In the era of big data, the volume of semantic data is growing rapidly. Large-scale semantic data contains much significant but often implicit information that must be derived by reasoning, and reasoning over such data is challenging. On the one hand, traditional single-node reasoning systems can hardly cope with such a large amount of data because of resource limitations. On the other hand, existing large-scale reasoning systems are not very efficient or scalable because of the complexity of the reasoning process. In this paper, we propose Cichlid, an efficient distributed reasoning engine for the widely used RDFS and OWL Horst rule sets. Cichlid is built on top of Spark and implements parallel reasoning algorithms with the Spark RDD programming model. We optimize the parallel RDFS reasoning algorithm in three respects: the data partition model, the execution order of the reasoning rules, and the removal of duplicated data. For the parallel OWL reasoning process, we optimize the most time-consuming parts: large-scale data joins, the transitive closure computation, and the equivalence relation computation. In addition to these optimizations at the reasoning algorithm level, we also optimize the internal Spark execution mechanism by proposing an off-heap memory storage mechanism for RDDs. This system-level optimization patch has been accepted and integrated into Apache Spark 1.0. The experimental results show that Cichlid is around 10 times faster on average than state-of-the-art distributed reasoning systems on both large-scale synthetic and real-world benchmarks. The proposed reasoning algorithms and engine also achieve excellent scalability and fault tolerance.
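
To make two of the steps named in the abstract concrete (transitive closure computation and removal of duplicated data), the following is a minimal single-machine sketch in plain Python. It is not Cichlid's implementation and does not use Spark; the function names `transitive_closure` and `rdfs_reason`, the `ex:` sample data, and the restriction to the `rdfs:subClassOf` transitivity and type-inheritance rules are all assumptions made for illustration.

```python
SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"


def transitive_closure(edges):
    """Semi-naive closure: each round joins only newly derived pairs with the base edges."""
    edges = set(edges)
    succ = {}
    for a, b in edges:
        succ.setdefault(a, set()).add(b)
    closure = set(edges)
    delta = set(edges)
    while delta:
        # Join (a, b) from the last round with base edges (b, c) to get (a, c).
        derived = {(a, c) for a, b in delta for c in succ.get(b, ())}
        # Keep only pairs not seen before; this is the duplicate-removal step.
        delta = derived - closure
        closure |= delta
    return closure


def rdfs_reason(triples):
    """Apply subClassOf transitivity and type inheritance (illustrative subset of RDFS)."""
    triples = set(triples)
    sub_edges = {(s, o) for s, p, o in triples if p == SUBCLASS}
    closure = transitive_closure(sub_edges)
    # Index the closed hierarchy: class -> all of its superclasses.
    supers = {}
    for a, b in closure:
        supers.setdefault(a, set()).add(b)
    inferred = {(a, SUBCLASS, b) for a, b in closure}
    # An instance of a class is also an instance of every superclass.
    inferred |= {
        (s, TYPE, c)
        for s, p, o in triples
        if p == TYPE
        for c in supers.get(o, ())
    }
    return triples | inferred


if __name__ == "__main__":
    data = {
        ("ex:Cat", SUBCLASS, "ex:Mammal"),
        ("ex:Mammal", SUBCLASS, "ex:Animal"),
        ("ex:tom", TYPE, "ex:Cat"),
    }
    for triple in sorted(rdfs_reason(data)):
        print(triple)
```

Closing the class hierarchy first and only then propagating instance types means the type rule needs a single pass over the instance triples. This illustrates why the execution order of rules can matter, though the ordering Cichlid actually uses is not specified in the abstract.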

Goldfish: A Large Scale Semantic Data Store and Query System Based on Boolean Matrix Factorization

Large-Scale Real-Time Semantic Processing Framework for Internet of Things

A Scalable Semantic Grid Framework -- VDHA_Grid.

Implementation of large-scale distributed information retrieval system

SparkRDF: Elastic Discreted RDF Graph Processing Engine with Distributed Memory

Scalable RDF store based on HBase and MapReduce

Rainbow: A Distributed and Hierarchical Rdf Triple Store with Dynamic Scalability

D-Ocean: an Unstructured Data Management System for Data Ocean Environment

SparkRDF: In-Memory Distributed RDF Management Framework for Large-Scale Social Data.

Distributed Semantic Web Data Management in HBase and MySQL Cluster

Efficient Distributed Query Processing in Large RFID-enabled Supply Chains

Research on Semantic++ Computing Based on Big Data Environment

Cichlid: Efficient Large Scale RDFS/OWL Reasoning with Spark

A Method of Semantic Web Data Division and Parallel Loading Based on OWL

OceanStore: An Extremely Wide-Area Storage System

Block Storage Optimization and Parallel Data Processing and Analysis of Product Big Data Based on the Hadoop Platform

Semantic-based Big Data integration framework using scalable distributed ontology matching strategy

Online Enhanced Semantic Hashing: Towards Effective and Efficient Retrieval for Streaming Multi-Modal Data

A Semantic++ MapReduce Parallel Programming Model.

A new data-intensive parallel processing framework for spatial data

FISHNET: Financial Intelligence from Sub-querying, Harmonizing, Neural-Conditioning, Expert Swarms, and Task Planning