Abstract: As the volume of RDF data grows, a distributed database system becomes essential for managing it. In distributed RDF data design, the data is commonly partitioned into parts, called fragments, which are then distributed across sites. The distribution design therefore consists of two steps: fragmentation and allocation. In this paper, we propose a method that exploits the intrinsic similarities among the structures of queries in a workload for fragmentation and allocation, with the aim of reducing the number of crossing matches and the communication cost during SPARQL query processing. Specifically, we mine and select frequent access patterns that reflect the characteristics of the workload. We prove that selecting the optimal set of frequent access patterns is NP-hard, and we propose a heuristic algorithm that guarantees both data integrity and the approximation ratio. Based on the selected frequent access patterns, we propose two fragmentation strategies, vertical and horizontal, that divide RDF graphs to meet different query processing objectives: vertical fragmentation targets better throughput, and horizontal fragmentation targets better performance. After fragmentation, we discuss how to allocate the fragments to sites. Finally, we discuss how to process a query based on the results of fragmentation and allocation. Extensive experiments confirm the superior performance of our proposed solutions.
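The pipeline summarized in the abstract (mine frequent access patterns, fragment, allocate) can be illustrated with a deliberately simplified sketch. It is not the paper's algorithm: it assumes each access pattern is a single predicate rather than a subgraph, uses a plain support threshold instead of the approximation heuristic, and places fragments round-robin. The function names, the `_cold` fragment, and the sample data are all hypothetical.

```python
from collections import Counter

def frequent_predicates(workload, min_support):
    """Return the predicates that appear in at least min_support queries.

    Assumption: a query is modeled as a list of predicates, and an
    "access pattern" is a single predicate (the paper uses subgraph patterns).
    """
    counts = Counter(p for query in workload for p in set(query))
    return {p for p, c in counts.items() if c >= min_support}

def vertical_fragments(triples, frequent):
    """Group triples by frequent predicate.

    Triples whose predicate is not frequent go to one catch-all "_cold"
    fragment, so every triple lands in exactly one fragment.
    """
    fragments = {}
    for s, p, o in triples:
        key = p if p in frequent else "_cold"
        fragments.setdefault(key, []).append((s, p, o))
    return fragments

def allocate(fragments, num_sites):
    """Assign fragments to sites round-robin, in sorted key order (a placeholder policy)."""
    return {key: i % num_sites for i, key in enumerate(sorted(fragments))}

# Hypothetical workload: each query lists the predicates it touches.
workload = [
    ["knows", "name"],
    ["knows", "age"],
    ["name"],
]
# Hypothetical RDF triples (subject, predicate, object).
triples = [
    ("alice", "knows", "bob"),
    ("alice", "name", "Alice"),
    ("bob", "age", "30"),
]

frequent = frequent_predicates(workload, min_support=2)
fragments = vertical_fragments(triples, frequent)
placement = allocate(fragments, num_sites=2)
```

In this toy setting, a query that touches only one frequent predicate can be answered from a single fragment, which is the intuition behind reducing crossing matches.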

Scalable SPARQL querying processing on large RDF data in cloud computing environment

Scalable RDF store based on HBase and MapReduce

SparkRDF: Elastic Discreted RDF Graph Processing Engine with Distributed Memory

RCFile: A Fast and Space-Efficient Data Placement Structure in MapReduce-based Warehouse Systems

Gsmat: A Scalable Sparse Matrix-based Join for SPARQL Query Processing

A partition-based Summary-Graph-Driven Method for Efficient RDF Query Processing

The performance of MapReduce: an in-depth study

The Performance of MapReduce

Query Workload-based RDF Graph Fragmentation and Allocation

A Hybrid Approach Combining R*-Tree and k-d Trees to Improve Linked Open Data Query Performance

Rainbow: A Distributed and Hierarchical RDF Triple Store with Dynamic Scalability

Adaptive Distributed RDF Graph Fragmentation and Allocation Based on Query Workload

A Survey of Distributed RDF Data Management

Query optimization for massively parallel data processing

Super Rack: Reusing the Results of Queries in MapReduce Systems

SparkRDF: In-Memory Distributed RDF Management Framework for Large-Scale Social Data

Scalable Parallel Join for Huge Tables

SPARQL Query Parallel Processing: A Survey

An Effective High-Performance Multiway Spatial Join Algorithm with Spark

Reusing the Results of Queries in MapReduce Systems by Adopting Shared Storage

Hybrid storage architecture and efficient MapReduce processing for unstructured data