RDFGraphGen: A Synthetic RDF Graph Generator based on SHACL Constraints

Marija Vecovska, Milos Jovanovik
2024-07-25
Abstract: This paper introduces RDFGraphGen, a general-purpose, domain-independent generator of synthetic RDF graphs based on SHACL constraints. The Shapes Constraint Language (SHACL) is a W3C standard that specifies ways to validate data in RDF graphs by defining constraining shapes. Although the main purpose of SHACL is the validation of existing RDF data, we envisioned and implemented a reverse role for it in order to address the lack of available RDF datasets in many RDF-based application development processes: we use SHACL shape definitions as a starting point to generate synthetic data for an RDF graph. The generation process involves extracting the constraints from the SHACL shapes, converting the specified constraints into rules, and then generating artificial data for a predefined number of RDF entities based on these rules. RDFGraphGen is intended for generating small, medium or large RDF knowledge graphs for benchmarking, testing, quality control, training and similar uses in applications from the RDF, Linked Data and Semantic Web domain. RDFGraphGen is open-source and is available as a ready-to-use Python package.
Software Engineering, Databases
What problem does this paper attempt to address?
The main goal of this paper is to introduce RDFGraphGen, a general, domain-independent generator that produces synthetic RDF graph data based on SHACL constraints. Specifically, the paper addresses the following issues:

1. **Lack of available RDF datasets**: In the development of RDF applications, domain-specific RDF datasets are often needed for testing, benchmarking, etc., but such datasets are not always readily available.
2. **Absence of a general synthetic RDF data generator**: While there are some RDF data generators for specific tasks or domains, there is no general tool that can generate synthetic RDF data across different domains.

To solve these problems, the paper proposes a new approach that reverses the original use of SHACL constraints: instead of validating RDF data, the constraints serve as the basis for generating synthetic RDF data. This approach allows users to generate synthetic RDF data that conforms to the requirements defined by SHACL shapes from any domain.

RDFGraphGen has the following features:

- **Generality**: It can generate RDF data for any domain as long as the corresponding SHACL shape definitions are provided.
- **Flexibility**: Users can specify the number of entities in the generated dataset, thus creating small, medium, or large knowledge graphs.
- **Open Source**: RDFGraphGen is an open-source project and is available to users as a Python package.

In this way, RDFGraphGen aims to serve various application scenarios in the fields of RDF, Linked Data, and the Semantic Web, such as benchmarking, quality control, training machine learning models, and more.
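
To make the "constraints to rules to data" idea more concrete, here is a minimal Python sketch. It is not RDFGraphGen's actual code or API. It assumes a shape has already been parsed into a plain dictionary whose keys mirror a few SHACL terms (`sh:targetClass`, `sh:path`, `sh:datatype`, `sh:minCount`, `sh:maxCount`, `sh:minInclusive`, `sh:maxInclusive`). The names `PERSON_SHAPE`, `make_literal` and `generate_triples`, as well as the `http://example.org/` IRIs, are hypothetical and exist only for illustration.

```python
import random

XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical, simplified stand-in for a parsed SHACL node shape.
# A real shape would be an RDF graph; this dict only mirrors a few SHACL terms.
PERSON_SHAPE = {
    "targetClass": "http://example.org/Person",
    "properties": [
        {"path": "http://example.org/name", "datatype": XSD_STRING,
         "minCount": 1, "maxCount": 1},
        {"path": "http://example.org/age", "datatype": XSD_INTEGER,
         "minCount": 0, "maxCount": 1,
         "minInclusive": 0, "maxInclusive": 120},
    ],
}


def make_literal(prop, rng):
    """Return one N-Triples literal that respects the property's datatype and range."""
    if prop["datatype"] == XSD_INTEGER:
        value = rng.randint(prop.get("minInclusive", 0),
                            prop.get("maxInclusive", 100))
    else:
        # Placeholder text; a real generator would produce more realistic values.
        value = "value-" + str(rng.randint(0, 9999))
    return '"{}"^^<{}>'.format(value, prop["datatype"])


def generate_triples(shape, entity_count, seed=0):
    """Generate N-Triples lines for `entity_count` entities described by `shape`."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    triples = []
    for i in range(entity_count):
        subject = "<http://example.org/entity/{}>".format(i)
        triples.append("{} <{}> <{}> .".format(
            subject, RDF_TYPE, shape["targetClass"]))
        for prop in shape["properties"]:
            min_count = prop.get("minCount", 0)
            max_count = prop.get("maxCount", max(min_count, 1))
            # Cardinality rule: emit between minCount and maxCount values.
            # Distinctness of repeated values is not enforced in this sketch.
            for _ in range(rng.randint(min_count, max_count)):
                triples.append("{} <{}> {} .".format(
                    subject, prop["path"], make_literal(prop, rng)))
    return triples


if __name__ == "__main__":
    for line in generate_triples(PERSON_SHAPE, entity_count=3):
        print(line)
```

Changing `entity_count` is what scales the output from a small graph to a large one, which corresponds to the flexibility feature described above. A full generator would also need to handle nested shapes, value lists, string patterns and other SHACL constraint components that this sketch leaves out.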