VOYAGE: A Large Collection of Vocabulary Usage in Open RDF Datasets

Qi Shi, Jianli Wang, Jeff Z. Pan, Gong Cheng
DOI: https://doi.org/10.5281/zenodo.7902674
2023-01-01
Abstract: List of files:

odps.json: for each of the accessed ODPs, its name, URL, API type, API URL, and the IDs of RDF datasets collected from it.
JSON structure: a list of objects, where each object contains the following attributes: 'name' (string), 'URL' (string), 'API type' (string), 'API URL' (string), and 'collected datasets IDs' (list of integers).

datasets.json: for each of the crawled RDF datasets, its ID, title, description, author, license, dump file URLs, and PLDs.
JSON structure: a list of objects, where each object contains the following attributes: 'ID' (integer), 'title' (string), 'description' (string), 'author' (string), 'license' (string), 'dump file URLs' (list of strings), and 'PLDs' (list of strings).

deduplicated_datasets.json: the IDs of the deduplicated RDF datasets and whether they are in the LOD Cloud.
JSON structure: a list of objects, where each object contains the following attributes: 'ID' (integer) and 'in LOD Cloud' (boolean).

terms.json: the extracted classes, properties, and the IDs of RDF datasets using each term.
JSON structure: a list of objects, where each object contains the following attributes: 'term' (string), 'is class' (boolean), 'is property' (boolean), and 'used in dataset IDs' (list of integers).

vocabularies.json: the extracted vocabularies, the classes and properties in each vocabulary, and the IDs of RDF datasets using each vocabulary.
JSON structure: a list of objects, where each object contains the following attributes: 'vocabulary' (string), 'classes' (list of strings), 'properties' (list of strings), and 'used in dataset IDs' (list of integers).

edps.json: the extracted distinct EDPs and the IDs of RDF datasets using each EDP.
JSON structure: a list of objects, where each object contains the following attributes: 'classes' (list of strings), 'forward properties' (list of strings), 'backward properties' (list of strings), and 'used in dataset IDs' (list of integers).

clusters.json: the clusters of vocabularies generated by MV-ITCC and LDA.
JSON structure: {"LDA": {"vocabularies": {VOCABULARY_CLUSTER_ID_1: [LIST_OF_VOCABULARIES], VOCABULARY_CLUSTER_ID_2: [LIST_OF_VOCABULARIES], ...}}, "MV-ITCC": {"vocabularies": {VOCABULARY_CLUSTER_ID_1: [LIST_OF_VOCABULARIES], VOCABULARY_CLUSTER_ID_2: [LIST_OF_VOCABULARIES], ...}, "dataset IDs": {DATASET_CLUSTER_ID_1: [LIST_OF_DATASET_IDS], DATASET_CLUSTER_ID_2: [LIST_OF_DATASET_IDS], ...}}}
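The file structures described above can be read with any JSON library. The following is a minimal Python sketch, assuming the files have been downloaded locally and match the attribute names listed in this abstract. The helper names (load_json, top_terms, vocabulary_to_cluster) and the sample records are illustrative assumptions, not part of the dataset; the sample values are invented solely to make the example runnable.

```python
import json


def load_json(path):
    """Read one of the JSON files (e.g. 'terms.json') from a local path."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def top_terms(terms, k=10, kind=None):
    """Rank terms.json records by how many datasets use each term.

    kind may be None (all terms), "class", or "property"; it selects on the
    'is class' / 'is property' attributes described in the abstract.
    """
    if kind is not None:
        terms = [t for t in terms if t["is " + kind]]
    counts = [(t["term"], len(set(t["used in dataset IDs"]))) for t in terms]
    # Most widely used first; ties broken alphabetically for stable output.
    counts.sort(key=lambda pair: (-pair[1], pair[0]))
    return counts[:k]


def vocabulary_to_cluster(clusters, method):
    """Invert clusters[method]['vocabularies'] into {vocabulary: cluster ID}.

    method is "LDA" or "MV-ITCC". JSON object keys are always strings, so
    cluster IDs come back as strings.
    """
    mapping = {}
    for cluster_id, vocabularies in clusters[method]["vocabularies"].items():
        for vocabulary in vocabularies:
            mapping[vocabulary] = cluster_id
    return mapping


# Hypothetical sample records shaped like terms.json and clusters.json.
SAMPLE_TERMS = [
    {"term": "http://xmlns.com/foaf/0.1/Person", "is class": True,
     "is property": False, "used in dataset IDs": [1, 2, 3]},
    {"term": "http://xmlns.com/foaf/0.1/name", "is class": False,
     "is property": True, "used in dataset IDs": [1, 2]},
    {"term": "http://purl.org/dc/terms/title", "is class": False,
     "is property": True, "used in dataset IDs": [3]},
]

SAMPLE_CLUSTERS = {
    "LDA": {"vocabularies": {"0": ["http://xmlns.com/foaf/0.1/"],
                             "1": ["http://purl.org/dc/terms/"]}},
    "MV-ITCC": {"vocabularies": {"0": ["http://xmlns.com/foaf/0.1/",
                                       "http://purl.org/dc/terms/"]},
                "dataset IDs": {"0": [1, 2], "1": [3]}},
}

print(top_terms(SAMPLE_TERMS, k=2))
print(top_terms(SAMPLE_TERMS, k=1, kind="property"))
print(vocabulary_to_cluster(SAMPLE_CLUSTERS, "LDA"))
```

With the real files, the sample records would be replaced by calls such as load_json("terms.json") and load_json("clusters.json"), where the paths are whatever local locations the files were saved to.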