Ontology-Based Semantic Search for Open Government Data

Shanshan Jiang, Thomas F. Hagelien, Marit Natvig, Jingyue Li
DOI: https://doi.org/10.1109/icosc.2019.8665522
2019-01-01
Abstract: Open data are increasingly available, but often come with imprecise or incomplete descriptions, which makes discovering relevant datasets difficult and time-consuming. Current open data catalogues mostly provide keyword-based search, without the ability to understand the user's intent or the contextual meaning of the datasets. Ontology-based semantic search has been well explored in the Semantic Web as a way to improve search quality for relevant documents and web pages. This paper applies semantic and machine learning technologies to open data. It presents an approach for searching open government datasets, a relatively underexplored domain, where the semantics of the data rely on the metadata that describes them. The idea is to link published datasets with concepts from a well-defined ontology and to allow searching based on hybrid indexing. A simplified ontology for the transport domain is constructed to demonstrate and test the idea. A prototype search engine has been implemented that supports both manual and automatic linking to concepts in the ontology and exploits hybrid indexing based on these linking methods. Natural language processing (NLP) techniques are applied to dataset linking and indexing, making the approach independent of the natural language used to describe the datasets. Manual linking of datasets to ontology concepts is intended for domain experts and data publishers, while automatic linking is based on the provided dataset descriptions. Automatic linking reduces the overhead of manual concept linking and the dependency on domain experts. Preliminary results indicate that ontology-based semantic search is a promising approach to increasing the quality and efficiency of open data search. The success of the automatic mechanism does, however, depend on the quality and comprehensiveness of the dataset descriptions.
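
The linking and indexing idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' prototype: the toy ontology, the dataset records, and all function names are hypothetical, and plain token matching stands in for the NLP techniques the paper applies. The sketch only shows how manual and automatic concept links might be merged into one concept index and how a query could be expanded to sub-concepts.

```python
import re
from collections import defaultdict

# Hypothetical toy ontology: concept -> (labels, parent concept or None).
ONTOLOGY = {
    "Transport": ({"transport", "traffic"}, None),
    "PublicTransport": ({"bus", "tram", "timetable"}, "Transport"),
    "Parking": ({"parking", "garage"}, "Transport"),
}

def tokens(text):
    """Lowercase word tokens of a text (a stand-in for real NLP)."""
    return set(re.findall(r"[a-z]+", text.lower()))

def auto_link(text):
    """Automatic linking: concepts with at least one label in the text."""
    words = tokens(text)
    return {c for c, (labels, _) in ONTOLOGY.items() if labels & words}

def build_index(datasets):
    """Concept -> dataset ids, merging manual and automatic links."""
    index = defaultdict(set)
    for ds_id, (description, manual_concepts) in datasets.items():
        for concept in set(manual_concepts) | auto_link(description):
            index[concept].add(ds_id)
    return index

def descendants(concept):
    """The concept itself plus all of its sub-concepts."""
    found = {concept}
    changed = True
    while changed:
        changed = False
        for c, (_, parent) in ONTOLOGY.items():
            if parent in found and c not in found:
                found.add(c)
                changed = True
    return found

def search(query, index):
    """Map query words to concepts, expand to sub-concepts, collect hits."""
    hits = set()
    for concept in auto_link(query):
        for c in descendants(concept):
            hits |= index.get(c, set())
    return sorted(hits)

# Hypothetical datasets: id -> (description, manually linked concepts).
DATASETS = {
    "ds1": ("Bus timetable for the city centre", []),
    "ds2": ("Occupancy counts per facility", ["Parking"]),
    "ds3": ("Annual library budget", []),
}
INDEX = build_index(DATASETS)
```

In this sketch, `ds2` is reachable only through its manual link, because its description contains none of the ontology labels. That mirrors the abstract's closing caveat that automatic linking depends on the quality of the dataset descriptions.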