Knowledge Rich Natural Language Queries over Structured Biological Databases

Hasan M. Jamil
DOI: https://doi.org/10.48550/arXiv.1703.10692
2017-03-31
Abstract: Increasingly, keyword, natural language and NoSQL queries are being used for information retrieval from traditional as well as non-traditional databases such as web, document, image, GIS, legal, and health databases. While their popularity is undeniable for obvious reasons, their engineering is far from simple. For the most part, semantics- and intent-preserving mapping of a well understood natural language query expressed over a structured database schema to a structured query language is still a difficult task, and research to tame the complexity is intense. In this paper, we propose a multi-level knowledge-based middleware to facilitate such mappings that separates the conceptual level from the physical level. We augment these multi-level abstractions with a concept reasoner and a query strategy engine to dynamically link arbitrary natural language querying to well defined structured queries. We demonstrate the feasibility of our approach by presenting a Datalog-based prototype system, called BioSmart, that can compute responses to arbitrary natural language queries over arbitrary databases once a syntactic classification of the natural language query is made.
Databases
What problem does this paper attempt to address?
The paper addresses the problem of answering natural-language queries over structured biological databases. Specifically, it focuses on mapping a natural-language query, expressed over a structured database schema, to a structured query language (such as SQL) in a way that preserves the query's semantics and intent, so that information can be retrieved from traditional and non-traditional databases. To facilitate this mapping, the paper proposes a multi-level knowledge-based middleware that separates the conceptual level from the physical level and augments these abstractions with a concept reasoner and a query strategy engine, which dynamically link arbitrary natural-language queries to well-defined structured queries. The paper demonstrates the feasibility of the approach with a Datalog-based prototype system, BioSmart, which can compute responses to arbitrary natural-language queries over arbitrary databases once a syntactic classification of the query is made. In short, the main goal is a system that can understand and execute complex natural-language queries, especially over biological databases in the life sciences, thereby improving the convenience and efficiency of data access. Key technical challenges include natural-language processing, query optimization, and the effective use of background knowledge to improve the quality of query responses.
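To make the separation of conceptual and physical levels concrete, the following is a minimal Python sketch and not BioSmart's Datalog implementation. It assumes the natural-language query has already been syntactically classified into the form "find <target> related to <filter concept> X". The concept map, the link table, the schema, the toy rows, and the function names `build_sql` and `demo` are all hypothetical and chosen only for illustration.

```python
import sqlite3

# Hypothetical conceptual level: concept names a user might mention, mapped to
# physical tables and columns. None of these names come from the paper.
CONCEPT_TO_PHYSICAL = {
    "gene": {"table": "genes", "key": "gene_id", "label": "symbol"},
    "disease": {"table": "diseases", "key": "disease_id", "label": "name"},
}

# Hypothetical link between two concepts: (link table, target key, filter key).
# It stands in for what a concept reasoner might infer about how concepts join.
LINKS = {("gene", "disease"): ("gene_disease", "gene_id", "disease_id")}


def build_sql(target, filter_concept):
    """Turn an already-classified query into a parameterized SQL string."""
    t = CONCEPT_TO_PHYSICAL[target]
    f = CONCEPT_TO_PHYSICAL[filter_concept]
    link_table, t_key, f_key = LINKS[(target, filter_concept)]
    return (
        f"SELECT t.{t['label']} FROM {t['table']} AS t "
        f"JOIN {link_table} AS l ON l.{t_key} = t.{t['key']} "
        f"JOIN {f['table']} AS f ON f.{f['key']} = l.{f_key} "
        f"WHERE f.{f['label']} = ? ORDER BY t.{t['label']}"
    )


def demo():
    """Run the generated SQL against a toy in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE genes (gene_id INTEGER PRIMARY KEY, symbol TEXT);
        CREATE TABLE diseases (disease_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE gene_disease (gene_id INTEGER, disease_id INTEGER);
        INSERT INTO genes VALUES (1, 'BRCA1'), (2, 'TP53'), (3, 'CFTR');
        INSERT INTO diseases VALUES (10, 'breast cancer'), (20, 'cystic fibrosis');
        INSERT INTO gene_disease VALUES (1, 10), (2, 10), (3, 20);
    """)
    # "Find genes related to the disease 'breast cancer'", after classification.
    rows = conn.execute(build_sql("gene", "disease"), ("breast cancer",)).fetchall()
    conn.close()
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(demo())
```

In this sketch the user-facing vocabulary ("gene", "disease") never refers to table or column names directly, so the physical schema could change by editing only the two dictionaries. The filter value is passed as a bound parameter rather than interpolated into the SQL text.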