Abstract

Objective: SNOMED CT is the international lingua franca of terminologies for human health. Because it is based on Description Logics (DL), the terminology supports data queries that draw on inferred relationships between data elements as well as on relationships that are explicitly stated. However, the ontologic and polyhierarchical nature of the SNOMED CT concept model makes it difficult to implement in its entirety within electronic health record systems, which largely employ object-oriented or relational database architectures. The result is reduced data richness, limited query capability, and increased system overhead. The hypothesis of this research was that a graph database (graph DB) architecture, using SNOMED CT as the basis for the data model and modeling patient data upon the semantic core of SNOMED CT, could exploit the full value of the terminology to enrich patient data sets and support advanced queries over them.

Methods: The hypothesis was tested by instantiating a graph DB with the fully classified SNOMED CT concept model. The graph DB instance was tested for integrity by calculating the transitive closure table for the SNOMED CT hierarchy and comparing the results with transitive closure tables created using current, validated methods. The graph DB was then populated with 461,171 anonymized patient record fragments and over 2.1 million associated SNOMED CT clinical findings. Queries, including concept negation and disjunction, were then run against the graph DB and against an enterprise Oracle relational database (RDBMS) holding the same patient data sets. Finally, the graph DB was populated with laboratory data encoded using LOINC and medication data encoded using RxNorm, and complex queries combining LOINC, RxNorm, and SNOMED CT were performed to identify uniquely described patient populations.

Results: A graph DB instance was successfully created for two international releases of SNOMED CT and two US SNOMED CT editions. Transitive closure tables and descriptive statistics generated using the graph DB were identical to those generated using validated methods. Patient queries produced patient counts identical to those from the Oracle RDBMS, with comparable query times. Database queries involving the defining attributes of SNOMED CT concepts were possible with the graph DB. The same queries could not be performed directly against the Oracle RDBMS representation of the patient data; they required the creation and use of external terminology services. Further, queries of undefined depth successfully identified unknown relationships between patient cohorts.

Conclusion: The results of this study supported the hypothesis that it is possible to build a patient database upon and around the semantic model of SNOMED CT. The model supported queries that leveraged all aspects of the SNOMED CT logical model to produce clinically relevant query results. Logical disjunction and negation queries were possible using the data model, as were queries that extended beyond the structural IS_A hierarchy of SNOMED CT to use the defining attribute-values of SNOMED CT concepts as search parameters. As medical terminologies such as SNOMED CT continue to expand, they will become more complex, and model consistency will become more difficult to assure. At the same time, consumers of data will increasingly demand improved query functionality that accommodates additional granularity of clinical concepts without sacrificing speed. This new line of research provides an alternative approach to instantiating and querying patient data represented using advanced computable clinical terminologies.
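The integrity check and the negation and disjunction queries described in the Methods can be illustrated with a minimal Python sketch. It is not the study's implementation: the concept names, the `IS_A` edges, the patient records, and the helper functions are all hypothetical, an in-memory dictionary stands in for the graph DB, and negation is treated as a simple set difference over recorded findings.

```python
# Hypothetical toy IS_A edges (child -> direct parents). The names are
# illustrative placeholders, not real SNOMED CT identifiers.
IS_A = {
    "viral_pneumonia": {"pneumonia", "viral_infection"},  # polyhierarchy: two parents
    "bacterial_pneumonia": {"pneumonia"},
    "pneumonia": {"lung_disease"},
    "viral_infection": {"infectious_disease"},
    "lung_disease": {"disease"},
    "infectious_disease": {"disease"},
    "type_2_diabetes": {"diabetes"},
    "diabetes": {"disease"},
    "disease": set(),
}

# Hypothetical patient records: patient id -> recorded finding concepts.
PATIENT_FINDINGS = {
    "p1": {"viral_pneumonia"},
    "p2": {"bacterial_pneumonia", "type_2_diabetes"},
    "p3": {"type_2_diabetes"},
    "p4": {"viral_infection"},
}


def transitive_closure(is_a):
    """Map each concept to the set of all its ancestors (direct and inferred)."""
    closure = {}
    for concept, parents in is_a.items():
        ancestors, stack = set(), list(parents)
        while stack:
            parent = stack.pop()
            if parent not in ancestors:
                ancestors.add(parent)
                stack.extend(is_a.get(parent, ()))
        closure[concept] = ancestors
    return closure


def descendants_or_self(closure, concept):
    """All concepts subsumed by `concept`, including the concept itself."""
    return {c for c, ancestors in closure.items() if concept in ancestors} | {concept}


def patients_with(closure, findings, concept):
    """Patients with at least one finding subsumed by `concept`."""
    targets = descendants_or_self(closure, concept)
    return {patient for patient, codes in findings.items() if codes & targets}


CLOSURE = transitive_closure(IS_A)

pneumonia = patients_with(CLOSURE, PATIENT_FINDINGS, "pneumonia")
diabetes = patients_with(CLOSURE, PATIENT_FINDINGS, "diabetes")

either = pneumonia | diabetes                  # disjunction: pneumonia OR diabetes
pneumonia_not_diabetes = pneumonia - diabetes  # negation: pneumonia AND NOT diabetes
```

In this toy data, a query for `infectious_disease` returns patient `p1` even though only `viral_pneumonia` was recorded, because the closure carries the inferred ancestry. A closure table produced this way could be compared row by row with one generated by an independent method, which is the kind of integrity check the Methods describe.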

The CALBC RDF Triple store: retrieval over large literature content

Assessment of NER Solutions Against the First and Second CALBC Silver Standard Corpus

Application of the RDF framework to integrate heterogenous experimental data of a large chemo- and biodiverse collection from a research collaborative project

Enabling Published Taxonomic Data to be used to Address the Biodiversity Crisis: Biodiversity Literature Repository and TreatmentBank

A journey to Semantic Web query federation in the life sciences

OC-2-KB: A Software Pipeline to Build an Evidence-Based Obesity and Cancer Knowledge Base

The biomedical relationship corpus of the BioRED track at the BioCreative VIII challenge and workshop

The BioLexicon: a large-scale terminological resource for biomedical text mining

SIB Literature Services: RESTful customizable search engines in biomedical literature, enriched with automatically mapped biomedical concepts

RRD-Bio: Building An Integrated Research Resource Database for Biomedicine

Web services-based text-mining demonstrates broad impacts for interoperability and process simplification

qEndpoint: A novel triple store architecture for large RDF graphs

Efficiently querying rdf data in triple stores.

A large-scale evaluation of NLP-derived chemical-gene/protein relationships from the scientific literature: Implications for knowledge graph construction

SemOpenAlex: The Scientific Landscape in 26 Billion RDF Triples

Textpresso Central: a customizable platform for searching, text mining, viewing, and curating biomedical literature

BioCreative V CDR task corpus: a resource for chemical disease relation extraction

Large-Scale Knowledge Synthesis and Complex Information Retrieval from Biomedical Documents

Bibliometric Data Fusion for Biomedical Information Retrieval

An alternative database approach for management of SNOMED CT and improved patient data queries

CADRE: A Collaborative, Cloud-Based Solution for Big Bibliographic Data Research in Academic Libraries