Utilization of Electronic Medical Records and Biomedical Literature to Support Rare Disease Diagnosis (Preprint)
Feichen Shen, Sijia Liu, Yanshan Wang, Andrew Wen, Liwei Wang, Hongfang Liu
DOI: https://doi.org/10.2196/preprints.11301
2018-06-15
Abstract: BACKGROUND In the United States, rare diseases are defined as those affecting fewer than 200,000 patients at any given time. Patients with rare diseases are frequently misdiagnosed or left undiagnosed, possibly due in part to care providers’ lack of knowledge of or experience with the disease. As the volume of electronically accessible medical data grows exponentially, a large amount of information on thousands of rare diseases and their potentially associated diagnostic information is buried in electronic medical records (EMRs) and medical literature. OBJECTIVE We hypothesize that patients’ phenotypic information available within these heterogeneous resources (e.g., EMRs and biomedical literature) can be leveraged to accelerate disease diagnosis. We therefore aimed to use information contained in heterogeneous datasets to assist rare disease diagnosis. METHODS In a previous study, we proposed a collaborative filtering recommendation system, enriched with natural language processing and semantic techniques, to assist rare disease diagnosis based on phenotypic characterizations derived solely from EMR data. To further investigate the performance of collaborative filtering on heterogeneous datasets, we studied EMR data generated at Mayo Clinic as well as published article abstracts retrieved from the Semantic MEDLINE Database. Specifically, we applied Tanimoto coefficient similarity, overlap coefficient similarity, Fager & McGowan coefficient similarity, and log-likelihood ratio similarity with K-nearest neighbor and threshold-based patient neighbor algorithms on various combinations of datasets. RESULTS We evaluated different approaches to this problem using characterizations derived from various combinations of EMR data and literature, as well as from EMR data alone.
We extracted 12.8 million EMRs from the Mayo Clinic unstructured patient cohort generated between 2010 and 2015 and retrieved all article abstracts published through the end of 2016 from the semi-structured Semantic MEDLINE Database. We applied a collaborative filtering model and compared the performance of the different similarity metrics. Log-likelihood ratio similarity combined with K-nearest neighbor on heterogeneous datasets showed the best performance in patient recommendation, with a precision-recall area under the curve (PRAUC) of 0.475 (string match), 0.511 (SNOMED match), and 0.752 (GARD match). Log-likelihood ratio similarity also performed best on mean average precision: 0.465 (string match), 0.5 (SNOMED match), and 0.749 (GARD match). We also evaluated rare disease prediction using the best-performing algorithm; the macro-average F-measures for string, SNOMED-CT, and GARD match were 0.32, 0.42, and 0.63, respectively. CONCLUSIONS This study demonstrated the potential of using heterogeneous datasets in a collaborative filtering model to support rare disease diagnosis. In addition to phenotype-based analysis, we plan in the future to resolve the heterogeneity issue and reduce miscommunication between EMRs and literature by mining genotypic information to establish a comprehensive disease-phenotype-gene network for rare disease diagnosis.
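
To make the collaborative filtering idea in METHODS concrete, the following is a minimal Python sketch, not the authors' implementation. It treats each patient as a set of phenotype terms, scores patient pairs with two of the named measures (Tanimoto and overlap coefficient), and ranks diagnoses from the K nearest neighbors. The function names, the similarity-weighted voting rule, and the toy identifiers (`ph1`, `disease_X`, and so on) are all assumptions introduced for illustration; the Fager & McGowan and log-likelihood ratio measures and the threshold-based neighborhood are omitted.

```python
from typing import Callable, Dict, List, Set, Tuple

Similarity = Callable[[Set[str], Set[str]], float]


def tanimoto(a: Set[str], b: Set[str]) -> float:
    """Tanimoto (Jaccard) coefficient: |A & B| / |A | B|."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def overlap(a: Set[str], b: Set[str]) -> float:
    """Overlap coefficient: |A & B| / min(|A|, |B|)."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


def nearest_neighbors(
    query: Set[str],
    patients: Dict[str, Set[str]],
    similarity: Similarity,
    k: int,
) -> List[Tuple[str, float]]:
    """Return the k patients most similar to the query, best first."""
    scored = [(pid, similarity(query, phenos)) for pid, phenos in patients.items()]
    # Sort by descending similarity; break ties by patient id for determinism.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:k]


def recommend_diagnoses(
    query: Set[str],
    patients: Dict[str, Set[str]],
    diagnoses: Dict[str, str],
    similarity: Similarity,
    k: int,
) -> List[Tuple[str, float]]:
    """Rank diagnoses by summed similarity of the k nearest patients.

    The similarity-weighted vote is an illustrative assumption, not a rule
    taken from the abstract.
    """
    scores: Dict[str, float] = {}
    for pid, sim in nearest_neighbors(query, patients, similarity, k):
        if sim <= 0.0:
            continue  # neighbors with no shared phenotype carry no evidence
        label = diagnoses[pid]
        scores[label] = scores.get(label, 0.0) + sim
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy data: phenotype sets and known diagnoses per patient.
patients = {
    "p1": {"ph1", "ph2", "ph3"},
    "p2": {"ph1", "ph2"},
    "p3": {"ph8", "ph9"},
}
diagnoses = {"p1": "disease_X", "p2": "disease_X", "p3": "disease_Y"}
query = {"ph1", "ph2", "ph4"}

print(nearest_neighbors(query, patients, tanimoto, k=2))
print(recommend_diagnoses(query, patients, diagnoses, tanimoto, k=2))
```

In this toy example, patient `p2` ranks above `p1` under the Tanimoto coefficient (2/3 versus 1/2), because `p1` carries an extra phenotype that the query lacks. Passing `overlap` instead of `tanimoto` swaps the measure without changing the neighbor search.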