Real-world data medical knowledge graph: construction and applications
Linfeng Li,Peng Wang,Jun Yan,Yao Wang,Simin Li,Jinpeng Jiang,Zhe Sun,Buzhou Tang,Tsung-Hui Chang,Shenghui Wang,Yuting Liu
DOI: https://doi.org/10.1016/j.artmed.2020.101817
IF: 7.011
2020-03-01
Artificial Intelligence in Medicine
Abstract:<h3 class="u-h4 u-margin-m-top u-margin-xs-bottom">Objective</h3><p>Medical knowledge graphs (KGs) are attracting attention from both academia and the healthcare industry because of their power in intelligent healthcare applications. In this paper, we introduce a systematic approach to building a medical KG from electronic medical records (EMRs), evaluated through both technical experiments and end-to-end application examples.</p><h3 class="u-h4 u-margin-m-top u-margin-xs-bottom">Materials and Methods</h3><p>The original data set contains 16,217,270 de-identified clinical visit records from 3,767,198 patients. The KG construction procedure comprises 8 steps: data preparation, entity recognition, entity normalization, relation extraction, property calculation, graph cleaning, related-entity ranking, and graph embedding. We propose a novel quadruplet structure to represent medical knowledge in place of the classical KG triplet. We also propose a novel related-entity ranking function that considers probability, specificity and reliability (PSR). In addition, the probabilistic translation on hyperplanes (PrTransH) algorithm is used to learn graph embeddings for the generated KG.</p><h3 class="u-h4 u-margin-m-top u-margin-xs-bottom">Results</h3><p>A medical KG with 9 entity types, such as disease and symptom, was established; it contains 22,508 entities and 579,094 quadruplets. Compared with the term frequency-inverse document frequency (TF/IDF) method, the proposed ranking function increased the normalized discounted cumulative gain (NDCG) from 0.799 to 0.906. Embedding representations for all entities and relations were learned and shown to be effective through disease clustering.</p><h3 class="u-h4 u-margin-m-top u-margin-xs-bottom">Conclusion</h3><p>The established systematic procedure can efficiently construct a high-quality medical KG from large-scale EMRs. 
The proposed ranking function PSR achieves the best performance under all relations, and the disease clustering result validates the efficacy of the learned embedding vectors as semantic representations of entities. Moreover, the obtained KG supports many successful applications thanks to its statistics-based quadruplets.</p><p>where <span class="math"><math><msubsup><mi>N</mi><mi>co</mi><mi>min</mi></msubsup></math></span> is a minimum co-occurrence number and <em>R</em> is the basic reliability value. The reliability value measures how reliable the relationship between <em>S<sub>i</sub></em> and <em>O<sub>ij</sub></em> is. The rationale for this definition is that the higher the value of <em>N</em><sub>co</sub>(<em>S<sub>i</sub></em>, <em>O<sub>ij</sub></em>), the more reliable the relationship. However, the reliability values of two relationships should not differ greatly if both of their co-occurrence numbers are very large. In our study, we set <span class="math"><math><msubsup><mi>N</mi><mi>co</mi><mi>min</mi></msubsup></math></span> = 10 and <em>R</em> = 1 after some experiments. For instance, if the co-occurrence numbers of three relationships are 1, 100 and 10000, their reliability values are 1, 2.96 and 5, respectively.</p>
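<p>The worked example above can be checked numerically. The reliability equation itself is not reproduced in this excerpt, so the sketch below uses an assumed form, <em>R</em> + log<sub>10</sub>(max(1, <em>N</em><sub>co</sub> − <em>N</em><sub>co</sub><sup>min</sup> + 1)), chosen only because it reproduces the three quoted values; it may differ from the definition in the full paper. The function name <code>reliability</code> and its parameter names are hypothetical.</p>

```python
import math


def reliability(n_co: int, n_co_min: int = 10, r: float = 1.0) -> float:
    """Assumed reliability score (not taken verbatim from the paper).

    Counts at or below n_co_min receive the basic value r; larger counts
    grow only logarithmically, so two very large co-occurrence numbers
    end up with similar reliability values.
    """
    # Clamp at 1 so that small counts contribute log10(1) = 0.
    shifted = max(1, n_co - n_co_min + 1)
    return r + math.log10(shifted)


if __name__ == "__main__":
    # The three co-occurrence numbers quoted in the text.
    for n_co in (1, 100, 10000):
        print(n_co, round(reliability(n_co), 2))
    # prints:
    # 1 1.0
    # 100 2.96
    # 10000 5.0
```

<p>Under this assumed form, raising the co-occurrence number from 100 to 10000 adds only about 2 to the score, which matches the stated intent that very large counts should not produce very different reliability values.</p>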
engineering, biomedical,computer science, artificial intelligence,medical informatics