LORE: A Literature Semantics Framework for Evidenced Disease-Gene Pathogenicity Prediction at Scale

Peng-Hsuan Li, Yih-Yun Sun, Hsueh-Fen Juan, Chien-Yu Chen, Huai-Kuang Tsai, Jia-Hsin Huang
DOI: https://doi.org/10.1101/2024.08.10.24311801
2024-08-11
Abstract: Effective utilization of academic literature is crucial for Machine Reading Comprehension to generate actionable scientific knowledge for a wide range of real-world applications. Recently, Large Language Models (LLMs) have emerged as a powerful tool for distilling knowledge from scientific articles, but they struggle with reliability and verifiability. Here, we propose LORE, a novel unsupervised two-stage reading methodology with an LLM that models literature as a knowledge graph of verifiable factual statements and, in turn, as semantic embeddings in Euclidean space. Applied to PubMed abstracts for large-scale understanding of disease-gene relationships, LORE captures essential information about gene pathogenicity. Furthermore, we demonstrate that modeling a latent pathogenic flow in the semantic embedding with supervision from the ClinVar database leads to a 90% mean average precision in identifying relevant genes across 2,097 diseases. Finally, we have created a disease-gene relation knowledge graph with predicted pathogenicity scores that is 200 times larger than the ClinVar database.
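For readers unfamiliar with the headline metric, the following is a minimal sketch of how mean average precision (MAP) over per-disease gene rankings is commonly computed. It is not the authors' evaluation code: the function names, the gene and disease identifiers, and the choice to normalize by the total number of relevant genes are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def average_precision(ranked_genes: List[str], relevant: Set[str]) -> float:
    """Average precision of one ranked gene list (hypothetical helper).

    Precision is accumulated at each rank where a relevant gene appears,
    then divided by the total number of relevant genes (an assumption;
    some definitions divide by the number of relevant genes retrieved).
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, gene in enumerate(ranked_genes, start=1):
        if gene in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(
    rankings: Dict[str, List[str]], labels: Dict[str, Set[str]]
) -> float:
    """Mean of per-disease average precision; diseases without labels are skipped."""
    scores = [
        average_precision(rankings[disease], labels[disease])
        for disease in rankings
        if labels.get(disease)
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Toy, made-up data: two diseases with placeholder gene identifiers.
toy_rankings = {
    "disease_1": ["GENE_A", "GENE_B", "GENE_C"],
    "disease_2": ["GENE_D", "GENE_E"],
}
toy_labels = {
    "disease_1": {"GENE_A", "GENE_C"},
    "disease_2": {"GENE_E"},
}
toy_map = mean_average_precision(toy_rankings, toy_labels)
```

In this toy case the per-disease values are (1/1 + 2/3)/2 and (1/2)/1, so a ranking that places known pathogenic genes near the top scores close to 1.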
Health Informatics