Building a literature knowledge base towards transparent biomedical AI
Yuanhao Huang, Zhaowei Han, Xin Luo, Xuteng Luo, Yijia Gao, Meiqi Zhao, Feitong Tang, Yiqun Wang, Jiyu Chen, Chengfan Li, Xinyu Lu, Jiahao Qiu, Feiyang Deng, Tiancheng Jiao, Dongxiao Xue, Fan Feng, Thi Hong Ha Vu, Lingxiao Guan, Jean-Philippe Cartailler, Michael Stitzel, Shuibing Chen, Marcela Brissova, Stephen Parker, Jie Liu
DOI: https://doi.org/10.1101/2024.09.22.614323
2024-09-24
Abstract: Knowledge graphs have recently emerged as a powerful data structure for organizing biomedical knowledge, with explicit representation of nodes and edges. This representation is machine-learning-ready and supports explainable AI models. However, PubMed, the largest and richest biomedical knowledge repository, exists as free text, which limits its utility for advanced machine learning tasks. To address this limitation, we present LiteralGraph, a computational framework that rigorously extracts biomedical terms and relationships from PubMed literature. Using this framework, we established the Genomic Literature Knowledge Base (GLKB), a knowledge graph that consolidates 263,714,413 biomedical terms, 14,634,427 biomedical relationships, and 10,667,370 genomic events from 33 million PubMed abstracts and nine well-established biomedical repositories. The database is coupled with RESTful APIs and a user-friendly web interface that make it accessible to researchers for a range of uses, including machine learning on the semantic knowledge in PubMed articles, reducing hallucination in large language models (LLMs), and helping experimental scientists explore their data against the vast body of PubMed evidence.
Bioinformatics