Distant Supervision-based Relation Extraction for Literature-Related Biomedical Knowledge Graph Construction
Rui Hua, Xuezhong Zhou, Zixin Shu, Dengying Yan, Kuo Yang, Xinyan Wang, Chuang Cheng, Qiang Zhu
DOI: https://doi.org/10.2174/0122102981269053230921074451
2023-10-06
Current Chinese Science
Abstract:
Background: Relation extraction is a crucial component of knowledge graph construction. However, it often requires a significant amount of manual annotation, which is time-consuming and expensive. Distant supervision seeks to mitigate this challenge by mapping triple facts onto raw text, generating a large volume of pseudo-training data at minimal cost.
Objective: This study explores the potential of distant supervision-based relation extraction. We aim to enhance knowledge reliability and facilitate new knowledge discovery by establishing associations between the literature and knowledge drawn from specific biomedical data or existing knowledge graphs.
Method: This study presents a methodology for constructing a biomedical knowledge graph using distant supervision. By linking knowledge entities to relevant literature sources, we systematically extract and integrate information, thereby expanding and enriching the knowledge graph. We identified five types of biomedical entities (e.g., diseases, symptoms, and genes) and four kinds of relationships. These were linked to PubMed literature and divided into training and testing datasets. To mitigate data noise, the training set underwent preprocessing, while the testing set was manually curated.
Results: We associated 230,698 triples from the existing knowledge graph with relevant literature. We also identified an additional 205,148 new triples sourced directly from that literature.
Conclusion: Our study advances biomedical knowledge graph enrichment, particularly in the context of Traditional Chinese Medicine (TCM). By validating a substantial number of triples through literature associations and uncovering over 200,000 new triples, it supports the development of evidence-based medicine in TCM. The results underscore the potential of distant supervision-based relation extraction to both validate and expand knowledge bases.
Keywords: Relation extraction, knowledge graph, distant supervision, named entity recognition, literature, biomedical knowledge graph.
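The core idea of distant supervision described in the abstract, mapping known triples onto raw text to obtain pseudo-training data, can be illustrated with a minimal sketch. This is not the authors' pipeline: the `Triple` and `LabeledSentence` types, the `distant_label` function, the `has_symptom` relation name, and the toy sentences are all hypothetical, and plain substring matching stands in for real named entity recognition.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    """A knowledge graph fact (hypothetical schema for illustration)."""
    head: str
    relation: str
    tail: str


class LabeledSentence(NamedTuple):
    """A sentence paired with the relation label it inherited from a triple."""
    sentence: str
    head: str
    tail: str
    relation: str


def distant_label(triples: List[Triple], sentences: List[str]) -> List[LabeledSentence]:
    """Label every sentence that mentions both entities of a known triple.

    Assumption: case-insensitive substring matching is used in place of a
    real entity recognizer, so this is only a toy approximation.
    """
    labeled = []
    for sentence in sentences:
        lowered = sentence.lower()
        for triple in triples:
            if triple.head.lower() in lowered and triple.tail.lower() in lowered:
                labeled.append(
                    LabeledSentence(sentence, triple.head, triple.tail, triple.relation)
                )
    return labeled


# Toy inputs, invented for this example only.
triples = [Triple("asthma", "has_symptom", "wheezing")]
sentences = [
    "Patients with asthma often present with wheezing.",
    "Wheezing was absent in the control group.",
    "Asthma was ruled out despite wheezing.",
]

labeled = distant_label(triples, sentences)
for item in labeled:
    print(item.relation, "<-", item.sentence)
```

The third toy sentence mentions both entities but does not actually assert the relation, yet it still receives the label. This is the kind of noise that motivates preprocessing the training set and manually curating the testing set.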