Automatic Labeling for Gene-Disease Associations through Distant Supervision

Fei Teng, Meng Bai, Tianrui Li
DOI: https://doi.org/10.1109/ISKE47853.2019.9170268
2019-01-01
Abstract: Associating genes with diseases is a fundamental challenge in human health, with applications in understanding disease properties and developing precision medicine. Over the past decades, the number of biomedical articles has grown explosively, and these articles contain a great number of gene-disease associations (GDAs). Association extraction requires a highly accurate annotated corpus, but manual labeling is time-consuming and labor-intensive. This paper proposes a distant supervision-based method to automatically label corpora for GDA extraction. Compared with the manually annotated gold corpus, the automatically labeled corpus is much larger in scale and of better quality. It improves the performance of state-of-the-art extraction models, which reach an AUC of 0.96 and an F1 of 90%. To the best of our knowledge, this is the first study of automatically labeling GDAs in the field of precision medicine. Using new corpora from 115,261 PubMed abstracts about 29 lung cancers, we extracted GDAs and discovered 296 new genes/proteins related to lung cancers. These findings indicate new directions for drug design.