GrantExtractor: A Winning System for Extracting Grant Support Information from Biomedical Literature

Suyang Dai, Zihan Zhang, Wenxuan Zuo, Xiaodi Huang, Shanfeng Zhu
DOI: https://doi.org/10.1109/bibm.2018.8621579
2018-01-01
Abstract: Grant support (GS), an important field in the MEDLINE database, records the funding agencies and contract numbers associated with an article. For funding organizations, GS plays a crucial role in tracking funding outcomes. In this paper, we present a pipeline system called GrantExtractor that automatically extracts funding information from biomedical literature. GrantExtractor is a novel solution to the practical problem of GS information extraction, which involves both named entity recognition and relation extraction. Our approach integrates several modern machine learning techniques. First, a sentence classifier identifies the funding sentences in an article. Grant numbers and agencies are then extracted from these sentences by a bi-directional LSTM with a CRF layer (BiLSTM-CRF), together with pattern matching. After a multi-class model removes noisy numbers, each remaining grant number is matched with its corresponding agency. Experimental results on benchmark datasets show that GrantExtractor clearly outperforms all baseline methods. In addition, GrantExtractor won first place in Task 5C of the 2017 BioASQ challenge, achieving a Micro-recall of 0.9526 on 22,610 articles. This is about 33% higher in relative terms than 0.7174, the highest baseline score, obtained by the "BioASQ Filtering" baseline provided by the National Library of Medicine (NLM). Moreover, GrantExtractor achieves a Micro F-measure as high as 0.90 on the task of extracting grant pairs.
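The pipeline stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the trained sentence classifier is replaced by a keyword check, the BiLSTM-CRF agency tagger by a small lookup list, and the agency matching step by a nearest-preceding-agency heuristic. The noisy-number filtering stage is omitted. The names `FUNDING_CUES`, `AGENCIES`, `GRANT_RE`, `extract_pairs`, and `micro_recall` are all hypothetical, and the grant-number pattern covers only a couple of illustrative formats.

```python
import re

# Assumption: keyword cues stand in for the paper's trained sentence classifier.
FUNDING_CUES = ("supported by", "funded by", "grant", "contract")

# Assumption: a tiny lexicon stands in for BiLSTM-CRF agency recognition.
AGENCIES = ("National Institutes of Health", "National Science Foundation")

# Assumption: an illustrative pattern, e.g. "R01 GM123456" or "DBI-1234567".
GRANT_RE = re.compile(r"\b(?:[A-Z]\d{2}\s?)?[A-Z]{2,3}[- ]?\d{5,7}\b")


def is_funding_sentence(sentence):
    """Return True if the sentence contains any funding cue."""
    lowered = sentence.lower()
    return any(cue in lowered for cue in FUNDING_CUES)


def extract_pairs(text):
    """Return (agency, grant_number) pairs found in funding sentences."""
    pairs = []
    # Naive sentence split on terminal punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        if not is_funding_sentence(sentence):
            continue
        # Agency mentions sorted by character offset within the sentence.
        agencies = sorted(
            (m.start(), name)
            for name in AGENCIES
            for m in re.finditer(re.escape(name), sentence)
        )
        for match in GRANT_RE.finditer(sentence):
            # Heuristic: pair each number with the closest agency before it.
            preceding = [name for pos, name in agencies if pos < match.start()]
            if preceding:
                pairs.append((preceding[-1], match.group()))
    return pairs


def micro_recall(predicted, gold):
    """Micro-recall: correct items pooled over all articles / gold items."""
    true_pos = sum(len(set(p) & set(g)) for p, g in zip(predicted, gold))
    total = sum(len(set(g)) for g in gold)
    return true_pos / total if total else 0.0


text = (
    "We thank the reviewers. This work was supported by the National "
    "Institutes of Health (R01 GM123456) and the National Science "
    "Foundation (DBI-1234567)."
)
pairs = extract_pairs(text)
```

Micro-averaging pools counts over all articles before dividing, so articles with many grants weigh more than articles with few. This is the standard definition of a micro-averaged score. The abstract does not spell out the exact evaluation procedure used in the challenge.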