A span-based joint model for extracting entities and relations of bacteria biotopes

Mei Zuo, Yang Zhang
DOI: https://doi.org/10.1093/bioinformatics/btab593
IF: 5.8
2021-12-22
Bioinformatics
Abstract:
Motivation: Information about bacteria biotopes (BB) is important for fundamental research and applications in microbiology. The BB task at BioNLP-OST 2019 focuses on extracting the locations and phenotypes of microorganisms from PubMed abstracts and full-text excerpts. The subtask BB-rel+ner aims to recognize the relevant entities and extract the relations among them. The corresponding corpus has some distinctive features (e.g. nested entities) that are challenging to handle. As a result, previous methods performed poorly on entity and relation extraction and limited the mutual benefit between named entity recognition and relation extraction, leaving considerable room for improvement.

Results: We propose a span-based model to jointly extract entities and relations from biomedical text about BBs. To alleviate the shortage of annotated data in this domain-specific task, we use a BERT (Bidirectional Encoder Representations from Transformers) model pre-trained on a domain-specific corpus to encode sentences. Our model treats all spans in a sentence as potential entity mentions and computes relation scores between the most confident entity spans, based on representations of the spans and of the contexts between them. Experiments on the BB-rel+ner 2019 corpus demonstrate that our model performs significantly better than the state-of-the-art method, with a 21.6% reduction in slot error rate (SER) for relation extraction. Our model is also effective at recognizing nested entities. Furthermore, the model can be applied to the CHEMPROT corpus for joint extraction of chemical-protein entities and relations, where it achieves state-of-the-art performance.

Availability and implementation: Our source code is available at https://github.com/zmmzGitHub/SpanMB_BERT.

Supplementary information: Supplementary data are available at Bioinformatics online.
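The span-based idea described in the Results can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `max_width` and `k` parameters, and the hand-written `toy_scores` table are hypothetical stand-ins for the learned, BERT-based span scorer. It only shows the general pattern of enumerating every span, keeping the most confident ones, and forming ordered span pairs as relation candidates.

```python
from itertools import permutations
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # (start, end) token indices, end inclusive


def enumerate_spans(tokens: List[str], max_width: int) -> List[Span]:
    """List every contiguous span of at most max_width tokens."""
    return [
        (start, end)
        for start in range(len(tokens))
        for end in range(start, min(start + max_width, len(tokens)))
    ]


def top_spans(spans: List[Span], score: Callable[[Span], float], k: int) -> List[Span]:
    """Keep the k highest-scoring spans (assumed pruning step)."""
    return sorted(spans, key=score, reverse=True)[:k]


def candidate_pairs(spans: List[Span]) -> List[Tuple[Span, Span]]:
    """Form ordered (head, tail) pairs of kept spans as relation candidates."""
    return list(permutations(spans, 2))


tokens = ["Listeria", "monocytogenes", "in", "soft", "cheese"]

# Hypothetical confidence scores; a real model would compute these
# from contextual span representations.
toy_scores = {(0, 1): 0.9, (3, 4): 0.8, (4, 4): 0.7}


def toy_score(span: Span) -> float:
    return toy_scores.get(span, 0.0)


all_spans = enumerate_spans(tokens, max_width=2)
kept = top_spans(all_spans, toy_score, k=3)
pairs = candidate_pairs(kept)

for start, end in kept:
    print(" ".join(tokens[start:end + 1]))
```

Because every span is scored independently, overlapping spans such as "soft cheese" and "cheese" can both be kept, which is why span enumeration suits nested entities.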