Development of Biomedical Corpus Enlargement Platform Using BERT for Bio-entity Recognition

Thiptanawat Phongwattana, Jonathan H. Chan
DOI: https://doi.org/10.1007/978-3-030-36708-4_37
2019-01-01
Abstract: As the volume and availability of textual data increase dramatically in the current digital age, a major challenge is how to properly extract useful information online. A key component of the text mining pipeline is named entity recognition (NER) for extracting knowledge. Many NER tools are publicly available, such as Stanford NLP, NLTK, and the spaCy Python library. However, accurately recognizing unknown entities remains a problem. We focus on using deep learning to recognize entities, as it has been shown to outperform traditional algorithms on big data, in part because of its ability to extract features and handle multi-dimensional data. In this paper, we applied the state-of-the-art language representation model BERT (Bidirectional Encoder Representations from Transformers) to NER classification in order to enlarge the existing biomedical corpus for further machine learning processing. We used additional biomedical corpora for training and then compared the results to a recent prior work. The result is an improvement of 2.24% in precision, 3.55% in recall, and 2.98% in F1-score for protein recognition in the super-pathway of leucine, valine, and isoleucine biosynthesis. We also developed a prototype, in the form of an internal web platform, to support bio-annotators and corpus enlargement.
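The precision, recall, and F1-score reported in the abstract are standard NER metrics. The abstract does not say whether they were computed per token or per entity, so the following is only a minimal sketch that assumes entity-level exact matching over BIO tags. The `PROTEIN` label, the function names `extract_spans` and `prf1`, and the example sequences are hypothetical and are not taken from the paper.

```python
from typing import List, Set, Tuple

# (start index, end index exclusive, entity type); illustrative type alias
Span = Tuple[int, int, str]


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence."""
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing span
        # The current span continues only on an I- tag of the same type.
        inside = tag.startswith("I-") and label == tag[2:]
        if start is not None and not inside:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def prf1(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Entity-level precision, recall, and F1 under exact span matching."""
    gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
    tp = len(gold_spans & pred_spans)  # spans with identical boundaries and type
    precision = tp / len(pred_spans) if pred_spans else 0.0
    recall = tp / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical gold and predicted tags for one five-token sentence.
gold = ["B-PROTEIN", "I-PROTEIN", "O", "B-PROTEIN", "O"]
pred = ["B-PROTEIN", "I-PROTEIN", "O", "O", "B-PROTEIN"]
print(prf1(gold, pred))
```

In this toy case one of the two predicted spans matches one of the two gold spans exactly, so all three scores are 0.5. A token-level evaluation would give different numbers on the same data, which is why the matching convention matters when comparing against prior work.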