Accelerated variant curation from scientific literature using biomedical text mining

Rishab Mallick, Valerio Arnaboldi, Paul Davis, Stavros Diamantakis, Magdalena Zarowiecki, Kevin Howe
DOI: https://doi.org/10.17912/micropub.biology.000578
2022-06-01
Abstract: Biological databases collect and standardize data through biocuration. Although major model organism databases have adopted some automated curation methods, a large portion of biocuration is still performed manually. To speed up the extraction of the genomic positions of variants, we have developed a hybrid approach that combines regular expressions, Named Entity Recognition based on BERT (Bidirectional Encoder Representations from Transformers), and bag-of-words to extract variant genomic locations from C. elegans papers for WormBase. Our model has a precision of 82.59% for the gene-mutation matches tested on text extracted from 100 papers, and even recovers some data not discovered during manual curation. Code is available at https://github.com/WormBase/genomic-info-from-papers.
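To make the regular-expression part of the hybrid approach concrete, here is a minimal sketch of how gene and mutation mentions might be paired within a sentence. It is an illustration only: the patterns, the names `GENE_RE`, `MUTATION_RE`, and `extract_gene_mutation_pairs`, and the sentence-level co-occurrence rule are all assumptions, not code or rules taken from the WormBase repository. The BERT-based NER and bag-of-words components are not shown.

```python
import re

# Hypothetical patterns for illustration; not the authors' actual expressions.
# Simplified C. elegans gene name: 3-4 lowercase letters, hyphen, number (e.g. "daf-2").
GENE_RE = re.compile(r"\b[a-z]{3,4}-\d+\b")

# Protein-level substitution: reference residue, position, alternate residue,
# in one-letter ("G56E") or three-letter ("Gly56Glu") form; "*" marks a stop.
AA1 = "ACDEFGHIKLMNPQRSTVWY"
AA3 = ("Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|Leu|Lys|Met|"
       "Phe|Pro|Ser|Thr|Trp|Tyr|Val")
MUTATION_RE = re.compile(
    rf"\b(?:{AA3}|[{AA1}])[1-9]\d*(?:{AA3}|[{AA1}]|\*)(?![A-Za-z0-9])"
)


def extract_gene_mutation_pairs(sentence: str) -> list[tuple[str, str]]:
    """Pair every gene mention with every mutation mention in one sentence.

    Assumption: co-occurrence in the same sentence is treated as a candidate
    gene-mutation match, which a curator would still need to verify.
    """
    genes = GENE_RE.findall(sentence)
    mutations = MUTATION_RE.findall(sentence)
    return [(gene, mutation) for gene in genes for mutation in mutations]


if __name__ == "__main__":
    example = "The daf-2 allele carries a G56E substitution."
    print(extract_gene_mutation_pairs(example))
```

A pure pattern-matching pass like this is fast but brittle, which is presumably why the paper combines it with learned entity recognition; the sketch covers only simple substitution notation.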