Human-in-the-loop approach to identify functionally important residues of proteins from literature

Melanie Vollmar, Santosh Tirunagari, Deborah Harrus, David Armstrong, Romana Gaborova, Deepti Gupta, Marcelo Querino Lima Afonso, Genevieve Evans, Sameer Velankar
DOI: https://doi.org/10.1101/2024.03.09.583700
2024-03-13
Abstract: We present a novel system that keeps curators in the loop to develop a dataset and a model for detecting residue-level functional annotations and other protein structure features in standard publication text. Our approach integrates data from multiple resources, including PDBe, EuropePMC, PubMedCentral, and PubMed, with annotation guidelines from UniProt, and uses LitSuggest and Huggingface models as tools in the annotation process. A team of seven annotators manually curated ten articles for named entities, which we used to train a starting PubmedBert model from Huggingface. Using a human-in-the-loop annotation system, we then developed a best model that reaches a precision of 0.90, a recall of 0.92, and an F1-measure of 0.91. The system shows that machine learning techniques and human expertise can be combined successfully to curate a dataset of residue-level functional annotations and protein structure features. The results indicate potential for broader applications in protein research, connecting advanced machine learning models with the insights of domain experts.
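As a quick consistency check on the metrics reported in the abstract, the F1-measure can be recomputed as the harmonic mean of precision and recall. This is a minimal sketch, not the authors' evaluation code: the function name `f1_score` is hypothetical, and it assumes the three reported values come from the same aggregate (if they were averaged per entity type, the harmonic mean of the averages need not match the averaged F1 exactly).

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (hypothetical helper)."""
    if precision + recall == 0:
        # Avoid division by zero when the model makes no correct predictions.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values as reported in the abstract.
reported_precision = 0.90
reported_recall = 0.92

recomputed_f1 = round(f1_score(reported_precision, reported_recall), 2)
print(recomputed_f1)
```

Under that assumption, the recomputed value rounds to the 0.91 stated in the abstract.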
Bioinformatics