Fine-tuning BERT models to extract transcriptional regulatory interactions of bacteria from biomedical literature

Alfredo Varela-Vega, Ali-Berenice Posada-Reyes, Carlos-Francisco Méndez-Cruz
DOI: https://doi.org/10.1101/2024.02.19.581094
2024-02-22
Abstract: Curation of biomedical literature has been the traditional approach to extracting relevant biological knowledge; however, it is time-consuming and demanding. Recently, large language models (LLMs) based on pre-trained transformers have addressed biomedical relation extraction tasks, outperforming classical machine learning approaches. Nevertheless, LLMs have not been used to extract transcriptional regulatory interactions between transcription factors and regulated elements (genes or operons) of bacteria, a first step in reconstructing a transcriptional regulatory network (TRN). These networks are incomplete or missing for many bacteria. We compared six state-of-the-art BERT architectures (BERT, BioBERT, BioLinkBERT, BioMegatron, BioRoBERTa, LUKE) for extracting this type of regulatory interaction. We fine-tuned 72 models to classify sentences into four categories. A dataset of 1562 sentences manually curated from the literature was used. The best model, based on the LUKE architecture, performed well on the evaluation dataset (Precision: 0.8601, Recall: 0.8788, macro F1-score: 0.8685, MCC: 0.8163). An examination of the model's predictions revealed that it had learned different ways of expressing the regulatory effect. The model was applied to reconstruct a TRN of Salmonella Typhimurium using 264 complete articles. We were able to accurately reconstruct 82% of the network. A network analysis confirmed that the transcription factor PhoP regulates many genes (it has the highest degree in the network), some of them responsible for antimicrobial resistance. Our work is a starting point for addressing the limitations of curating regulatory interactions, especially for the reconstruction of TRNs of bacteria or diseases of biological interest.
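The metrics reported in the abstract (precision, recall, macro F1-score, and MCC) can be illustrated for a multi-class sentence classifier with a minimal sketch. This is not the authors' evaluation code: the function names, the placeholder labels "A" and "B", and the toy predictions are assumptions made only for illustration, and the macro F1-score is taken here as the mean of per-class F1 scores, which may differ from the exact convention used in the paper.

```python
from collections import Counter
from math import sqrt


def macro_scores(y_true, y_pred):
    """Return macro-averaged (precision, recall, F1) over all observed labels."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = Counter(zip(y_true, y_pred))   # (true, predicted) pair counts
    true_n = Counter(y_true)               # how often each label truly occurs
    pred_n = Counter(y_pred)               # how often each label is predicted
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = pairs[(label, label)]
        p = tp / pred_n[label] if pred_n[label] else 0.0
        r = tp / true_n[label] if true_n[label] else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


def multiclass_mcc(y_true, y_pred):
    """Matthews correlation coefficient generalized to any number of classes."""
    s = len(y_true)                                     # total samples
    c = sum(t == p for t, p in zip(y_true, y_pred))     # correctly classified
    true_n = Counter(y_true)
    pred_n = Counter(y_pred)
    labels = set(y_true) | set(y_pred)
    cov = c * s - sum(true_n[k] * pred_n[k] for k in labels)
    denom = sqrt(
        (s * s - sum(v * v for v in pred_n.values()))
        * (s * s - sum(v * v for v in true_n.values()))
    )
    return cov / denom if denom else 0.0


# Toy example with hypothetical placeholder labels.
y_true = ["A", "A", "B", "B"]
y_pred = ["A", "B", "B", "B"]
precision, recall, f1 = macro_scores(y_true, y_pred)
mcc = multiclass_mcc(y_true, y_pred)
print(f"P={precision:.4f} R={recall:.4f} F1={f1:.4f} MCC={mcc:.4f}")
```

Macro averaging weights every category equally regardless of how many sentences it contains, and MCC accounts for all cells of the confusion matrix, which is presumably why both are informative when categories are imbalanced.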
Molecular Biology