BacTermFinder: A Comprehensive and General Bacterial Terminator Finder using a CNN Ensemble

Seyed Mohammad Amin Taheri Ghahfarokhi, Lourdes Pena-Castillo
DOI: https://doi.org/10.1101/2024.07.05.602086
2024-07-08
Abstract: A terminator is a DNA region that ends the transcription process. Currently, multiple computational tools are available for predicting bacterial terminators. However, these methods are specialized for certain bacteria or terminator types (i.e., intrinsic or factor-dependent). In this work, we developed BacTermFinder using an ensemble of Convolutional Neural Networks (CNNs) receiving as input four different representations of terminator sequences. To develop BacTermFinder, we collected roughly 41k bacterial terminators (intrinsic and factor-dependent) of 22 species with varying GC-content (from 28% to 71%) from published studies that used RNA-seq technologies. We evaluated BacTermFinder's performance on terminators of five bacterial species (not used for training BacTermFinder) and two archaeal species. BacTermFinder's performance was compared with that of four other bacterial terminator prediction tools. Based on our results, BacTermFinder outperforms all four other approaches in terms of average recall without increasing the number of false positives. Moreover, BacTermFinder identifies both types of terminators (intrinsic and factor-dependent) and generalizes to archaeal terminators. Additionally, we visualized the saliency map of the CNNs to gain insights into terminator motifs per species. BacTermFinder is publicly available at https://github.com/BioinformaticsLabAtMUN/BacTermFinder
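The abstract describes an ensemble of CNNs, each receiving a different representation of the same sequence, but it does not say how the individual outputs are combined. The following is a minimal sketch assuming simple averaging of per-model probabilities (soft voting) with a fixed decision threshold. The function names, the example scores, and the 0.5 threshold are hypothetical and are not taken from BacTermFinder itself.

```python
from statistics import mean
from typing import Sequence


def ensemble_probability(model_probs: Sequence[float]) -> float:
    """Average the terminator probabilities reported by several models.

    Assumption: soft voting (plain mean). The actual tool may combine
    its CNN outputs differently.
    """
    if not model_probs:
        raise ValueError("need at least one model probability")
    return mean(model_probs)


def is_terminator(model_probs: Sequence[float], threshold: float = 0.5) -> bool:
    """Call a sequence a terminator if the averaged probability reaches the threshold."""
    return ensemble_probability(model_probs) >= threshold


# Hypothetical scores from four CNNs, one per sequence representation.
scores = [0.9, 0.7, 0.4, 0.8]
print(ensemble_probability(scores))
print(is_terminator(scores))
```

Averaging lets a model that is unsure on one representation be outvoted by the others, which is one common motivation for ensembling.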
Bioinformatics
What problem does this paper attempt to address?
This paper presents BacTermFinder, a general-purpose tool for predicting bacterial terminators, the DNA regions that end transcription. Existing prediction tools are specialized for certain bacteria or for one terminator type (intrinsic or factor-dependent). BacTermFinder instead uses an ensemble of convolutional neural networks (CNNs) that receive four different representations of terminator sequences as input. To develop it, the authors collected roughly 41,000 intrinsic and factor-dependent terminators from 22 bacterial species with GC content ranging from 28% to 71%, drawn from published RNA-seq studies. They evaluated it on terminators from five bacterial species not used for training and from two archaeal species, and compared it with four other bacterial terminator prediction tools. BacTermFinder achieves higher average recall than all four without increasing the number of false positives, identifies both terminator types, and generalizes to archaeal terminators. The authors also visualize saliency maps of the CNNs to gain insight into terminator motifs per species. BacTermFinder is publicly available on GitHub.
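The comparison rests on two quantities: recall (the fraction of known terminators that a tool finds) and the number of false positives (non-terminators that it flags). A minimal sketch of both, assuming binary labels where 1 means terminator and 0 means non-terminator, is given here. The function names and example labels are illustrative only and do not come from the paper's evaluation.

```python
from typing import Sequence


def recall(true_labels: Sequence[int], predicted_labels: Sequence[int]) -> float:
    """Recall = TP / (TP + FN), computed over binary labels (1 = terminator)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    if tp + fn == 0:
        return 0.0  # no true terminators in the data; recall is undefined, report 0
    return tp / (tp + fn)


def false_positives(true_labels: Sequence[int], predicted_labels: Sequence[int]) -> int:
    """Count sequences predicted as terminators that are not terminators."""
    return sum(1 for t, p in zip(true_labels, predicted_labels) if t == 0 and p == 1)


# Illustrative labels: four true terminators, two non-terminators.
truth = [1, 1, 1, 0, 0, 1]
predicted = [1, 0, 1, 1, 0, 1]
print(recall(truth, predicted))           # 0.75
print(false_positives(truth, predicted))  # 1
```

A tool can raise recall trivially by flagging more sequences, so reporting false positives alongside recall is what makes the claim above meaningful.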