Inverse folding based pre-training for the reliable identification of intrinsic transcription terminators

Vivian B. Brandenburg, Franz Narberhaus, Axel Mosig
DOI: https://doi.org/10.1371/journal.pcbi.1010240
2022-07-08
PLoS Computational Biology
Abstract: It is well established that neural networks can predict or identify structural motifs of non-coding RNAs (ncRNAs). Yet, the neural-network-based identification of RNA structural motifs is limited by the availability of training data, which are often insufficient for learning features of specific ncRNA families or structural motifs. Aiming to reliably identify intrinsic transcription terminators in bacteria, we introduce a novel pre-training approach that uses inverse folding to generate training data for predicting or identifying a specific family or structural motif of ncRNA. We assess the ability of neural networks to identify secondary structure by systematic in silico mutagenesis experiments. In a study to identify intrinsic transcription terminators as functionally well-understood RNA structural motifs, our inverse-folding-based pre-training approach significantly boosts the performance of neural network topologies, which outperform previous approaches to identify intrinsic transcription terminators. Inverse-folding-based pre-training provides a simple yet highly effective way to integrate the well-established thermodynamic energy model into deep neural networks for identifying ncRNA families or motifs. The pre-training technique is broadly applicable to a range of network topologies as well as different types of ncRNA families and motifs.
Intrinsic transcriptional terminators are essential regulators in determining the 3' end of transcripts in bacteria. The underlying mechanism involves RNA secondary structure, where nucleotides fold into a specific hairpin motif. Identifying terminator sequences in bacterial genomes has conventionally been approached with well-established energy models for structural motifs. However, the folding mechanism of transcription terminators is understood only partially, limiting the success of energy-model-based identification. Neural networks have been proposed to overcome these limitations. However, their adoption for predicting and identifying RNA secondary structure has been a double-edged sword: neural networks promise to learn features that are not represented by the energy models, but they are black boxes that lack explicit modeling assumptions and may fail to account for features that are readily understood on the basis of decades-old energy models. Here, we introduce a pre-training approach for neural networks that uses energy-model-based inverse folding of structural motifs. As we demonstrate, this approach "brings back the energy model" to identify transcriptional terminators and overcomes the limitations of previous energy-model-based predictions. Our approach works for diverse types of neural networks and is suitable for the identification of structural motifs of many other RNA molecules beyond transcriptional terminators.
biochemical research methods, mathematical & computational biology