ABSTRACT RNA transcripts are potential therapeutic targets, yet bacterial transcripts have uncharacterized biodiversity. We developed an algorithm for transcript prediction called tp.py and used it to predict transcripts (mRNA and other RNAs) in Escherichia coli strains K12 and E2348/69 (Bacteria:gamma-Proteobacteria), Listeria monocytogenes strains Scott A and RO15 (Bacteria:Firmicute), Pseudomonas aeruginosa strains SG17M and NN2 (Bacteria:gamma-Proteobacteria), and Haloferax volcanii (Archaea:Halobacteria). From >5 million E. coli K12 and >3 million E. coli E2348/69 newly generated Oxford Nanopore Technologies direct RNA sequencing reads, 2,487 K12 mRNAs and 1,844 E2348/69 mRNAs were predicted, with the K12 mRNAs containing more than half of the predicted E. coli K12 proteins. The number of predicted transcripts varied by strain with the amount of sequence data used; across all strains examined, the predicted average size of the mRNAs was 1.6–1.7 kbp, and the median sizes of the 5′- and 3′-untranslated regions (UTRs) were 30–90 bp. Given the lack of bacterial and archaeal transcript annotation, most predictions were of novel transcripts, but we also predicted many previously characterized mRNAs and ncRNAs, including post-transcriptionally generated transcripts and small RNAs associated with pathogenesis in the E. coli E2348/69 LEE pathogenicity islands. We predicted small transcripts in the 100–200 bp range as well as >10 kbp transcripts for all strains. The longest transcript was the nuo operon transcript in two of the seven strains and a phage/prophage transcript in another two. This quick, easy, and reproducible method will facilitate the presentation of transcript and UTR predictions alongside coding sequences and protein predictions in bacterial genome annotation, providing important resources for the research community.
IMPORTANCE Our understanding of bacterial and archaeal genes and genomes is largely focused on proteins, since there have been only limited efforts to describe bacterial/archaeal RNA diversity. This contrasts with studies on the human genome, where transcripts were sequenced prior to the release of the human genome over two decades ago. We developed software for the quick, easy, and reproducible prediction of bacterial and archaeal transcripts from Oxford Nanopore Technologies direct RNA sequencing data. These predictions are urgently needed for more accurate studies examining bacterial/archaeal gene regulation, including regulation of virulence factors, and for the development of novel RNA-based therapeutics and diagnostics to combat bacterial pathogens, such as those with extreme antimicrobial resistance.
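
To make the UTR sizes reported in the abstract concrete, the following minimal Python sketch shows one way 5′- and 3′-UTR lengths could be derived from a predicted transcript's boundaries and the coding sequences it contains, and then summarized as medians. It is an illustration only and is not taken from tp.py: the function names `utr_lengths` and `summarize`, the tuple layout, and the coordinates in `EXAMPLE` are all hypothetical, and coordinates are assumed to be 1-based and inclusive with every CDS lying fully inside its transcript.

```python
from statistics import median


def utr_lengths(tx_start, tx_end, strand, cds_list):
    """Return (five_prime_utr_bp, three_prime_utr_bp) for one transcript.

    Assumptions (hypothetical layout, not tp.py's format):
    - coordinates are 1-based, inclusive, with tx_start <= tx_end
    - cds_list holds (start, end) pairs fully contained in the transcript
    """
    first_cds_start = min(start for start, _ in cds_list)
    last_cds_end = max(end for _, end in cds_list)
    left = first_cds_start - tx_start   # bases before the first CDS
    right = tx_end - last_cds_end       # bases after the last CDS
    # On the minus strand the 5' end is the right-hand side of the transcript.
    return (left, right) if strand == "+" else (right, left)


def summarize(transcripts):
    """Return (median 5' UTR, median 3' UTR) over a list of transcripts."""
    five, three = zip(*(utr_lengths(*t) for t in transcripts))
    return median(five), median(three)


# Made-up coordinates for illustration only; not data from the study.
EXAMPLE = [
    (100, 1200, "+", [(150, 1100)]),                  # monocistronic
    (1000, 5000, "-", [(1100, 2000), (2050, 4900)]),  # two-gene operon
    (7000, 7900, "+", [(7030, 7840)]),                # monocistronic
]

if __name__ == "__main__":
    print(summarize(EXAMPLE))
```

For a polycistronic transcript, this sketch measures the UTRs from the outermost coding sequences, so intergenic spacers inside an operon are not counted as UTR; whether tp.py treats them the same way is not stated in the abstract.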

Robust identification of noncoding RNA from transcriptomes requires phylogenetically-informed sampling

Versatile Interactions and Bioinformatics Analysis of Noncoding RNAs

Discovering putative peptides encoded from non-coding RNAs in ribosome profiling data of Arabidopsis thaliana.

A Common Set of Distinct Features That Characterize Noncoding RNAs Across Multiple Species

Identification and analysis of mouse non-coding RNA using transcriptome data

Deciphering transcript architectural complexity in bacteria and archaea

Transcriptomics in the RNA-seq era

De novo computational prediction of non-coding RNA genes in prokaryotic genomes.

Statistical analysis of non-coding RNA data.

Characterizing and Annotating the Genome Using RNA-seq Data

Deciphering Bacterial and Archaeal Transcriptional Dark Matter and Its Architectural Complexity

From Gigabyte to Kilobyte: A Bioinformatics Protocol for Mining Large RNA-Seq Transcriptomics Data

Utilizing Sequence Intrinsic Composition to Classify Protein-Coding and Long Non-Coding Transcripts

The discovery of novel noncoding RNAs in 50 bacterial genomes

De Novo Approach to Classify Protein-Coding and Noncoding Transcripts Based on Sequence Composition.

Systematic assessment of long-read RNA-seq methods for transcript identification and quantification

RNA-seq: from technology to biology

Characterization and Identification of Long Non-Coding RNAs Based on Feature Relationship.

Accurate detection of short and long active ORFs using Ribo-seq data

Noncoding RNAs: biology and applications—a Keystone Symposia report

Profiling Caenorhabditis Elegans Non-Coding RNA Expression with a Combined Microarray