Identification of new genes on a whole genome scale using saturated reporter transposon mutagenesis

Emily C. A. Goodall, Freya Hodges, Weine Kok, Budi Permana, Thom Cuddihy, Zihao Yang, Nicole Kahler, Kenneth Shires III, Karthik Pullela, Von Vergel L. Torres, Jessica L. Rooke, Antoine Delhaye, Jean-François Collet, Jack A. Bryant, Brian Forde, Matthew Hemm, Ian R. Henderson
DOI: https://doi.org/10.1101/2024.09.06.611592
2024-09-06
Abstract: Small or overlapping genes are prevalent across all domains of life but are often overlooked for annotation and function because of challenges in their detection. The advent of high-density mutagenesis and data-mining studies suggest the existence of further coding potential within bacterial genomes. To overcome limitations in existing protein detection methods, we applied a genetics-based approach. We combined transposon insertion sequencing with a translation reporter to identify translated open reading frames throughout the genome at scale, independent of genome annotation. We applied our method to the well characterised species Escherichia coli and identified ∼200 putative novel protein coding sequences (CDS). These are mostly short CDSs (<50 amino acids) and in some cases highly conserved. We validate the expression of selected CDSs demonstrating the utility of this approach. Despite the extensive study of E. coli, this method revealed proteins that have not been previously described, including proteins that are conserved and neighbouring functionally important genes, suggesting significant functional roles of small proteins that are still overlooked. We present this as a complementary method to whole cell proteomics and ribosome trapping for condition-dependent identification of protein CDSs. We anticipate this technique will be a starting point for future high-throughput genetics investigations to determine the existence of unannotated genes in multiple bacterial species.
Microbiology
What problem does this paper attempt to address?
This paper addresses the identification and verification of unannotated small protein-coding sequences (small proteins, or short ORFs, sORFs) in bacterial genomes. Specifically, the researchers focus on protein-coding sequences that existing annotation methods often overlook because of their short length. The main difficulties are:

1. **Challenges in Detecting Small Proteins**: Because small proteins have short sequences, bioinformatic predictions struggle to distinguish them from random open reading frames (ORFs). In addition, some small proteins may have species-specific functions, which makes them hard to detect by sequence conservation analysis.
2. **Identification of Nested Genes**: Nested genes are genes encoded within a larger gene, in either the same or the opposite orientation as the host gene. Many automated bacterial genome annotation tools ignore potential coding sequences located within larger genes, so nested genes are often missed.
3. **Limitations of Existing Methods**: High-throughput methods such as Ribo-seq identify translated mRNAs and thereby protein-coding sequences, but their data are difficult to interpret, for example when distinguishing true coding sequences from "pervasive translation". Small proteins are also hard to detect by mass spectrometry because of short-lived expression, condition dependence, low abundance, or high hydrophobicity.

To overcome these problems, the study developed a method that combines transposon insertion sequencing with a translation reporter system, aiming to identify new protein-coding sequences in Escherichia coli at large scale and independently of existing genome annotation. With this method, the researchers hope to uncover proteins that have not yet been described, especially small proteins that are highly conserved and may have important functions.
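The first difficulty, that short ORFs are hard to tell apart from random ones, can be illustrated with a small sketch. It is not part of the authors' pipeline. It scans a uniformly random DNA sequence in all six reading frames and counts ORFs that arise purely by chance. The function names `find_orfs` and `count_chance_orfs` are hypothetical, and the sketch assumes ATG as the only start codon (bacteria also use alternatives such as GTG and TTG) and equal base frequencies.

```python
import random

STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 1):
    """Yield (strand, start, end, n_codons) for each ATG...stop ORF.

    Coordinates are 0-based, end-exclusive, and relative to the strand
    being scanned. n_codons excludes the stop codon. Only the first ATG
    after the previous in-frame stop is used, so each stop codon yields
    at most one ORF (the longest one).
    """
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    n_codons = (i - start) // 3
                    if n_codons >= min_codons:
                        yield strand, start, i + 3, n_codons
                    start = None


def count_chance_orfs(length: int = 100_000, seed: int = 0,
                      short_cutoff: int = 50, min_codons: int = 10):
    """Count chance ORFs in random DNA, split at short_cutoff codons.

    Assumption: uniform base composition, which is only a rough stand-in
    for a real genome.
    """
    rng = random.Random(seed)
    seq = "".join(rng.choice("ACGT") for _ in range(length))
    short = long_ = 0
    for _, _, _, n in find_orfs(seq, min_codons=min_codons):
        if n < short_cutoff:
            short += 1
        else:
            long_ += 1
    return short, long_


if __name__ == "__main__":
    short, long_ = count_chance_orfs()
    print(f"chance ORFs with fewer than 50 codons: {short}")
    print(f"chance ORFs with 50 or more codons:    {long_}")
```

Under these assumptions, chance ORFs shorter than 50 codons should far outnumber longer ones, because a stop codon is expected roughly every 21 codons in random sequence. This is why ORF length alone is weak evidence for a small protein, and why an experimental readout of translation, such as the reporter approach summarised above, is useful.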