Abstract: Extracting biologically meaningful information from the continuing flood of genomic data is a major challenge in the life sciences. Codon usage bias (CUB) is a general feature of most genomes and is thought to reflect the effects of both natural selection for efficient translation and mutation bias. Here we present a mechanistically interpretable Bayesian model (ribosome overhead costs Stochastic Evolutionary Model of Protein Production Rate [ROC SEMPPR]) to extract meaningful information from patterns of CUB within a genome. ROC SEMPPR is grounded in population genetics and allows us to separate the contributions of mutational biases and natural selection against translational inefficiency on a gene-by-gene and codon-by-codon basis. Until now, the primary disadvantage of similar approaches was the need for genome-scale measurements of gene expression. Here, we demonstrate that it is possible to extract accurate estimates of codon-specific mutation biases and translational efficiencies while simultaneously generating accurate estimates of gene expression, rather than requiring such information as input. We demonstrate the utility of ROC SEMPPR using the Saccharomyces cerevisiae S288c genome. When we compare our model fits with previous approaches, we observe exceptionally high agreement between estimates of both codon-specific parameters and gene expression levels ([Formula: see text] in all cases). We also observe strong agreement between our parameter estimates and those derived from alternative data sets. For example, our estimates of mutation bias and those from mutational accumulation experiments are highly correlated ([Formula: see text]).
Our estimates of codon-specific translational inefficiencies are also highly correlated with tRNA copy number-based estimates of ribosome pausing time ([Formula: see text]), and our estimates of gene expression are highly correlated with mRNA and ribosome profiling footprint-based estimates ([Formula: see text]), thus supporting the hypothesis that selection against translational inefficiency is an important force driving the evolution of CUB. Surprisingly, we find that for particular amino acids, codon usage in highly expressed genes can still be largely driven by mutation bias, and that failing to take mutation bias into account can lead to the misidentification of an amino acid's "optimal" codon. In conclusion, our method demonstrates that an enormous amount of biologically important information is encoded within genome-scale patterns of codon usage. Accessing this information does not require gene expression measurements; it requires carefully formulated, biologically interpretable models.
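To make the interplay between mutation bias and selection concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a multinomial-logit form for the probability of each synonymous codon, with a codon-specific mutation term and a selection term that scales with the gene's protein production rate. The abstract does not state this functional form, and the names `delta_m`, `delta_eta`, `phi`, and all numeric values are hypothetical illustrations.

```python
import math


def codon_probabilities(delta_m, delta_eta, phi):
    """Return the probability of each synonymous codon for one amino acid.

    Assumed form (not given in the abstract): p_i is proportional to
    exp(-delta_m[i] - delta_eta[i] * phi), where
      delta_m[i]   = mutation-bias term of codon i relative to a reference codon,
      delta_eta[i] = translational inefficiency of codon i relative to the reference,
      phi          = protein production rate of the gene.
    The reference codon has delta_m = delta_eta = 0.
    """
    weights = [math.exp(-m - e * phi) for m, e in zip(delta_m, delta_eta)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical two-codon amino acid: codon B is favored by mutation
# (negative delta_m) but is translated less efficiently (positive delta_eta).
delta_m = [0.0, -1.0]
delta_eta = [0.0, 0.2]

for phi in (0.1, 3.0, 10.0):
    p_ref, p_b = codon_probabilities(delta_m, delta_eta, phi)
    print(f"phi={phi:>4}: P(reference)={p_ref:.3f}  P(codon B)={p_b:.3f}")
```

Under these assumed values, the mutationally favored codon B remains the more common codon even at a moderately high production rate, and the selectively favored reference codon only dominates once `phi` is large. This mirrors the abstract's point that ignoring mutation bias can lead to misidentifying the "optimal" codon from usage in highly expressed genes.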

Conserved Codon Composition of Ribosomal Protein Coding Genes in Escherichia coli, Mycobacterium tuberculosis and Saccharomyces cerevisiae: Lessons from Supervised Machine Learning in Functional Genomics

Prediction of Functional Class of Proteins and Peptides Irrespective of Sequence Homology by Support Vector Machines.

Synonymous codon usage defines functional gene families

Predicting synonymous codon usage and optimizing the heterologous gene for expression in E. coli

Re-Annotation of Protein-Coding Genes in the Genome of Saccharomyces cerevisiae Based on Support Vector Machines

The Relation Between Codon Usage, Base Correlation and Gene Expression Level in Escherichia coli and Yeast.

Predicting variable gene content in Escherichia coli using conserved genes

Deep Learning Prediction of Ribosome Profiling with Translatomer Reveals Translational Regulation and Interprets Disease Variants

Estimating Gene Expression and Codon-Specific Translational Efficiencies, Mutation Biases, and Selection Coefficients from Genomic Data Alone

Support Vector Machine for Classification of Meiotic Recombination Hotspots and Coldspots in Saccharomyces cerevisiae Based on Codon Composition

Intragenic spatial patterns of codon usage bias in prokaryotic and eukaryotic genomes

Transcriptome-wide meta-analysis of codon usage in Escherichia coli

Using a Euclid distance discriminant method to find protein coding genes in the yeast genome.

Predicting gene sequences with AI to study codon usage patterns

Selection on codon bias in yeast: a transcriptional hypothesis

Bridging the gap between transcriptome and proteome measurements identifies post-translationally regulated genes

Interpreting protein abundance in Saccharomyces cerevisiae through relational learning

The Relationship Between Synonymous Codon Usage and Protein Structure in Escherichia coli and Homo sapiens

Kingdom-Wide Analysis of Fungal Protein-Coding and tRNA Genes Reveals Conserved Patterns of Adaptive Evolution

Predicting Gene Expression Level from Relative Codon Usage Bias: An Application to Escherichia coli Genome

Prediction of Functional Class of Novel Bacterial Proteins Without the Use of Sequence Similarity by a Statistical Learning Method