Decoding Stability and Epistasis in Human Myoglobin by Deep Mutational Scanning and Codon-level Machine Learning

Christoph Küng, Olena Protsenko, Rosario Vanella, Michael A. Nash
DOI: https://doi.org/10.1101/2024.02.24.581358
2024-03-06
Abstract: Understanding the linkage between protein sequence and phenotypic expression level is crucial in biotechnology. Machine learning algorithms trained with deep mutational scanning (DMS) data have significant potential to improve this understanding and accelerate protein engineering campaigns. However, most machine learning (ML) approaches in this domain do not directly address effects of synonymous codons or positional epistasis on predicted expression levels. Here we used yeast surface display, deep mutational scanning, and next-generation DNA sequencing to quantify the expression fitness landscape of human myoglobin and train ML models to predict epistasis of double codon mutants. When fed with near comprehensive single mutant DMS data, our algorithm computed expression fitness values for double codon mutants using ML-predicted epistasis as an intermediate parameter. We next deployed this predictive model to screen > 3·10 unseen double codon mutants and experimentally tested highly ranked candidate sequences, finding 14 of 16 with significantly enhanced expression levels. Our experimental DMS dataset combined with codon level epistasis-based ML constitutes an effective method for bootstrapping fitness predictions of high order mutational variants using experimental data from variants of lower order.
Bioengineering
What problem does this paper attempt to address?
This paper addresses how to quantify the expression fitness landscape of human myoglobin, including stability and epistatic effects, using deep mutational scanning and codon-level machine learning. It uses ML-predicted epistasis to compute the expression fitness of double codon mutants from near-comprehensive single-mutant data, with the aim of bootstrapping fitness predictions for higher-order variants in protein engineering. The study combines yeast surface display, next-generation DNA sequencing, and computational modeling to clarify the relationship between protein sequence and expression level.
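The abstract describes computing double-mutant fitness from single-mutant DMS data with ML-predicted epistasis as an intermediate parameter. The sketch below illustrates that idea under a common additive convention, in which epistasis is the deviation of the double-mutant fitness from the sum of the two single-mutant fitness values. This convention, the function name `predict_double_fitness`, the `(position, codon)` keys, and the toy numbers are all assumptions made for illustration; the paper's actual model and fitness scale may differ.

```python
from typing import Callable, Dict, Tuple

# A codon mutation is identified here by (residue position, new codon).
# This representation is hypothetical, chosen only for the example.
CodonMutation = Tuple[int, str]


def predict_double_fitness(
    single_fitness: Dict[CodonMutation, float],
    mut_a: CodonMutation,
    mut_b: CodonMutation,
    predict_epistasis: Callable[[CodonMutation, CodonMutation], float],
) -> float:
    """Estimate double-mutant fitness as additive expectation plus epistasis.

    Assumes fitness values are on a scale where single-mutant effects add
    (for example, log-enrichment relative to wild type).
    """
    additive = single_fitness[mut_a] + single_fitness[mut_b]
    return additive + predict_epistasis(mut_a, mut_b)


# Toy single-mutant fitness values (made up, not from the paper).
toy_singles = {(10, "GCT"): 0.20, (45, "AAA"): -0.10}

# Stand-in for a trained epistasis model: returns a constant.
def toy_epistasis_model(a: CodonMutation, b: CodonMutation) -> float:
    return 0.05


toy_prediction = predict_double_fitness(
    toy_singles, (10, "GCT"), (45, "AAA"), toy_epistasis_model
)
```

In a screening setting, a function like this would be applied to every candidate pair and the results sorted to select highly ranked double mutants for experimental testing.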