Learning protein fitness landscapes with deep mutational scanning data from multiple sources

Lin Chen, Zehong Zhang, Zhenghao Li, Rui Li, Ruifeng Huo, Lifan Chen, Dingyan Wang, Xiaomin Luo, Kaixian Chen, Cangsong Liao, Mingyue Zheng
DOI: https://doi.org/10.1016/j.cels.2023.07.003
IF: 11.091
2023-08-18
Cell Systems
Abstract: One of the key points of machine learning-assisted directed evolution (MLDE) is the accurate learning of the fitness landscape, a conceptual mapping from sequence variants to the desired function. Here, we describe a multi-protein training scheme that leverages the existing deep mutational scanning data from diverse proteins to aid in understanding the fitness landscape of a new protein. Proof-of-concept trials are designed to validate this training scheme in three aspects: random and positional extrapolation for single-variant effects, zero-shot fitness predictions for new proteins, and extrapolation for higher-order variant effects from single-variant effects. Moreover, our study identified previously overlooked strong baselines, and their unexpectedly good performance brings our attention to the pitfalls of MLDE. Overall, these results may improve our understanding of the association between different protein fitness profiles and shed light on developing better machine learning-assisted approaches to the directed evolution of proteins. A record of this paper's transparent peer review process is included in the supplemental information.
cell biology, biochemistry & molecular biology