Evolutionary context-integrated deep sequence modeling for protein engineering

Yunan Luo, Lam Vo, Hantian Ding, Yufeng Su, Yang Liu, Wesley Wei Qian, Huimin Zhao, Jian Peng
DOI: https://doi.org/10.1101/2020.01.16.908509
2020-01-17
Abstract: Protein engineering seeks to design proteins with improved or novel functions. Compared to rational design and directed evolution approaches, machine learning-guided approaches traverse the fitness landscape more effectively and hold the promise of accelerating engineering and reducing experimental cost and effort. A critical challenge is whether we can predict the function or fitness of unseen protein variants. By learning from the sequences and large-scale screening data of characterized variants, machine learning models predict the functional fitness of sequences and prioritize new variants that are very likely to demonstrate enhanced functional properties, thereby guiding and accelerating rational design and directed evolution. While existing generative models and language models have been developed to predict the effects of mutations and assist protein engineering, their accuracy is limited because they are unsupervised and the general sequence contexts they capture are not specific to the protein being engineered. In this work, we propose ECNet, a deep-learning algorithm that exploits evolutionary contexts to predict functional fitness for protein engineering. Our method integrates a local evolutionary context from homologous sequences, which explicitly models residue-residue epistasis for the protein of interest, with a global evolutionary context that encodes rich semantic and structural features from the enormous protein sequence universe. This biologically motivated sequence modeling approach enables accurate mapping from sequence to function and generalizes from low-order mutants to higher-order ones. Through extensive benchmark experiments, we show that our method outperforms existing methods on ∼50 deep mutagenesis scanning and random mutagenesis datasets, demonstrating its potential for guiding and expediting protein engineering.