Biophysics-based protein language models for protein engineering

Sam Gelman, Bryce Johnson, Chase Freschlin, Sameer D’Costa, Anthony Gitter, Philip A. Romero
DOI: https://doi.org/10.1101/2024.03.15.585128
2024-03-17
Abstract: Protein language models trained on evolutionary data have emerged as powerful tools for predictive problems involving protein sequence, structure, and function. However, these models overlook decades of research into biophysical factors governing protein function. We propose Mutational Effect Transfer Learning (METL), a protein language model framework that unites advanced machine learning and biophysical modeling. Using the METL framework, we pretrain transformer-based neural networks on biophysical simulation data to capture fundamental relationships between protein sequence, structure, and energetics. We finetune METL on experimental sequence-function data to harness these biophysical signals and apply them when predicting protein properties like thermostability, catalytic activity, and fluorescence. METL excels in challenging protein engineering tasks like generalizing from small training sets and position extrapolation, although existing methods that train on evolutionary signals remain powerful for many types of experimental assays. We demonstrate METL’s ability to design functional green fluorescent protein variants when trained on only 64 examples, showcasing the potential of biophysics-based protein language models for protein engineering.
Bioinformatics
What problem does this paper attempt to address?
This paper proposes Mutational Effect Transfer Learning (METL), a framework that addresses a gap in existing protein language models: they are trained mainly on evolutionary data and overlook the biophysical factors that govern protein function. METL combines machine learning with biophysical modeling so that the model captures fundamental relationships between protein sequence, structure, and energetics. The authors first generate a large synthetic dataset through molecular simulation and use it to pretrain a transformer-based neural network. They then fine-tune the pretrained model on experimental sequence-function data so that it can predict properties such as thermostability, catalytic activity, and fluorescence. The paper explores two pretraining strategies: METL-Local, which targets a specific protein, and METL-Global, which covers a broad range of protein sequence space. METL performs well on challenging protein engineering tasks such as learning from small training sets and position extrapolation. Experiments show that it outperforms other methods when learning from limited data and generalizing to new data, especially when predicting the effects of amino acid mutations that are absent from the training data, although methods trained on evolutionary signals remain powerful for many types of experimental assays. As a demonstration, METL designed functional green fluorescent protein variants after being trained on only 64 examples. Overall, the paper aims to integrate biophysical knowledge into protein language models to improve their predictions and generalization in protein engineering.
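To make the idea of position extrapolation concrete, here is a minimal sketch of how a train/test split of this kind could be built: sequence positions are divided into two disjoint pools, and a variant is kept only if all of its mutations fall in one pool. This is not the paper's code. The variant notation (`A12G,S45T`), the function names `positions` and `position_split`, the `train_frac` parameter, and the choice to discard variants that mix both pools are all assumptions made for illustration.

```python
import random


def positions(variant):
    """Return the set of mutated positions in a variant string.

    Assumed (hypothetical) notation: comma-separated mutations such as
    'A12G,S45T', i.e. wild-type residue, position, new residue.
    """
    return {int(mutation[1:-1]) for mutation in variant.split(",")}


def position_split(variants, train_frac=0.8, seed=0):
    """Split variants so train and test share no mutated positions."""
    # Collect every mutated position seen in the dataset.
    all_positions = sorted(set().union(*(positions(v) for v in variants)))

    # Randomly assign positions to a train pool and a test pool.
    rng = random.Random(seed)
    rng.shuffle(all_positions)
    n_train = int(len(all_positions) * train_frac)
    train_pool = set(all_positions[:n_train])
    test_pool = set(all_positions[n_train:])

    # Keep a variant only if all of its mutations lie in a single pool.
    # Variants that mix both pools are dropped (an assumption of this sketch).
    train = [v for v in variants if positions(v) <= train_pool]
    test = [v for v in variants if positions(v) <= test_pool]
    return train, test


if __name__ == "__main__":
    demo = ["A1G", "A1G,C2D", "C2D,E3F", "E3F", "G4H", "G4H,I5K", "I5K"]
    train, test = position_split(demo, train_frac=0.6, seed=1)
    print("train:", train)
    print("test:", test)
```

A model evaluated on the `test` list must predict the effect of mutations at positions it never saw mutated during training, which is a harder setting than a random split of the same variants.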