Improving Inverse Folding models at Protein Stability Prediction without additional Training or Data

Oliver Dutton, Sandro Bottaro, Michele Invernizzi, Istvan Redl, Albert Chung, Falk Hoffmann, Louie Henderson, Stefano Ruschetta, Fabio Airoldi, Benjamin M J Owens, Patrik Foerch, Carlo Fisicaro, Kamil Tamiola
DOI: https://doi.org/10.1101/2024.06.15.599145
2024-09-09
Abstract: Deep learning protein sequence models have shown outstanding performance at de novo protein design and variant effect prediction. We substantially improve performance without further training or use of additional experimental data by introducing a second term, derived from the models themselves, that aligns outputs for the task of stability prediction. On a task to predict variants which increase protein stability, the absolute success probabilities of ProteinMPNN and ESM-IF are improved by 11% and 5% respectively. We term these models ProteinMPNN-ddG and ESM-IF-ddG.
Biophysics
What problem does this paper attempt to address?
The paper addresses protein stability prediction: it improves the performance of inverse folding models without additional training or experimental data. It introduces a method that makes two popular models (ProteinMPNN and ESM-IF) more accurate at predicting the effect of single-point mutations on protein stability. The main improvements are:

1. **Utilizing Maximum Sequence Context**: By adjusting the decoding order, the model can condition on more of the sequence when scoring a position, which improves prediction accuracy.
2. **Utilizing Logit Perturbation Between Different Inputs**: An additional term, derived from the model itself, corrects for information that comes solely from backbone atoms so that it does not skew the stability prediction.
3. **Reduced Time Complexity**: A new decoding scheme lowers the computational cost, making stability prediction for mutations at proteome scale feasible.

With these changes, the paper reports improved performance for ProteinMPNN-ddG and ESM-IF-ddG on three benchmark datasets, along with higher efficiency in practical applications.
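The logit-correction idea can be illustrated with a small sketch. This is not the authors' implementation: it assumes the correction takes the form of a log-probability difference between a prediction made with full context and one made from backbone-only input, and that a mutation is scored relative to the wild-type residue. The function names (`log_softmax`, `stability_scores`) and argument names are hypothetical, and the logits would have to come from an actual inverse folding model.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def stability_scores(context_logits, backbone_only_logits, wt_index):
    """Score every amino acid at one position relative to the wild type.

    context_logits:       model logits at the position given full context
    backbone_only_logits: logits from the backbone-only input (assumed
                          form of the correction term, not confirmed)
    wt_index:             index of the wild-type amino acid

    Returns an array in which the wild-type entry is 0; under this
    sketch's convention, higher values mean the model prefers the variant.
    """
    # Subtract the backbone-only log-probabilities so that preferences
    # explained by the backbone-only input alone cancel out.
    corrected = log_softmax(context_logits) - log_softmax(backbone_only_logits)
    # Express each candidate relative to the wild-type residue.
    return corrected - corrected[wt_index]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ctx = rng.normal(size=20)   # placeholder logits for 20 amino acids
    bb = rng.normal(size=20)    # placeholder backbone-only logits
    print(stability_scores(ctx, bb, wt_index=3))
```

When the two inputs yield identical logits, every score is zero, which is a quick sanity check that the correction only acts on what the extra context contributes.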