PROST: AlphaFold2-aware Sequence-Based Predictor to Estimate Protein Stability Changes upon Missense Mutations
Shahid Iqbal, Fang Ge, Fuyi Li, Tatsuya Akutsu, Yuanting Zheng, Robin B. Gasser, Dong-Jun Yu, Geoffrey Webb, Jiangning Song
DOI: https://doi.org/10.1021/acs.jcim.2c00799
IF: 6.162
2022-01-01
Journal of Chemical Information and Modeling
Abstract: An essential step in engineering proteins and understanding disease-causing missense mutations is to accurately model protein stability changes when such mutations occur. Here, we developed a new sequence-based predictor of the protein stability (PROST) change (the Gibbs free energy change, ΔΔG) upon a single-point missense mutation. PROST extracts multiple descriptors from the most promising sequence-based predictors, such as BoostDDG, SAAFEC-SEQ, and DDGun. PROST also extracts descriptors from iFeature and AlphaFold2. The extracted descriptors include sequence-based features, physicochemical properties, evolutionary information, evolutionary-based physicochemical properties, and predicted structural features. The PROST predictor is a weighted average ensemble model based on extreme gradient boosting (XGBoost) decision trees and an extra-trees regressor; PROST is trained on both direct and hypothetical reverse mutations using the S5294 data set (S2647 direct mutations + S2647 inverse mutations). The parameters of the PROST model are optimized using grid search with 5-fold cross-validation, and feature importance analysis identifies the most relevant features. The performance of PROST is evaluated in a blinded manner, employing nine distinct data sets and existing state-of-the-art sequence-based and structure-based predictors. The method consistently performs well on the frataxin, S217, S349, Ssym, S669, myoglobin, and CAGI5 data sets in blind tests, and performs similarly to the state-of-the-art predictors on the p53 and S276 data sets. When compared with the latest predictors, such as BoostDDG, SAAFEC-SEQ, ACDC-NN-seq, and DDGun, PROST outperforms them. A case study of mutation scanning of the frataxin protein for nine wild-type residues demonstrates the utility of PROST. Taken together, these findings indicate that PROST is a well-suited predictor when no protein structural information is available.
The source code of PROST, data sets, examples, and pretrained models, along with usage instructions, are available at https://github.com/ShahidIqb/PROST and https://prost.erc.monash.edu/seq.
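Two ideas in the abstract, training on direct plus hypothetical reverse mutations and combining two regressors by a weighted average, can be illustrated with a minimal Python sketch. It is not the PROST implementation: the Mutation record, the function names, and the default weight of 0.5 are hypothetical and chosen only for illustration. The only property assumed is the standard thermodynamic antisymmetry, in which the reverse of a mutation carries the negated ΔΔG. Feature extraction and the XGBoost and extra-trees models themselves are omitted.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Mutation:
    """Hypothetical record for one single-point missense mutation."""
    wild_type: str  # one-letter code of the original residue
    position: int   # 1-based position in the sequence
    mutant: str     # one-letter code of the substituted residue
    ddg: float      # stability change (ddG) for this direction


def reverse(m: Mutation) -> Mutation:
    """Build the hypothetical reverse mutation: swap residues, negate ddG."""
    return Mutation(m.mutant, m.position, m.wild_type, -m.ddg)


def augment_with_reverse(direct: Sequence[Mutation]) -> List[Mutation]:
    """Return the direct mutations followed by their reverse counterparts,
    doubling the set size (as in S2647 direct + S2647 inverse = S5294)."""
    return list(direct) + [reverse(m) for m in direct]


def weighted_average(pred_a: Sequence[float],
                     pred_b: Sequence[float],
                     weight_a: float = 0.5) -> List[float]:
    """Combine two models' predictions; weight_a is an assumed value,
    not the weight used by PROST."""
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must lie in [0, 1]")
    if len(pred_a) != len(pred_b):
        raise ValueError("prediction lists must have equal length")
    return [weight_a * a + (1.0 - weight_a) * b
            for a, b in zip(pred_a, pred_b)]
```

In such a setup, pred_a and pred_b would be the outputs of the two base regressors on the same inputs, and the weight would be tuned on validation data rather than fixed by hand.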