PLM_Sol: predicting protein solubility by benchmarking multiple protein language models with the updated protein solubility dataset

Xuechun Zhang, Xiaoxuan Hu, Tongtong Zhang, Ling Yang, Chunhong Liu, Ning Xu, Haoyi Wang, Wen Sun
DOI: https://doi.org/10.1101/2024.04.22.590218
2024-04-24
Abstract: Protein solubility plays a crucial role in various biotechnological, industrial and biomedical applications. As sequencing and gene synthesis costs fall, high-throughput experimental screening coupled with tailored bioinformatic prediction is increasingly used to develop novel functional enzymes of interest (EOI). High protein solubility rates are essential in this process, and accurate prediction of solubility is a challenging task. As deep learning technology continues to evolve, attention-based protein language models (PLMs) can extract more of the intrinsic information in protein sequences. Leveraging these models along with the increasing availability of protein solubility data inferred from structural databases such as the Protein Data Bank (PDB) holds great potential to enhance the prediction of protein solubility. In this study, we curated an Updated Escherichia coli (E. coli) protein Solubility DataSet (UESolDS) and employed a combination of multiple PLMs and classification layers to predict protein solubility. The resulting best-performing model, named Protein Language Model-based protein Solubility prediction model (PLM_Sol), demonstrated significant improvements over previously reported models, achieving a 5.7% increase in accuracy, a 9% increase in F1_score, and a 10.4% increase in MCC score on the independent test set. Moreover, additional evaluation using our in-house synthesized protein resource as test data, encompassing diverse types of enzymes, also showed the superior performance of PLM_Sol. Overall, PLM_Sol exhibited consistent and promising performance across both the independent test set and the experimental set, making it well suited for facilitating large-scale EOI studies. PLM_Sol is available as a standalone program and as an easy-to-use model.
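The three metrics reported in the abstract (accuracy, F1 score, and MCC) can all be derived from a binary confusion matrix. The following is a minimal sketch, not the authors' evaluation code; the function name `binary_metrics` and the toy labels are illustrative assumptions, with 1 meaning soluble and 0 meaning insoluble.

```python
import math


def binary_metrics(y_true, y_pred):
    """Return accuracy, F1 and MCC for binary labels (1 = soluble, 0 = insoluble)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0

    # F1 is the harmonic mean of precision and recall.
    f1_den = 2 * tp + fp + fn
    f1 = 2 * tp / f1_den if f1_den else 0.0

    # MCC uses all four cells, so it stays informative on imbalanced data.
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0

    return {"accuracy": accuracy, "f1": f1, "mcc": mcc}


# Toy example (made-up labels, not data from the paper).
toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 1, 0, 0, 0, 1, 1, 0]
toy_scores = binary_metrics(toy_true, toy_pred)
```

MCC ranges from -1 to 1 and is 0 for chance-level predictions, which is why it is often reported alongside accuracy for solubility classifiers.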
Bioinformatics
What problem does this paper attempt to address?
The paper addresses the prediction of protein solubility in Escherichia coli (E. coli). Specifically, the research team tackled three shortcomings of existing prediction models:

1. **Quality of the dataset**: Existing datasets are outdated, have unclear annotations, and suffer from data contamination. To address this, the researchers integrated data from multiple databases (such as TargetTrack, DNASU, eSOL, and PDB) and, through a series of rigorous data cleaning steps, created an updated, higher-quality dataset called UESolDS.

2. **Performance and usability of the models**: Some previous models are no longer available or require additional input information (such as protein secondary structure), which limits their applicability. In addition, the architecture of some models may not fully leverage recent advances in deep learning. This study therefore developed a new model, PLM_Sol, which combines multiple protein language models (PLMs) with different classification layers to improve prediction accuracy.

3. **Experimental validation**: To verify the effectiveness and generalization ability of PLM_Sol, the research team also tested it experimentally on a set of in-house synthesized proteins belonging to different enzyme families. PLM_Sol showed significant performance improvements on both the independent test set and the experimental dataset.

In summary, the paper aims to improve the accuracy of protein solubility prediction by improving the quality of the training dataset and applying advanced deep learning methods, thereby supporting large-scale enzyme screening research.
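The summary does not list the individual cleaning steps behind UESolDS. As a rough illustration of what merging several sources can involve, the sketch below filters by length, drops sequences containing non-standard residues, collapses duplicates, and discards sequences whose labels conflict across sources. All of these steps, the function name `clean_records`, the thresholds, and the toy records are assumptions made for illustration, not the authors' pipeline.

```python
# The 20 standard amino acids, written as one-letter codes.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def clean_records(records, min_len=20, max_len=1000):
    """Merge (sequence, label, source) records into {sequence: label}.

    label is 1 for soluble and 0 for insoluble. The length bounds are
    hypothetical defaults, not values taken from the paper.
    """
    labels_by_seq = {}
    for seq, label, _source in records:
        seq = seq.strip().upper()
        if not (min_len <= len(seq) <= max_len):
            continue  # too short or too long
        if not set(seq) <= STANDARD_AA:
            continue  # contains ambiguous or non-standard residues
        labels_by_seq.setdefault(seq, set()).add(label)

    # Keep a sequence only if every source agrees on its label.
    return {
        seq: next(iter(labels))
        for seq, labels in labels_by_seq.items()
        if len(labels) == 1
    }


# Toy records (made-up sequences; source names follow the databases named above).
toy_records = [
    ("MKTAYIAKQR", 1, "eSOL"),
    ("mktayiakqr", 1, "PDB"),          # duplicate of the first record
    ("MKTAYIAKQX", 0, "DNASU"),        # 'X' is not a standard residue
    ("GSHMLEDPVD", 0, "TargetTrack"),
    ("GSHMLEDPVD", 1, "PDB"),          # conflicting label, so dropped
    ("MKT", 1, "eSOL"),                # shorter than min_len
    ("AVLIPFWMGS", 0, "DNASU"),
]
toy_clean = clean_records(toy_records, min_len=5)
```

Dropping conflicting sequences rather than picking one label is a conservative choice; a real pipeline might instead prioritize one source or remove near-duplicates by sequence identity.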