PLM_Sol: predicting protein solubility by benchmarking multiple protein language models with the updated protein solubility dataset

Xuechun Zhang,Xiaoxuan Hu,Tongtong Zhang,Ling Yang,Chunhong Liu,Ning Xu,Haoyi Wang,Wen Sun

DOI: https://doi.org/10.1101/2024.04.22.590218

2024-04-24

Abstract: Protein solubility plays a crucial role in various biotechnological, industrial and biomedical applications. With the reduction in sequencing and gene synthesis costs, high-throughput experimental screening coupled with tailored bioinformatic prediction is increasingly used to develop novel functional enzymes of interest (EOI). High protein solubility rates are essential in this process, and accurate prediction of solubility is a challenging task. As deep learning technology continues to evolve, attention-based protein language models (PLMs) can extract intrinsic information from protein sequences to a greater extent. Leveraging these models, along with the increasing availability of protein solubility data inferred from structural databases such as the Protein Data Bank (PDB), holds great potential to enhance the prediction of protein solubility. In this study, we curated an Updated Escherichia coli (E. coli) protein Solubility DataSet (UESolDS) and employed a combination of multiple PLMs and classification layers to predict protein solubility. The resulting best-performing model, named Protein Language Model-based protein Solubility prediction model (PLM_Sol), demonstrated significant improvements over previously reported models, achieving a 5.7% increase in accuracy, 9% increase in F1_score, and 10.4% increase in MCC score on the independent test set. Moreover, additional evaluation using our in-house synthesized protein resource as test data, encompassing diverse types of enzymes, also showcased the superior performance of PLM_Sol. Overall, PLM_Sol exhibited consistent and promising performance across both the independent test set and the experimental set, making it well-suited for facilitating large-scale EOI studies. PLM_Sol is available as a standalone program and as an easy-to-use model at .
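The abstract reports gains in accuracy, F1_score, and MCC (Matthews correlation coefficient). As a reminder of how these three metrics relate for a binary soluble/insoluble classifier, here is a minimal sketch computed from confusion-matrix counts. The function name `binary_metrics` and the toy labels are illustrative assumptions, not taken from the paper or from the PLM_Sol code.

```python
import math


def binary_metrics(y_true, y_pred):
    """Return accuracy, F1, and MCC for binary labels (1 = soluble, 0 = insoluble)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)

    # F1 is the harmonic mean of precision and recall.
    f1_den = 2 * tp + fp + fn
    f1 = 2 * tp / f1_den if f1_den else 0.0

    # MCC uses all four cells, so it stays informative on imbalanced data.
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0

    return {"accuracy": accuracy, "f1": f1, "mcc": mcc}


# Toy labels, invented purely for illustration.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

On these toy labels accuracy and F1 are both 0.75 while MCC is 0.5, which illustrates why MCC is often reported alongside the other two: it penalizes errors in both classes.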

Bioinformatics

What problem does this paper attempt to address?

The paper addresses the prediction of protein solubility in Escherichia coli (E. coli). Specifically, the research team targeted three key weaknesses of existing prediction models:

1. **Quality of the dataset**: Existing datasets are outdated, have unclear annotations, and suffer from data contamination. To address this, the researchers integrated data from multiple databases (such as TargetTrack, DNASU, eSOL, and PDB) and, through a series of rigorous data cleaning steps, created an updated, higher-quality dataset called UESolDS.
2. **Performance and usability of the models**: Some previous models are no longer available or require additional input information (such as protein secondary structure), which limits their applicability. In addition, the architecture of some models may not fully leverage the latest advances in deep learning. The study therefore developed a new model, PLM_Sol, which combines multiple protein language models (PLMs) with different classification layers to improve prediction accuracy.
3. **Experimental validation**: To verify the effectiveness and generalization ability of PLM_Sol, the research team also tested it experimentally on a set of in-house synthesized proteins spanning different enzyme families. PLM_Sol showed significant performance improvements on both the independent test set and the experimental dataset.

In summary, this paper aims to improve the accuracy of protein solubility prediction by enhancing the quality of the training dataset and applying advanced deep learning methods, thereby supporting large-scale enzyme screening research.
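The summary does not spell out the cleaning steps used to build UESolDS, so the following is only a generic sketch of one step that such a pipeline might include: merging records from several sources, collapsing exact duplicate sequences, and dropping sequences whose solubility labels conflict. The function `merge_and_clean`, the source names, and the toy sequences are all hypothetical; a real pipeline would often also cluster sequences by identity, which this sketch does not attempt.

```python
from collections import defaultdict


def merge_and_clean(sources):
    """Merge (sequence, label) records and keep only consistently labeled sequences.

    sources: dict mapping a source name to a list of (sequence, label) pairs,
             where label is 1 (soluble) or 0 (insoluble).
    Returns a dict mapping each retained sequence to its single label.
    """
    labels_by_seq = defaultdict(set)
    for records in sources.values():
        for seq, label in records:
            normalized = seq.strip().upper()  # treat case/whitespace variants as equal
            if normalized:
                labels_by_seq[normalized].add(label)

    # A sequence with more than one distinct label is ambiguous and is dropped.
    return {
        seq: next(iter(labels))
        for seq, labels in labels_by_seq.items()
        if len(labels) == 1
    }


# Toy records, invented purely for illustration.
sources = {
    "source_a": [("MKTAYIAK", 1), ("mstnpkpq", 0)],
    "source_b": [("MKTAYIAK", 1), ("MSTNPKPQ", 1), ("MGSSHHHH", 0)],
}
cleaned = merge_and_clean(sources)
print(cleaned)
```

Here the duplicated sequence is kept once, and the sequence labeled soluble by one source and insoluble by the other is removed rather than resolved by guesswork.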

PLM_Sol: predicting protein solubility by benchmarking multiple protein language models with the updated Escherichia coli protein solubility dataset

ProtSolM: Protein Solubility Prediction with Multi-modal Features

Develop machine learning based predictive models for engineering protein solubility

Develop machine learning-based regression predictive models for engineering protein solubility

DeepSol: a deep learning framework for sequence‐based protein solubility prediction

GATSol, an enhanced predictor of protein solubility through the synergy of 3D structure graph and large language modeling

Prediction of protein solubility based on sequence physicochemical patterns and distributed representation information with DeepSoluE

DBA/2J (Mls‐1a) B‐cell differentiation in BALB.xid recipients

SoluProt: prediction of soluble protein expression in Escherichia coli

Efficient Inference, Training, and Fine-tuning of Protein Language Models

PLMC: Language Model of Protein Sequences Enhances Protein Crystallization Prediction

Does protein pretrained language model facilitate the prediction of protein–ligand interaction?

THPLM: a sequence-based deep learning framework for protein stability changes prediction upon point variations using pretrained protein language model

InstructPLM: Aligning Protein Language Models to Follow Protein Structure Instructions

S-PLM: Structure-aware Protein Language Model via Contrastive Learning between Sequence and Structure

Enhancing Predictions of Drug Solubility through Multidimensional Structural Characterization Exploitation

Protein–Sol: a web tool for predicting protein solubility from sequence

Optimizing Pharmacokinetic Property Prediction Based on Integrated Datasets and a Deep Learning Approach

Boosting the predictive performance with aqueous solubility dataset curation

A Fine-tuning Dataset and Benchmark for Large Language Models for Protein Understanding