Abstract: Protein solubility prediction supports the careful selection of highly effective candidate proteins for drug development. In recombinant protein synthesis, solubility prediction is valuable for optimizing key protein characteristics, including stability, functionality, and ease of purification. Solubility also carries valuable information about potential biomarkers or therapeutic targets and helps in the early forecasting of neurodegenerative diseases, cancer, and cardiovascular disorders. Traditional wet-lab experimental approaches to assessing protein solubility are error-prone, time-consuming, and costly. Researchers have therefore harnessed Artificial Intelligence approaches to replace experimental approaches with computational predictors. These predictors infer the solubility of proteins by analyzing amino acid distributions in raw protein sequences. There is still considerable room for developing robust computational predictors, because existing predictors fail to extract a comprehensive, discriminative distribution of amino acids. To discriminate soluble proteins from insoluble proteins more precisely, this paper presents the ProSol-Multi predictor, which combines a novel MLCDE encoder with a Random Forest classifier. The MLCDE encoder transforms protein sequences into informative statistical vectors by capturing the multi-level correlation and discriminative distribution of amino acids within raw protein sequences. The performance of the proposed encoder is evaluated against 56 existing protein sequence encoding methods on a widely used protein solubility prediction benchmark dataset under two experimental settings, intrinsic and extrinsic. Intrinsic evaluation reveals that, among all sequence encoders, the proposed MLCDE encoder manages to generate non-overlapping clusters of the soluble and insoluble classes. In extrinsic evaluation, 10 machine learning classifiers achieve better performance with the proposed MLCDE encoder than with the 56 existing protein sequence encoders. Moreover, across 4 public benchmark datasets, the proposed ProSol-Multi predictor outperforms 20 existing predictors by an average of 3% in accuracy and 2% in both MCC and AU-ROC. The ProSol-Multi interactive web application is available at https://sds_genetic_analysis.opendfki.de/ProSol-Multi.
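The abstract does not spell out how MLCDE computes its vectors, so the sketch below is not the authors' encoder. As a rough illustration of the general idea of turning a raw sequence into a fixed-length statistical vector, it uses plain amino acid composition as a stand-in, and it also shows how MCC, one of the reported metrics, is computed from binary predictions. The function names `composition_vector` and `mcc` are hypothetical and chosen only for this example.

```python
from collections import Counter
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition_vector(sequence):
    """Relative frequency of each standard amino acid.

    Assumption: a simple composition encoder used as a stand-in;
    this is NOT the MLCDE encoder described in the abstract.
    """
    counts = Counter(sequence.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (1 = soluble, 0 = insoluble)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom
```

In a full pipeline, vectors of this kind would be passed to a classifier such as a Random Forest; that step is omitted here to keep the sketch dependency-free.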

ProSol-multi: Protein solubility prediction via amino acids multi-level correlation and discriminative distribution
