Abstract: Protein solubility prediction supports the careful selection of highly effective candidate proteins for drug development. In recombinant protein synthesis, solubility prediction is valuable for optimizing key protein characteristics, including stability, functionality, and ease of purification. Solubility also carries valuable information about potential biomarkers and therapeutic targets, and helps in the early forecasting of neurodegenerative diseases, cancer, and cardiovascular disorders. Traditional wet-lab approaches for determining protein solubility are error-prone, time-consuming, and costly. Researchers have therefore turned to Artificial Intelligence to replace experimental approaches with computational predictors. These predictors infer the solubility of proteins by analyzing amino acid distributions in raw protein sequences. There is still considerable room for more robust computational predictors, because existing predictors fail to extract a comprehensive, discriminative distribution of amino acids. To discriminate soluble proteins from insoluble proteins more precisely, this paper presents the ProSol-Multi predictor, which combines a novel MLCDE encoder with a Random Forest classifier. The MLCDE encoder transforms protein sequences into informative statistical vectors by capturing the multi-level correlation and discriminative distribution of amino acids within raw protein sequences. The performance of the proposed encoder is evaluated against 56 existing protein sequence encoding methods on a widely used protein solubility prediction benchmark dataset under two experimental settings, intrinsic and extrinsic. Intrinsic evaluation reveals that, among all sequence encoders, the proposed MLCDE encoder manages to generate non-overlapping clusters of the soluble and insoluble classes. In extrinsic evaluation, 10 machine learning classifiers achieve better performance with the proposed MLCDE encoder than with the 56 existing protein sequence encoders. Moreover, across 4 public benchmark datasets, the proposed ProSol-Multi predictor outperforms 20 existing predictors by an average of 3% in accuracy and 2% in both MCC and AU-ROC. The ProSol-Multi interactive web application is available at https://sds_genetic_analysis.opendfki.de/ProSol-Multi.
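
To make the idea of turning a raw protein sequence into a statistical vector concrete, the following is a minimal illustrative sketch. It is not the MLCDE encoder, whose details are not given in the abstract. It assumes a simple stand-in: amino acid composition plus frequencies of residue pairs separated by 0 to `max_gap` positions, as a rough proxy for correlation at several levels. The function names and the `max_gap` parameter are hypothetical.

```python
from collections import Counter
from itertools import product

# The 20 standard amino acids, in a fixed order so that vectors are comparable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = list(product(AMINO_ACIDS, repeat=2))


def composition(seq):
    """Relative frequency of each standard amino acid in the sequence."""
    counts = Counter(seq)
    total = sum(counts[a] for a in AMINO_ACIDS) or 1  # avoid division by zero
    return [counts[a] / total for a in AMINO_ACIDS]


def gapped_pair_frequencies(seq, gap):
    """Relative frequency of residue pairs (seq[i], seq[i + gap + 1])."""
    counts = Counter(
        (seq[i], seq[i + gap + 1]) for i in range(len(seq) - gap - 1)
    )
    total = sum(counts[p] for p in PAIRS) or 1  # avoid division by zero
    return [counts[p] / total for p in PAIRS]


def encode(seq, max_gap=2):
    """Hypothetical stand-in encoder (an assumption, not MLCDE).

    Concatenates composition (20 values) with gapped-pair frequencies
    (400 values per gap level, for gaps 0..max_gap).
    """
    seq = seq.upper()
    vector = composition(seq)
    for gap in range(max_gap + 1):
        vector.extend(gapped_pair_frequencies(seq, gap))
    return vector


features = encode("MKTAYIAKQR")
```

Vectors of this kind could then be passed to a classifier such as a Random Forest (for example, scikit-learn's `RandomForestClassifier`) together with soluble/insoluble labels; that step is omitted here to keep the sketch dependency-free.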

ProSol-multi: Protein solubility prediction via amino acids multi-level correlation and discriminative distribution
