Abstract: Toxicity is a major challenge in the design of therapeutic peptides and has caused numerous peptides to fail in clinical trials. In 2013, our group developed ToxinPred, a computational method that has been widely adopted by the scientific community for predicting peptide toxicity. In this paper, we propose a refined version of ToxinPred with improved reliability and accuracy. We first used a similarity/alignment-based approach employing BLAST to predict toxic peptides, which yielded satisfactory accuracy but suffered from inadequate coverage. We then employed a motif-based approach using MERCI software to uncover patterns or motifs that are observed exclusively in toxic peptides. Searching for these motifs allowed us to predict toxic peptides with high specificity but poor sensitivity. To overcome these coverage limitations, we developed alignment-free methods using machine learning and deep learning techniques that balance the sensitivity and specificity of prediction. A deep learning model (ANN-LSTM with fixed sequence length) developed using one-hot encoding achieved a maximum AUROC of 0.93 with an MCC of 0.71 on an independent dataset. A machine learning model (extra tree) developed using compositional features of peptides achieved a maximum AUROC of 0.95 with an MCC of 0.78. We also developed models based on large language models and achieved a maximum AUROC of 0.93 using ESM2-t33. Finally, we developed hybrid or ensemble methods that combine two or more of these approaches to enhance performance. Our hybrid method, which combines the motif-based approach with a machine learning-based model, achieved a maximum AUROC of 0.98 with an MCC of 0.81 on an independent dataset. In this study, all models were trained and tested on 80% of the data using five-fold cross-validation and evaluated on the remaining 20% of the data, referred to as the independent dataset. The evaluation of all methods on the independent dataset revealed that the method proposed in this study performs better than existing methods. To serve the scientific community, we have developed standalone software, a pip package, and a web-based server, ToxinPred3 (https://github.com/raghavagps/toxinpred3 and https://webs.iiitd.edu.in/raghava/toxinpred3/).
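
To make the compositional features, the hybrid idea, and the MCC metric mentioned above concrete, the following is a minimal Python sketch rather than the ToxinPred3 implementation. The function names, the additive combination rule, the `motif_bonus` value, and the example peptide and motif are all hypothetical and chosen only for illustration. The abstract states only that a motif-based approach is combined with a machine learning model.

```python
from collections import Counter
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(peptide):
    """Return the percentage of each standard residue in the peptide."""
    if not peptide:
        raise ValueError("peptide must not be empty")
    counts = Counter(peptide.upper())
    return [100.0 * counts[aa] / len(peptide) for aa in AMINO_ACIDS]


def hybrid_score(ml_probability, peptide, toxic_motifs, motif_bonus=0.5):
    """Combine an ML probability with a motif hit.

    Assumption: a fixed bonus is added when any motif occurs in the
    peptide. Both the rule and the bonus value are illustrative.
    """
    has_motif = any(motif in peptide.upper() for motif in toxic_motifs)
    return ml_probability + (motif_bonus if has_motif else 0.0)


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (1 = toxic)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Return 0.0 when a row or column of the confusion matrix is empty.
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    peptide = "GCCSDPRCAWRC"   # hypothetical example sequence
    motifs = ["CCS"]           # hypothetical motif, not taken from MERCI output
    features = amino_acid_composition(peptide)
    print(len(features), round(sum(features), 6))
    print(hybrid_score(0.3, peptide, motifs))
    print(mcc([1, 1, 0, 0], [1, 0, 0, 0]))
```

In this sketch the composition vector would be the input to a classifier, and the hybrid score would then be thresholded to produce the toxic or non-toxic label on which MCC is computed.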

ToxinPred 3.0: An improved method for predicting the toxicity of peptides
