Abstract: Cell-penetrating peptides (CPPs) are short chains of amino acids that have shown remarkable potential to cross the cell membrane and deliver coupled therapeutic cargoes into cells. Designing and testing different CPPs to target specific cells or tissues is crucial to ensure high delivery efficiency and reduced toxicity. However, in vivo and in vitro testing of CPPs is both time-consuming and costly, which has led to interest in computational methodologies, such as Machine Learning (ML) approaches, as faster and cheaper methods for CPP design and uptake prediction. Most ML models developed to date focus on classification rather than regression, because informative quantitative uptake values are scarce. To address these challenges, we developed POSEIDON, an open-access, up-to-date curated database that provides experimental quantitative uptake values for over 2,300 entries and physicochemical properties of 1,315 peptides. POSEIDON also records experimental details for each entry, such as cell line, cargo, and sequence, among others. By leveraging this database along with cell line genomic features, we processed a dataset of over 1,200 entries to develop an ML regression CPP uptake predictor. On an independent test set, the POSEIDON predictor accurately predicted peptide uptake in cell lines, achieving a Pearson correlation of 0.87, a Spearman correlation of 0.88, and an R² score of 0.76. With its comprehensive and novel dataset and its predictive capabilities, the POSEIDON database and its associated ML predictor represent a significant step forward in CPP research and development. The POSEIDON database and ML predictor are freely available through a user-friendly interface at https://moreiralab.com/resources/poseidon/, making them valuable resources for advancing research on CPP-related topics.
Scientific Contribution Statement: Our research addresses the critical need for more efficient and cost-effective methodologies in Cell-Penetrating Peptide (CPP) research. We introduced POSEIDON, a comprehensive and freely accessible database that delivers quantitative uptake values for over 2,300 entries, along with detailed physicochemical profiles for 1,315 peptides. Recognizing the limitations of current Machine Learning (ML) models for CPP design, our work leveraged the rich dataset provided by POSEIDON to develop a highly accurate ML regression model for predicting CPP uptake.
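For readers less familiar with the three regression metrics quoted in the abstract (Pearson correlation, Spearman correlation, and R²), the following is a minimal sketch of how such scores are commonly computed for a set of measured and predicted uptake values. It is not the authors' evaluation code. The function names and the example numbers are hypothetical, and it assumes that the reported R² is the usual coefficient of determination, which the source does not state explicitly.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(ranks(x), ranks(y))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical measured vs. predicted uptake values (illustration only,
# not taken from POSEIDON).
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.5, 1.8, 3.2, 3.9, 5.6]

print("Pearson :", round(pearson(y_true, y_pred), 3))
print("Spearman:", round(spearman(y_true, y_pred), 3))
print("R^2     :", round(r2_score(y_true, y_pred), 3))
```

Pearson measures linear agreement, Spearman measures agreement in ranking only, and R² additionally penalizes systematic offsets between predicted and measured values, which is why the three are usually reported together for a regression model.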
