Improved prediction of anti-angiogenic peptides based on machine learning models and comprehensive features from peptide sequences

Yun-Chen Lee, Jen-Chieh Yu, Kuan Ni, Yu-Chuan Lin, Ching-Tai Chen
DOI: https://doi.org/10.1038/s41598-024-65062-9
IF: 4.6
2024-06-24
Scientific Reports
Abstract: Angiogenesis is a key process for the proliferation and metastatic spread of cancer cells. Anti-angiogenic peptides (AAPs), with the capability of inhibiting angiogenesis, are promising candidates in cancer treatment. We propose AAPL, a sequence-based predictor to identify AAPs with machine learning models of improved prediction accuracy. Each peptide sequence was transformed to a vector of 4335 numeric values according to 58 different feature types, followed by a heuristic algorithm for feature selection. Next, the hyperparameters of six machine learning models were optimized with respect to the feature subset. We considered two datasets, one with entire peptide sequences and the other with 15 amino acids from peptide N-termini. AAPL achieved Matthew's correlation coefficients of 0.671 and 0.756 for independent tests based on the two datasets, respectively, outperforming existing predictors by a range of 5.3% to 24.6%. Further analyses show that AAPL yields higher prediction accuracy for peptides with more hydrophobic residues, and fewer hydrophilic and charged residues. The source code of AAPL is available at https://github.com/yunzheng2002/Anti-angiogenic.
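As a rough illustration of three ideas mentioned in the abstract, the sketch below takes the first 15 residues of a peptide as its N-terminal window, computes amino acid composition (AAC) as one simple example of a sequence-derived feature, and evaluates the Matthews correlation coefficient from confusion-matrix counts. It is not the authors' code from the AAPL repository. The function names `n_terminal`, `aac`, and `mcc` are hypothetical, and the handling of peptides shorter than 15 residues, of non-standard residues, and of a zero denominator are assumptions made here.

```python
from math import sqrt
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def n_terminal(seq: str, length: int = 15) -> str:
    """Return the first `length` residues; the N-terminus is the start of the string.

    Assumption: peptides shorter than `length` are returned unchanged.
    """
    return seq[:length]


def aac(seq: str) -> List[float]:
    """Amino acid composition: fraction of each standard residue in the sequence.

    Assumption: non-standard letters are ignored in the counts, so the
    fractions sum to less than 1 when such letters are present.
    """
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: return 0.0 when a row or column of the matrix is empty.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom
```

AAC yields only 20 of the 4335 values described in the abstract; the remaining feature types would be concatenated to it in the same way.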
multidisciplinary sciences
What problem does this paper attempt to address?
This paper aims to improve the prediction accuracy of anti-angiogenic peptides (AAPs). The authors propose AAPL (Anti-angiogenic Peptide Predictor based on Learning), a predictor that combines machine-learning models with comprehensive feature extraction to identify peptides that can inhibit angiogenesis, thereby providing potential drug candidates for cancer treatment.

### Problem Background

1. **Angiogenesis and Cancer**
   - Angiogenesis is crucial for cancer cell proliferation and metastasis because it promotes the supply of oxygen and nutrients and the removal of waste.
   - Anti-angiogenic peptides (AAPs) can inhibit angiogenesis and show great potential in cancer treatment.
2. **Limitations of Existing Methods**
   - Experimental techniques for discovering and optimizing AAPs are both time-consuming and expensive.
   - Existing computational methods have made progress, but shortcomings in feature selection and machine-learning model optimization leave room to improve prediction accuracy.

### Core Objectives of the Paper

- **Improve Prediction Accuracy**: Improve the accuracy of AAP prediction by introducing a more comprehensive feature set and optimizing the machine-learning models.
- **Systematic Feature Selection**: Use a heuristic algorithm for feature selection to identify a well-performing feature subset.
- **Optimize Machine-Learning Models**: Optimize the hyperparameters of six different machine-learning models to further improve prediction performance.

### Main Contributions

- **Extension of Feature Set**: Considered 58 different feature types, including amino acid composition (AAC), dipeptide composition (DPC), and pseudo-amino acid composition (PseAAC), generating a total of 4,335 numerical features.
- **Heuristic Feature Selection**: Use the Boruta procedure for feature ranking and determine the best feature subset through iterative five-fold cross-validation.
- **Model Optimization**: Use the Optuna tool to optimize the hyperparameters of six machine-learning models, including support vector machines (SVM), linear discriminant analysis (LDA), and random forests (RF).
- **Significant Performance Improvement**: In independent tests on the two datasets, AAPL reached Matthews correlation coefficients (MCC) of 0.671 and 0.756, respectively, outperforming existing predictors by 5.3% to 24.6%.

### Conclusion

By combining a comprehensive feature set with optimized machine-learning models, AAPL improves the prediction accuracy of anti-angiogenic peptides, providing a useful tool for cancer treatment research. The study also found that AAPL predicts more accurately for peptides with more hydrophobic residues and fewer hydrophilic and charged residues, which suggests a direction for further research and improvement.

### Formula Summary

- **Normalization Formula**
  \[ y_i=\frac{x_i - \text{Median}(X)}{Q_3(X)-Q_1(X)} \]
  where \( Q_3(X) \) and \( Q_1(X) \) are the third and first quartiles of feature \( X \), respectively.
- **Best MCC Calculation**
  \[ \text{Best\_MCC}_N = \max_i(\text{MCC}_{N_i}) \]
  \[ \text{BFS}=F_{S_j}, \quad \text{where} \quad \text{Best\_MCC}_j=\max_k(\text{Best\_MCC}_k) \]
- **Evaluation Metrics**
  \[ \text{Accuracy (Acc)}=\frac{\text{TP}+\text{TN}}{\text{TP}+\text{TN}+\text{FP}+\text{FN}} \]
  \[ \text{Precision (Pre)}=\frac{\text{TP}