Systematic Comparison and Comprehensive Evaluation of 80 Amino Acid Descriptors in Peptide QSAR Modeling
Peng Zhou, Qian Liu, Ting Wu, Qingqing Miao, Shuyong Shang, Heyi Wang, Zheng Chen, Shaozhou Wang, Heyan Wang
DOI: https://doi.org/10.1021/acs.jcim.0c01370
IF: 6.162
2021-03-12
Journal of Chemical Information and Modeling
Abstract: The peptide quantitative structure–activity relationship (QSAR), also known as the quantitative sequence–activity model (QSAM), has attracted much attention in the bio- and chemoinformatics communities and is a well-developed computational peptidology strategy to statistically correlate the sequence/structure and activity/property relationships of functional peptides. Amino acid descriptors (AADs) are one of the most widely used methods to characterize peptide structures by decomposing the peptide into its residue building blocks and sequentially parametrizing each building block with a vector of amino acid principal properties. Considering that various AADs have been proposed over the past decades and new AADs are still emerging today, we herein ask: is it necessary to develop so many AADs, and do we need to keep developing new ones? In this study, we exhaustively collect 80 published AADs and comprehensively evaluate their modeling performance (including fitting ability, internal stability, and predictive power) on 8 QSAR-oriented peptide sample sets (QPSs) by employing 2 sophisticated machine learning methods (MLMs), building and systematically comparing a total of 1280 (80 AADs × 8 QPSs × 2 MLMs) peptide QSAR models. The following is revealed: (i) None of the AADs works best on all or most peptide sets; an AAD usually performs well for some peptides but badly for others. (ii) Modeling performance is primarily determined by the peptide samples and then by the MLMs used, while AADs have only a moderate influence on the performance. (iii) There is no essential difference between the modeling performances of different AAD types (physicochemical, topological, 3D-structural, etc.). (iv) Two random descriptors, generated separately from the standard normal distribution N(0, 1) and the uniform distribution U(−1, +1), do not perform significantly worse than these carefully developed AADs.
(v) A secondary descriptor, which carries the major information contained in the 80 (primary) AADs, does not perform significantly better than these AADs. Overall, we conclude that since various AADs are already available and cover numerous amino acid properties, further development of new AADs is not essential for improving peptide QSAR modeling; the traditional AAD methodology is believed to have almost reached its theoretical limit. In addition, the AADs are more likely to act as vector symbols than as informative data; they are used to mark and distinguish the 20 amino acids but do not really bring much original property information to them.
Supporting Information: available free of charge at https://pubs.acs.org/doi/10.1021/acs.jcim.0c01370. (Figure S1) Systematic histogram of QSAR metrics. (Figure S2) Systematic histogram of the mean ± s.e. values of QSAR metrics. (Figure S3) Systematic pairwise Euclidean distance between the mean values of QSAR metrics. (Table S1) Full list of 80 AADs. (Tables S2–S9) Full list of 8 QSAR-oriented peptide sample sets (PDF).
chemistry, multidisciplinary, medicinal, computer science, interdisciplinary applications, information systems
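
The abstract describes AAD encoding as splitting a peptide into residues and replacing each residue with a vector, and it mentions two random descriptors drawn from N(0, 1) and U(−1, +1). The Python sketch below illustrates that idea only; it is not the authors' implementation. The function names `random_descriptor` and `encode_peptide`, the default of 5 components per residue, and the seed are all assumptions made for illustration, since the abstract does not state them.

```python
import numpy as np

# One-letter codes of the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_descriptor(kind, n_components=5, seed=0):
    """Build a hypothetical random AAD: one vector per amino acid.

    kind: "normal" draws from N(0, 1); "uniform" draws from U(-1, +1).
    n_components and seed are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    shape = (len(AMINO_ACIDS), n_components)
    if kind == "normal":
        table = rng.standard_normal(shape)
    elif kind == "uniform":
        table = rng.uniform(-1.0, 1.0, shape)
    else:
        raise ValueError(f"unknown descriptor kind: {kind!r}")
    return {aa: table[i] for i, aa in enumerate(AMINO_ACIDS)}


def encode_peptide(sequence, descriptor):
    """Concatenate the per-residue vectors in sequence order."""
    return np.concatenate([descriptor[aa] for aa in sequence])


if __name__ == "__main__":
    aad = random_descriptor("uniform", n_components=3)
    features = encode_peptide("ACD", aad)
    # A tripeptide with 3 components per residue gives 9 features.
    print(features.shape)
```

A feature vector built this way would then be passed to a regression method; the abstract's grid of 80 AADs × 8 QPSs × 2 MLMs corresponds to repeating that step for every combination. Concatenation only yields equal-length feature vectors when all peptides in a set have the same length, which this sketch assumes.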