Efficient Prediction of Peptide Self-assembly through Sequential and Graphical Encoding

Zihan Liu, Jiaqi Wang, Yun Luo, Shuang Zhao, Wenbin Li, Stan Z. Li
2023-07-17
Abstract: In recent years, there has been an explosion of research on the application of deep learning to the prediction of various peptide properties, due to the significant development and market potential of peptides. Molecular dynamics has enabled the efficient collection of large peptide datasets, providing reliable training data for deep learning. However, the lack of systematic analysis of the peptide encoding, which is essential for AI-assisted peptide-related tasks, makes it an urgent problem to be solved for the improvement of prediction accuracy. To address this issue, we first collect a high-quality, colossal simulation dataset of peptide self-assembly containing over 62,000 samples generated by coarse-grained molecular dynamics (CGMD). Then, we systematically investigate the effect of peptide encoding of amino acids into sequences and molecular graphs using state-of-the-art sequential (i.e., RNN, LSTM, and Transformer) and structural deep learning models (i.e., GCN, GAT, and GraphSAGE), on the accuracy of peptide self-assembly prediction, an essential physicochemical process prior to any peptide-related applications. Extensive benchmarking studies have proven Transformer to be the most powerful sequence-encoding-based deep learning model, pushing the limit of peptide self-assembly prediction to decapeptides. In summary, this work provides a comprehensive benchmark analysis of peptide encoding with advanced deep learning models, serving as a guide for a wide range of peptide-related predictions such as isoelectric points, hydration free energy, etc.
Biomolecules, Machine Learning
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper aims to systematically evaluate and analyze the performance of advanced deep learning models in peptide self-assembly prediction through sequence encoding and graph encoding methods. Specifically, the paper focuses on the following aspects:

1. **Systematic Analysis of Peptide Encoding Methods**:
   - Peptides are short-chain molecules composed of amino acids, and their self-assembly process has important applications in biomedicine and materials science. However, existing studies lack a systematic analysis of peptide encoding methods, which limits the accuracy of peptide-related tasks (such as peptide self-assembly prediction).
   - To fill this gap, the paper collected a high-quality, large-scale peptide self-assembly simulation dataset containing over 62,000 samples, generated by coarse-grained molecular dynamics (CGMD).
2. **Comparison of Encoding Methods**:
   - The paper compares sequence-based encoding methods and graph-based encoding methods. Sequence encoding methods include Recurrent Neural Networks (RNN), Long Short-Term Memory networks (LSTM), Bidirectional LSTM (Bi-LSTM), and Transformer. Graph encoding methods include Graph Convolutional Networks (GCN), Graph Attention Networks (GAT), and GraphSAGE.
   - Through these encoding methods, the paper explores the performance of different deep learning models in peptide self-assembly prediction to determine the most effective encoding and modeling methods.
3. **Improving Prediction Accuracy**:
   - The goal of the paper is to find methods that can significantly improve the accuracy of peptide self-assembly prediction through systematic analysis and comparison. Specifically, the paper hopes to provide guidance for future peptide-related prediction tasks (such as isoelectric point, hydration free energy, etc.) through these analyses.

### Main Contributions

1. **Construction of a Large-Scale Dataset**:
   - The paper constructed a high-quality peptide self-assembly dataset containing over 62,000 samples, which is one of the largest peptide self-assembly simulation datasets to date.
2. **Systematic Evaluation of Encoding Methods**:
   - The paper systematically evaluated various sequence encoding and graph encoding methods in peptide self-assembly prediction, providing detailed performance comparisons.
3. **Selection of State-of-the-Art Models**:
   - Experimental results show that the Transformer performs best among sequence encoding methods, while GraphSAGE performs best among graph encoding methods. These two methods exhibit very close performance in peptide self-assembly prediction tasks.
4. **Guidance for Future Research**:
   - The analysis and results of the paper provide important references for future peptide-related prediction tasks, especially in selecting appropriate encoding and modeling methods.

### Conclusion

Through systematic analysis and experiments, the paper demonstrates the superior performance of Transformer and GraphSAGE in peptide self-assembly prediction and provides valuable guidance for future peptide-related research. These results not only help improve the accuracy of peptide self-assembly prediction but also offer new insights for other peptide-related tasks.
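
### Illustrative Sketch: Sequence vs. Graph Encoding

To make the distinction between the two encoding families concrete, the following minimal Python sketch encodes one peptide both ways. It is an illustration under stated assumptions, not the paper's implementation: the function names `encode_sequence` and `encode_graph`, the alphabetical token ordering, and the choice of a residue-level chain graph (one node per amino acid, one edge per peptide bond) are all hypothetical. The paper's actual graph construction may operate at a different granularity.

```python
from typing import List, Tuple

# Assumption: 20 standard amino acids in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def encode_sequence(peptide: str) -> List[int]:
    """Sequential encoding: one integer token per residue, in chain order.

    This is the kind of input an RNN, LSTM, or Transformer would consume
    (after an embedding layer).
    """
    return [AA_TO_INDEX[aa] for aa in peptide.upper()]


def encode_graph(peptide: str) -> Tuple[List[int], List[Tuple[int, int]]]:
    """Graph encoding: residues as nodes, peptide bonds as undirected edges.

    Each bond is stored in both directions, the edge-list convention that
    message-passing models such as GCN, GAT, or GraphSAGE typically expect.
    """
    nodes = encode_sequence(peptide)
    edges = []
    for i in range(len(nodes) - 1):
        edges.append((i, i + 1))  # bond from residue i to i + 1
        edges.append((i + 1, i))  # reverse direction for an undirected graph
    return nodes, edges


if __name__ == "__main__":
    tokens = encode_sequence("FFK")
    nodes, edges = encode_graph("FFK")
    print(tokens)  # [4, 4, 8]
    print(edges)   # [(0, 1), (1, 0), (1, 2), (2, 1)]
```

For a linear peptide, this residue-level graph carries the same connectivity as the sequence itself. That may be one intuition for why the best sequence model and the best graph model perform so closely, though the summary above does not state a cause.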