Abstract: In recent years, there has been an explosion of research on the application of deep learning to the prediction of various peptide properties, driven by the significant development and market potential of peptides. Molecular dynamics has enabled the efficient collection of large peptide datasets, providing reliable training data for deep learning. However, peptide encoding, which is essential for AI-assisted peptide-related tasks, has not been analyzed systematically, and closing this gap is an urgent step toward better prediction accuracy. To address this issue, we first collect a high-quality, large-scale simulation dataset of peptide self-assembly containing over 62,000 samples generated by coarse-grained molecular dynamics (CGMD). Then, we systematically investigate how encoding peptides as amino acid sequences or as molecular graphs affects the accuracy of peptide self-assembly prediction, an essential physicochemical process prior to any peptide-related application, using state-of-the-art sequential (i.e., RNN, LSTM, and Transformer) and structural (i.e., GCN, GAT, and GraphSAGE) deep learning models. Extensive benchmarking studies show the Transformer to be the most powerful sequence-encoding-based deep learning model, pushing the limit of peptide self-assembly prediction to decapeptides. In summary, this work provides a comprehensive benchmark analysis of peptide encoding with advanced deep learning models, serving as a guide for a wide range of peptide-related predictions such as isoelectric points and hydration free energy.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper aims to systematically evaluate and analyze the performance of advanced deep learning models in peptide self-assembly prediction through sequence encoding and graph encoding methods. Specifically, the paper focuses on the following aspects:

1. **Systematic Analysis of Peptide Encoding Methods**:
   - Peptides are short-chain molecules composed of amino acids, and their self-assembly process has important applications in biomedicine and materials science. However, existing studies lack a systematic analysis of peptide encoding methods, which limits the accuracy of peptide-related tasks (such as peptide self-assembly prediction).
   - To fill this gap, the paper collected a high-quality, large-scale peptide self-assembly simulation dataset containing over 62,000 samples, generated by coarse-grained molecular dynamics (CGMD).
2. **Comparison of Encoding Methods**:
   - The paper compares sequence-based encoding methods and graph-based encoding methods. Sequence encoding methods include Recurrent Neural Networks (RNN), Long Short-Term Memory networks (LSTM), Bidirectional LSTM (Bi-LSTM), and Transformer. Graph encoding methods include Graph Convolutional Networks (GCN), Graph Attention Networks (GAT), and GraphSAGE.
   - Through these encoding methods, the paper explores the performance of different deep learning models in peptide self-assembly prediction to determine the most effective encoding and modeling methods.
3. **Improving Prediction Accuracy**:
   - The goal of the paper is to find methods that can significantly improve the accuracy of peptide self-assembly prediction through systematic analysis and comparison. Specifically, the paper hopes to provide guidance for future peptide-related prediction tasks (such as isoelectric point, hydration free energy, etc.) through these analyses.

### Main Contributions

1. **Construction of a Large-Scale Dataset**:
   - The paper constructed a high-quality peptide self-assembly dataset containing over 62,000 samples, which is one of the largest peptide self-assembly simulation datasets to date.
2. **Systematic Evaluation of Encoding Methods**:
   - The paper systematically evaluated various sequence encoding and graph encoding methods in peptide self-assembly prediction, providing detailed performance comparisons.
3. **Selection of State-of-the-Art Models**:
   - Experimental results show that the Transformer performs best among sequence encoding methods, while GraphSAGE performs best among graph encoding methods. These two methods exhibit very close performance in peptide self-assembly prediction tasks.
4. **Guidance for Future Research**:
   - The analysis and results of the paper provide important references for future peptide-related prediction tasks, especially in selecting appropriate encoding and modeling methods.

### Conclusion

Through systematic analysis and experiments, the paper demonstrates the superior performance of Transformer and GraphSAGE in peptide self-assembly prediction and provides valuable guidance for future peptide-related research. These results not only help improve the accuracy of peptide self-assembly prediction but also offer new insights for other peptide-related tasks.
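
To make the distinction between the two encoding routes concrete, here is a minimal Python sketch of how one peptide might be turned into a token sequence (the input a sequential model such as an RNN or Transformer would consume) and into a graph (the input a GCN, GAT, or GraphSAGE would consume). It is an illustration under stated assumptions, not the paper's featurization: the residue-level vocabulary, the helper names `encode_sequence` and `encode_graph`, and the choice of one node per residue with edges only between chain neighbours are all hypothetical simplifications. The paper's molecular graphs may be built at a finer granularity.

```python
# Illustrative only: residue-level encodings, not the paper's actual featurization.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 canonical residues, one-letter codes
AA_TO_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def encode_sequence(peptide):
    """Sequence encoding: one integer token per residue, in chain order."""
    return [AA_TO_INDEX[aa] for aa in peptide.upper()]


def encode_graph(peptide):
    """Graph encoding: residues as nodes, chain adjacency as undirected edges.

    Returns (node_features, edges). Each undirected bond is stored as two
    directed pairs, the convention many graph libraries expect.
    """
    node_features = encode_sequence(peptide)
    edges = []
    for i in range(len(peptide) - 1):
        edges.append((i, i + 1))
        edges.append((i + 1, i))
    return node_features, edges


if __name__ == "__main__":
    tokens = encode_sequence("FFK")
    nodes, edges = encode_graph("FFK")
    print(tokens)  # [4, 4, 8]
    print(edges)   # [(0, 1), (1, 0), (1, 2), (2, 1)]
```

In this simplified form the graph carries the same information as the sequence, which is one plausible reading of why the two routes can end up with similar accuracy. A graph built from atoms or coarse-grained beads would add side-chain structure that the token sequence leaves implicit.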

Efficient Prediction of Peptide Self-assembly through Sequential and Graphical Encoding
