UniproLcad: Accurate Identification of Antimicrobial Peptide by Fusing Multiple Pre-Trained Protein Language Models

Xiao Wang, Zhou Wu, Rong Wang, Xu Gao
DOI: https://doi.org/10.3390/sym16040464
2024-04-11
Symmetry
Abstract: Antimicrobial peptides (AMPs) are vital components of innate immunotherapy. Existing approaches mainly rely on either deep learning for the automatic extraction of sequence features or traditional manual amino acid features combined with machine learning. Peptide sequences contain symmetrical sequence motifs or repetitive amino acid patterns, which may be related to the function and structure of the peptide. Recently, the advent of large language models has significantly boosted the representational power of sequence pattern features. In light of this, we present a novel AMP predictor called UniproLcad, which integrates three prominent protein language models (ESM-2, ProtBert, and UniRep) to obtain a more comprehensive representation of protein features. UniproLcad uses deep learning networks, namely a bidirectional long short-term memory network (Bi-LSTM) and a one-dimensional convolutional neural network (1D-CNN), together with an attention mechanism. These networks, coupled with the pre-trained language models, extract multi-view features from antimicrobial peptide sequences and assign attention weights to them. In ten-fold cross-validation and independent testing, UniproLcad demonstrates competitive performance in antimicrobial peptide identification. Integrating diverse language models and deep learning architectures improves the accuracy and reliability of AMP prediction and contributes to the advancement of computational methods in this field.
multidisciplinary sciences
What problem does this paper attempt to address?
This paper aims to develop a more accurate and comprehensive method for predicting antimicrobial peptides (AMPs). The researchers address two limitations of existing AMP predictors: they do not fully capture the diversity of the data distribution, and a single protein language model gives an incomplete representation of a peptide. Their proposed method is called UniproLcad. Its main contributions are:

1. **Integration of multiple protein language models**: The researchers combine three mainstream protein language models (ESM-2, ProtBert, and UniRep) to obtain a more comprehensive representation of protein features. These models are based on different architectures (Transformer, BERT, and RNN, respectively), so they capture information from peptide sequences from multiple perspectives.
2. **Use of deep neural network structures**: To further improve performance, UniproLcad employs a bidirectional long short-term memory network (Bi-LSTM) and a one-dimensional convolutional neural network (1D-CNN), and uses an attention mechanism to focus on important features.
3. **Capturing symmetry in peptide sequences**: Peptide sequences can contain symmetrical motifs or repetitive amino acid patterns that may be related to the function and structure of the peptide. The Bi-LSTM and 1D-CNN are intended to identify and extract these symmetrical features.
4. **Prediction accuracy and generalization**: In ten-fold cross-validation and on independent test sets, UniproLcad performs competitively with existing AMP prediction methods.

In summary, UniproLcad is an AMP prediction tool that integrates multiple protein language models with deep learning techniques, aiming to overcome the shortcomings of existing methods and provide more accurate and reliable predictions.
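The multi-view fusion idea described above can be illustrated with a small plain-Python sketch. This is not the authors' implementation: the function names (`attention_fuse`, `l2_normalize`), the toy embedding values, and the hand-supplied `scores` are all assumptions made for illustration. In a trained model such as UniproLcad, the attention scores would come from learned parameters, and the embeddings would be the outputs of ESM-2, ProtBert, and UniRep rather than two- or three-element lists. The sketch only shows one plausible way to weight several embedding views with a softmax and concatenate them into a single feature vector.

```python
import math
from typing import Dict, List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of raw scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length so large-valued views do not dominate."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def attention_fuse(
    views: Dict[str, List[float]],
    scores: Dict[str, float],
) -> Tuple[Dict[str, float], List[float]]:
    """Weight each view by softmax(score), then concatenate the weighted views.

    `scores` are hypothetical relevance scores supplied by hand; a trained
    network would compute them from learned parameters instead.
    """
    names = list(views)
    weights = dict(zip(names, softmax([scores[n] for n in names])))
    fused: List[float] = []
    for name in names:
        fused.extend(weights[name] * x for x in l2_normalize(views[name]))
    return weights, fused


# Toy per-peptide embeddings (illustrative values, not real model outputs).
views = {
    "esm2": [3.0, 4.0],
    "protbert": [1.0, 0.0, 0.0],
    "unirep": [0.0, 2.0],
}
# Equal scores give every view the same attention weight.
scores = {"esm2": 0.0, "protbert": 0.0, "unirep": 0.0}

weights, fused = attention_fuse(views, scores)
print(weights)
print(len(fused))  # 2 + 3 + 2 = 7 fused dimensions
```

Normalizing each view before weighting is a design choice of this sketch: it keeps a model with larger activation magnitudes from dominating the fused vector regardless of its attention weight. Raising one score, for example `{"esm2": 2.0, "protbert": 0.0, "unirep": 0.0}`, shifts weight toward that view while the weights still sum to one.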