Unbiased organism-agnostic and highly sensitive signal peptide predictor with deep protein language model

Junbo Shen, Qinze Yu, Shenyang Chen, Qingxiong Tan, Jingcheng Li, Yu Li
DOI: https://doi.org/10.1038/s43588-023-00576-2
2023-12-14
Abstract: Signal peptide (SP) is a short peptide located in the N-terminus of proteins. It is essential to target and transfer transmembrane and secreted proteins to correct positions. Compared with traditional experimental methods to identify signal peptides, computational methods are faster and more efficient, which are more practical for analyzing thousands or even millions of protein sequences, especially for metagenomic data. Here we present Unbiased Organism-agnostic Signal Peptide Network (USPNet), a signal peptide classification and cleavage site prediction deep learning method that takes advantage of protein language models. We propose to apply label distribution-aware margin loss to handle data imbalance problems and use evolutionary information of protein to enrich representation and overcome species information dependence.
Artificial Intelligence, Quantitative Methods
What problem does this paper attempt to address?
This paper addresses two key problems in signal peptide (SP) prediction:

1. **Data imbalance**: In signal peptide classification, the number of samples per signal peptide type is extremely uneven. Some types have very few samples, while others have many. This imbalance causes existing methods to perform poorly on the minority classes.
2. **Dependence on species information**: Most existing signal peptide prediction tools rely on additional species information to improve prediction performance. In practice, for example when analyzing metagenomic data, this information is often hard to obtain or does not exist. A method that does not depend on it is therefore needed for broader applicability and greater robustness.

To solve these problems, the authors propose a deep learning model named USPNet (Unbiased Organism-agnostic Signal Peptide Network). Its main features are:

- **Protein language models**: USPNet uses protein language models that draw on multiple sequence alignments (MSA) and evolutionary scale modeling (ESM) to better capture the evolutionary and functional information of proteins, which improves prediction accuracy.
- **Handling data imbalance**: The label distribution-aware margin (LDAM) loss is combined with a class-balanced loss to improve generalization on minority classes.
- **No additional species information**: USPNet takes only amino acid sequences as input and requires no species information, which is an advantage when the source of the protein data is unknown or mixed.

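
The imbalance handling described above can be illustrated with a minimal sketch. It follows the commonly published formulations of the LDAM loss (a per-class margin proportional to n_j^(-1/4), subtracted from the true-class logit before a scaled softmax cross-entropy) and of class-balanced weighting by the effective number of samples. This is not USPNet's code: the function names, the class counts, and the hyperparameters `max_margin`, `scale`, and `beta` are assumptions chosen for illustration.

```python
import math

def ldam_margins(class_counts, max_margin=0.5):
    """Per-class margins proportional to n_j^(-1/4), rescaled so the
    rarest class gets max_margin (an assumed hyperparameter)."""
    raw = [1.0 / (n ** 0.25) for n in class_counts]
    factor = max_margin / max(raw)
    return [r * factor for r in raw]

def class_balanced_weights(class_counts, beta=0.999):
    """Effective-number weights (1 - beta) / (1 - beta^n_j),
    normalised to sum to the number of classes."""
    raw = [(1.0 - beta) / (1.0 - beta ** n) for n in class_counts]
    total = sum(raw)
    return [len(class_counts) * r / total for r in raw]

def ldam_loss(logits, label, margins, weights=None, scale=30.0):
    """Cross-entropy for one sample after subtracting the class margin
    from the true-class logit only; optionally weighted per class."""
    adjusted = [
        scale * (z - margins[j]) if j == label else scale * z
        for j, z in enumerate(logits)
    ]
    # Numerically stable log-sum-exp.
    top = max(adjusted)
    log_sum = top + math.log(sum(math.exp(a - top) for a in adjusted))
    nll = log_sum - adjusted[label]
    weight = 1.0 if weights is None else weights[label]
    return weight * nll

# Hypothetical class sizes, from a majority class down to a rare one.
counts = [10000, 2000, 100, 20]
margins = ldam_margins(counts)
weights = class_balanced_weights(counts)
loss = ldam_loss([2.0, 0.5, 0.1, 1.5], label=3, margins=margins, weights=weights)
```

Rare classes receive both a larger margin and a larger weight, so a rare-class sample must be classified with more confidence before its loss becomes small.
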
In addition, USPNet demonstrated the ability to discover new signal peptides in large-scale metagenomic data, revealing 347 potential new signal peptide candidates. These candidates have extremely low sequence similarity (as low as 13%) to the known signal peptides in the training set, but relatively high structural similarity (most TM-scores exceed 0.8). This indicates that USPNet can not only accurately predict known types of signal peptides, but also discover new signal peptides far from existing knowledge. Overall, USPNet provides an efficient, robust, and general-purpose signal peptide prediction tool that is especially suited to large-scale protein sequence data from complex sources.