T4SEpp: a pipeline integrated with protein language models effectively predicting bacterial type IV secreted effectors
Yueming Hu, Yejun Wang, Xiaotian Hu, Haoyu Chao, Sida Li, Qinyang Ni, Yanyan Zhu, Yixue Hu, Ziyi Zhao, Ming Chen
DOI: https://doi.org/10.1101/2023.07.01.547179
2023-01-01
bioRxiv
Abstract: Many pathogenic bacteria use type IV secretion systems (T4SSs) to deliver effectors (T4SEs) into the cytoplasm of eukaryotic cells, causing diseases. Identifying these effectors is a crucial step in understanding the mechanisms of bacterial pathogenicity, but it remains a major challenge. In this study, we used the full-length embedding features generated by six pre-trained protein language models to train classifiers predicting T4SEs, and compared their performance. An integrated model, T4SEpp, was assembled from three modules: a module searching for full-length, signal sequence and effector domain homologs of known T4SEs; a machine learning module based on hand-crafted features extracted from the signal sequences; and a module containing the three best-performing pre-trained protein language models. T4SEpp outperformed the other state-of-the-art (SOTA) software tools, achieving ∼0.95 sensitivity at a high specificity of ∼0.99 on an independent testing dataset. Additionally, we performed a comprehensive search among 8,761 bacterial species, which identified 227 species, belonging to 3 phyla and 117 genera, that possess T4SSs. Applying T4SEpp to these species, we identified a total of 12,622 plausible T4SEs. Overall, T4SEpp provides a better solution to assist in the identification of bacterial T4SEs, and facilitates studies of bacterial pathogenicity. T4SEpp is freely accessible at <https://bis.zju.edu.cn/T4SEpp>.
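For readers less familiar with the two metrics quoted in the abstract, the following minimal sketch shows how sensitivity and specificity are conventionally computed from binary predictions. It is not code from T4SEpp: the function name `sensitivity_specificity` and the toy labels are assumptions made purely for illustration, with 1 standing for a T4SE and 0 for a non-effector.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels (1 = T4SE, 0 = non-T4SE)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # effectors correctly predicted
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # effectors missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-effectors correctly rejected
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-effectors wrongly predicted

    # Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy labels invented for illustration only (not data from the study).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Under these definitions, a sensitivity of about 0.95 at a specificity of about 0.99 means that roughly 95% of true effectors are recovered while about 1% of non-effectors are falsely called.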
### Competing Interest Statement
The authors have declared no competing interest.