PIPENN-EMB: ensemble net and protein embeddings generalise protein interface prediction beyond homology

David P.G. Thomas, Carlos M. Garcia Fernandez, Reza Haydarlou, K. Anton Feenstra
DOI: https://doi.org/10.1101/2024.10.31.621117
2024-11-02
Abstract: Protein interactions are crucial for understanding biological functions and disease mechanisms, but predicting them remains a complex task in computational biology. Deep learning models are increasingly successful at interface prediction. This study presents PIPENN-EMB, which explores the added value of using embeddings from the ProtT5-XL protein language model. Our results show a substantial improvement over the previously published PIPENN model for protein interaction interface prediction, reaching an MCC of 0.313 vs. 0.249 and an AUC-ROC of 0.800 vs. 0.755 on the BIO_DL_TE test set. Ablation studies furthermore show that these embeddings cover a broad range of 'hand-crafted' protein features. PIPENN-EMB reaches state-of-the-art performance on the ZK448 dataset for protein-protein interface prediction. We showcase predictions on 25 resistance-related proteins from Mycobacterium tuberculosis. Finally, whereas other state-of-the-art sequence-based methods perform worse on proteins with little recognisable homology to their training data, PIPENN-EMB generalises to remote homologs: it yields a stable AUC-ROC across all three test sets for proteins with less than 30% sequence identity to the training dataset, and even for proteins with less than 15% sequence identity.
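For readers unfamiliar with the two metrics quoted in the abstract, the following is a minimal sketch of how MCC and AUC-ROC can be computed for per-residue interface predictions. It is not the authors' evaluation code. The function names `mcc` and `auc_roc`, the 0.5 decision threshold, and the toy labels and scores are all assumptions made for illustration and are not taken from the paper.

```python
import math

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (1 = interface residue)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is reported as 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0

def auc_roc(y_true, scores):
    """AUC-ROC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy example: 8 residues, 3 of which are true interface residues.
labels = [1, 1, 0, 0, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.1, 0.6, 0.8, 0.2, 0.05]
preds = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

print(f"MCC     = {mcc(labels, preds):.3f}")
print(f"AUC-ROC = {auc_roc(labels, scores):.3f}")
```

MCC depends on the chosen decision threshold, whereas AUC-ROC is threshold-free, which is one reason the two are commonly reported together for class-imbalanced tasks such as interface prediction.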
Bioinformatics