PHIStruct: Improving phage-host interaction prediction at low sequence similarity settings using structure-aware protein embeddings

Mark Edward M. Gonzales, Jennifer C. Ureta, Anish M.S. Shrestha
DOI: https://doi.org/10.1101/2024.08.24.609479
2024-08-24
Abstract: Recent computational approaches for predicting phage-host interaction have explored the use of sequence-only protein language models to produce embeddings of phage proteins without manual feature engineering. However, these embeddings do not directly capture protein structure information and structure-informed signals related to host specificity. We present PHIStruct, a multilayer perceptron that takes in structure-aware embeddings of receptor-binding proteins, generated via the structure-aware protein language model SaProt, and then predicts the host from among the ESKAPEE genera. Compared against recent tools, PHIStruct exhibits the best balance of precision and recall, with the highest and most stable F1 score across a wide range of confidence thresholds and sequence similarity settings. The margin in performance is most pronounced when the sequence similarity between the training and test sets drops below 40%, wherein, at a relatively high-confidence threshold of above 50%, PHIStruct presents a 7% to 9% increase in class-averaged F1 over machine learning tools that do not directly incorporate structure information, as well as a 5% to 6% increase over BLASTp. The data and source code for our experiments and analyses are available at https://github.com/bioinfodlsu/PHIStruct.
Bioinformatics
What problem does this paper attempt to address?
This paper addresses how to improve the accuracy of phage-host interaction prediction when the sequence similarity between training and test data is low. Existing sequence-only protein language models can generate embeddings of phage proteins, but these embeddings do not directly capture protein structure information or the structure-informed signals related to host specificity, so prediction performance may degrade at low sequence similarity. To address this, the authors introduce PHIStruct, a multilayer perceptron that uses structure-aware protein embeddings to predict phage-host interactions. According to the reported results, incorporating structure information improves both the accuracy and the stability of prediction in the low-sequence-similarity setting.

### Main contributions

1. **Dataset construction**: The authors constructed a dataset of 7,627 non-redundant receptor-binding proteins (RBPs) from 3,350 phages that target hosts in the ESKAPEE genera.
2. **Structure-aware embedding generation**: The authors used the structure-aware protein language model SaProt to generate embeddings of the RBPs.
3. **Model training and evaluation**: The authors trained a two-layer perceptron that takes the SaProt embeddings as input and predicts the host genus. PHIStruct outperforms existing tools in the low-sequence-similarity setting: when train-test sequence similarity drops below 40% and the confidence threshold is above 50%, it shows a 7% to 9% increase in class-averaged F1 over machine learning tools that do not directly incorporate structure information, and a 5% to 6% increase over BLASTp.

### Conclusion

By using structure-aware protein embeddings, PHIStruct improves phage-host interaction prediction at low sequence similarity. This may support future work on phage therapy and drug-resistance research.

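To make the model described in contribution 3 more concrete, the following is a minimal sketch of a forward pass through a two-layer perceptron that maps a fixed-length protein embedding to a probability for each of the seven ESKAPEE genera. It is an illustration only, not the authors' implementation: the toy embedding size, the hidden size, the ReLU activation, the random initialization, and all function names are assumptions, and the reading of "two-layer" as one hidden layer plus an output layer is also an assumption.

```python
import math
import random
from typing import List, Tuple

# The seven ESKAPEE genera (the output classes).
ESKAPEE_GENERA = [
    "Enterococcus", "Staphylococcus", "Klebsiella", "Acinetobacter",
    "Pseudomonas", "Enterobacter", "Escherichia",
]

Layer = Tuple[List[List[float]], List[float]]  # (weights, biases)


def init_layer(n_in: int, n_out: int, rng: random.Random) -> Layer:
    """Randomly initialize a dense layer (hypothetical initialization scheme)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x: List[float], layer: Layer) -> List[float]:
    """Affine transform: one output per weight row."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def softmax(z: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def forward(embedding: List[float], hidden: Layer, output: Layer) -> List[float]:
    """Embedding -> hidden layer (ReLU, assumed) -> class probabilities."""
    h = [max(0.0, v) for v in dense(embedding, hidden)]
    return softmax(dense(h, output))


rng = random.Random(0)
EMBED_DIM, HIDDEN_DIM = 16, 8  # toy sizes; real embeddings are much larger
hidden_layer = init_layer(EMBED_DIM, HIDDEN_DIM, rng)
output_layer = init_layer(HIDDEN_DIM, len(ESKAPEE_GENERA), rng)

# Stand-in for a structure-aware embedding of one receptor-binding protein.
toy_embedding = [rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]
probs = forward(toy_embedding, hidden_layer, output_layer)
predicted_genus = ESKAPEE_GENERA[probs.index(max(probs))]
```

The largest entry of `probs` can be read as the prediction confidence, which is the quantity a confidence threshold would be compared against. With untrained weights the output is meaningless; the sketch only shows the shape of the computation.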
### Formula summary

To evaluate model performance, the authors define macro-precision, macro-recall, and macro-F1. These metrics are parameterized by the confidence threshold \( k \) and the maximum train-test sequence similarity \( s \):

\[ \text{MACRO-PRECISION}_{k,s}=\frac{1}{|C|}\sum_{c\in C}\frac{\text{TP}_{c,k,s}}{\text{TP}_{c,k,s}+\text{FP}_{c,k,s}} \]

\[ \text{MACRO-RECALL}_{k,s}=\frac{1}{|C|}\sum_{c\in C}\frac{\text{TP}_{c,k,s}}{\text{TP}_{c,k,s}+\text{FN}_{c,k,s}} \]

\[ \text{MACRO-F1}_{k,s}=\frac{1}{|C|}\sum_{c\in C}\frac{2\cdot\text{TP}_{c,k,s}}{\left(\text{TP}_{c,k,s}+\text{FP}_{c,k,s}\right)+\left(\text{TP}_{c,k,s}+\text{FN}_{c,k,s}\right)} \]

where:

- \( C \) is the set of ESKAPEE class labels.
- \( \text{TP}_{c,k,s} \), \( \text{TN}_{c,k,s} \), \( \text{FP}_{c,k,s} \) and \( \text{FN}_{c,k,s} \) are the numbers of true positives, true negatives, false positives and false negatives for class \( c \) at confidence threshold \( k \) and maximum train-test sequence similarity \( s \), respectively.

These formulas are used to compare the performance of different models under different conditions.
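
The three macro-averaged metrics can be computed with a short sketch like the one below. It is an illustration under stated assumptions, not the authors' evaluation code: the function names are hypothetical, a test item whose top class probability falls below \( k \) is assumed to receive no class label (so it counts as a false negative for its true class), and a per-class ratio with a zero denominator is assumed to contribute 0. The similarity parameter \( s \) does not appear in the code because it only determines which train-test split the labels and probabilities come from.

```python
from typing import Dict, List, Optional, Sequence


def predict_with_threshold(
    probs: Sequence[Sequence[float]], classes: Sequence[str], k: float
) -> List[Optional[str]]:
    """Return the top class per row, or None if its probability is below k (assumed rule)."""
    preds: List[Optional[str]] = []
    for row in probs:
        best = max(range(len(classes)), key=lambda i: row[i])
        preds.append(classes[best] if row[best] >= k else None)
    return preds


def macro_metrics(
    y_true: Sequence[str], y_pred: Sequence[Optional[str]], classes: Sequence[str]
) -> Dict[str, float]:
    """Class-averaged precision, recall, and F1 from per-class TP, FP, and FN counts."""
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        # Zero denominators contribute 0 (an assumption of this sketch).
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
        f1s.append(2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0)
    n = len(classes)
    return {
        "macro_precision": sum(precisions) / n,
        "macro_recall": sum(recalls) / n,
        "macro_f1": sum(f1s) / n,
    }


# Toy example with two hypothetical classes and a confidence threshold of 0.6.
toy_classes = ["A", "B"]
toy_true = ["A", "A", "B", "B"]
toy_probs = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8], [0.55, 0.45]]
toy_pred = predict_with_threshold(toy_probs, toy_classes, k=0.6)
toy_scores = macro_metrics(toy_true, toy_pred, toy_classes)
```

In this toy example the last item is left unlabeled because its top probability (0.55) is below the threshold. Raising \( k \) generally trades recall for precision, which is why the summary reports the metrics as a function of the threshold.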