PLM-interact: extending protein language models to predict protein-protein interactions

Dan Liu, Francesca Young, Kieran D. Lamb, Adalberto Claudio Quiros, Alexandrina Pancheva, Crispin Miller, Craig Macdonald, David L Robertson, Ke Yuan
DOI: https://doi.org/10.1101/2024.11.05.622169
2024-11-07
Abstract: Computational prediction of protein structure from amino acid sequences alone has been achieved with unprecedented accuracy, yet the prediction of protein-protein interactions (PPIs) remains an outstanding challenge. Here we assess the ability of protein language models (PLMs), routinely applied to protein folding, to be retrained for PPI prediction. Existing PPI prediction models that exploit PLMs use a pre-trained PLM feature set, ignoring that the proteins are physically interacting. Our novel method, PLM-interact, goes beyond a single protein, jointly encoding protein pairs to learn their relationships, analogous to the next-sentence prediction task from natural language processing. This approach provides a significant improvement in performance: Trained on human-human PPIs, PLM-interact predicts mouse, fly, worm, E. coli and yeast PPIs, achieving between 16-28% improvements in AUPR compared with state-of-the-art PPI models. Additionally, it can detect changes that disrupt or cause PPIs and be applied to virus-host PPI prediction. Our work demonstrates that large language models can be extended to learn the intricate relationships among biomolecules from sequences alone.
Bioinformatics
What problem does this paper attempt to address?
This paper addresses the prediction of protein-protein interactions (PPIs). Although predicting protein structure from amino acid sequences alone has reached unprecedented accuracy, PPI prediction remains an unsolved challenge. Existing PPI prediction models usually rely on pre-trained protein language models (PLMs), but these PLMs are trained on the sequences of individual proteins and ignore the physical interactions between proteins. As a result, such models perform poorly at PPI prediction, especially across species.

To overcome this, the paper proposes **PLM-interact**, which extends existing PLMs so that they model PPIs directly. Specifically, PLM-interact jointly encodes protein pairs to learn the relationships between them, analogous to the next-sentence prediction task in natural language processing. The method improves PPI prediction performance, can detect mutations that cause or disrupt PPIs, and can be applied to virus-host PPI prediction.

### Main contributions

1. **Improved PPI prediction performance**: PLM-interact shows a significant performance improvement in PPI prediction across multiple species. Trained on human-human PPI data, it improves the area under the precision-recall curve (AUPR) by 16-28% over the best existing models when predicting PPIs in mouse, fruit fly, nematode, Escherichia coli and yeast.
2. **Mutation effect prediction**: PLM-interact can accurately predict the impact of a mutation on a PPI, whether the mutation causes a new interaction or disrupts an existing one.
3. **Virus-host PPI prediction**: PLM-interact also performs well on the virus-human PPI prediction task. Compared to other models, it improves AUPR, F1 and MCC by 5.7%, 10.9% and 11.9%, respectively.
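
The joint encoding of a protein pair described above can be illustrated with a minimal sketch. It assumes a BERT-style layout in which both sequences are concatenated into a single input separated by special tokens. The function name `build_pair_input`, the token strings `<cls>` and `<eos>`, and the segment ids are illustrative assumptions, not details taken from the paper.

```python
def build_pair_input(seq_a, seq_b, cls_token="<cls>", sep_token="<eos>"):
    """Concatenate two protein sequences into one token list.

    Hypothetical layout (an assumption, not confirmed by the paper):
        <cls> A_1 ... A_n <eos> B_1 ... B_m <eos>
    Returns the tokens and a parallel list of segment ids
    (0 for the first protein and its delimiters, 1 for the second).
    """
    tokens_a = [cls_token] + list(seq_a) + [sep_token]
    tokens_b = list(seq_b) + [sep_token]
    tokens = tokens_a + tokens_b
    segment_ids = [0] * len(tokens_a) + [1] * len(tokens_b)
    return tokens, segment_ids


# Toy example with two very short, made-up sequences.
tokens, segment_ids = build_pair_input("MK", "AV")
print(tokens)
print(segment_ids)
```

In a setup like this, both proteins share one input, so a transformer's self-attention can relate residues across the pair instead of embedding each protein independently.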
### Method overview

- **Model architecture**: PLM-interact extends and fine-tunes the pre-trained ESM-2 model. Two main extensions are introduced:
  1. **Longer sequence length**: allows longer amino acid sequence pairs to be processed.
  2. **"Next-sentence" prediction task**: trains the model to recognize whether a protein pair interacts.
- **Training process**: The model is trained on human PPI data and jointly optimized with a masked language modeling task and a binary classification task. The loss function is:

  \[ \mathcal{L} = \alpha \mathcal{L}_{\text{MLM}} + \beta \mathcal{L}_{\text{CLS}} \]

  where \(\mathcal{L}_{\text{MLM}}\) is the masked language modeling loss, \(\mathcal{L}_{\text{CLS}}\) is the classification loss, and \(\alpha\) and \(\beta\) are weight parameters.

### Experimental results

- **Cross-species prediction**: PLM-interact shows a significant performance improvement in PPI prediction for mouse, fruit fly, nematode, yeast and Escherichia coli.
- **Mutation effect prediction**: The experiments show that PLM-interact can accurately predict the impact of mutations on PPIs.
- **Virus-host PPI prediction**: On the virus-human PPI prediction task, PLM-interact outperforms existing models.

In conclusion, the paper significantly improves PPI prediction performance by introducing PLM-interact and shows its potential for mutation effect prediction and virus-host PPI prediction.
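
The combined training objective given in the method overview can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy probabilities, and the default weights `alpha=1.0` and `beta=1.0` are placeholders, since the summary does not report the values used.

```python
import math


def masked_lm_loss(token_probs, target_ids, masked_positions):
    """Mean negative log-likelihood of the true residue at each masked position.

    token_probs[i][k] is the predicted probability of vocabulary item k at position i.
    """
    losses = [-math.log(token_probs[i][target_ids[i]]) for i in masked_positions]
    return sum(losses) / len(losses)


def classification_loss(p_interact, label):
    """Binary cross-entropy for one protein pair (label 1 = interacting, 0 = not)."""
    return -(label * math.log(p_interact) + (1 - label) * math.log(1.0 - p_interact))


def combined_loss(l_mlm, l_cls, alpha=1.0, beta=1.0):
    """L = alpha * L_MLM + beta * L_CLS; the default weights are placeholders."""
    return alpha * l_mlm + beta * l_cls


# Toy inputs (made up): 2 positions, vocabulary of 2 items, both positions masked.
token_probs = [[0.5, 0.5], [0.25, 0.75]]
l_mlm = masked_lm_loss(token_probs, target_ids=[0, 1], masked_positions=[0, 1])
l_cls = classification_loss(p_interact=0.8, label=1)
print(combined_loss(l_mlm, l_cls, alpha=1.0, beta=1.0))
```

Raising `beta` relative to `alpha` puts more weight on the interaction label than on reconstructing masked residues. The sketch assumes predicted probabilities lie strictly between 0 and 1, so the logarithms are defined.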