Abstract: Predicting protein-protein interactions (PPIs) is vital for elucidating fundamental biology, designing peptide therapeutics, and annotating proteins at high throughput. This is particularly relevant in the current biotechnology landscape, where the proliferation of protein generative models calls for a high-throughput, generalized PPI predictor that works on proteins regardless of conventional motifs or known biological functions. Our work addresses this need and provides strong evidence of the utility and reliability of protein language models (pLMs) in learning the PPI objective. We demonstrate that, given a sizable balanced dataset, pLMs achieve state-of-the-art performance in PPI prediction on diverse proteins. To build a dataset that approximates these conditions, we implemented a novel synthetic data generation scheme to augment the BIOGRID and Negatome datasets. We then used the augmented datasets to fine-tune ProtBERT for PPI prediction, yielding a model that we call SYNTERACT (SYNThetic data-driven protein-protein intERACtion Transformer). SYNTERACT reaches 92% accuracy on validated positive and negative interacting pairs derived from 50 different organisms, all of which were excluded from training. Beyond these metrics, secondary analysis revealed that our synthetic negative data successfully mimicked actual negative samples, further supporting the integrity of synthetic data additions to PPI datasets. Another notable finding was the ease with which previously existing PPI datasets could be predicted from simplistic features, which calls into question whether they can actually inform PPI prediction. We find that the subcellular compartment bias inherent to the compilation of these datasets is learnable with deep learning methods, and we demonstrate that our approach does not suffer from this bias.
