BindingGYM: A Large-Scale Mutational Dataset Toward Deciphering Protein-Protein Interactions

Wei Lu, Jixian Zhang, Ming Gu, Shuangjia Zheng

DOI: https://doi.org/10.1101/2024.12.03.626712

2024-12-07

Abstract: Protein-protein interactions are crucial for drug discovery and understanding biological mechanisms. Despite significant advances in predicting the structures of protein complexes, led by AlphaFold3, determining the strength of these interactions accurately remains a challenge. Traditional low-throughput experimental methods do not generate sufficient data for comprehensive benchmarking or training deep learning models. Deep mutational scanning (DMS) experiments provide rich, high-throughput data; however, they are often used incompletely, neglecting to consider the binding partners, and on a per-study basis without assessing the generalization capabilities of fine-tuned models across different assays. To address these limitations, we collected over ten million raw DMS data points and refined them to half a million high-quality points from twenty-five assays, focusing on protein-protein interactions. We intentionally excluded non-PPI DMS data pertaining to intrinsic protein properties, such as fluorescence or catalytic activity. Our dataset meticulously pairs binding energies with the sequences and structures of all interacting partners using a comprehensive pipeline, recognizing that interactions inherently involve at least two proteins. This curated dataset serves as a foundation for benchmarking and training the next generation of deep learning models focused on protein-protein interactions, thereby opening the door to a plethora of high-impact applications including understanding cellular networks and advancing drug target discovery and development.

Biophysics

What problem does this paper attempt to address?

This paper addresses the problem of predicting the binding strength of protein-protein interactions (PPIs). Despite significant progress in predicting the structures of protein complexes, led by AlphaFold3, accurately predicting the strength of these interactions remains a challenge. Traditional low-throughput experimental methods cannot generate enough data for comprehensive benchmarking or for training deep learning models. High-throughput deep mutational scanning (DMS) experiments provide abundant data, but these data are usually used incompletely: binding partners are ignored, and models are fine-tuned on a per-study basis without evaluating how well they generalize across different assays. To address these limitations, the authors collected more than 10 million raw DMS data points and refined them into about 500,000 high-quality data points from 25 assays, all focused on protein-protein interactions. Non-PPI DMS data related to intrinsic protein properties, such as fluorescence or catalytic activity, were intentionally excluded, and each binding energy is paired with the sequences and structures of all interacting partners. The resulting dataset is intended as a foundation for benchmarking and training the next generation of deep learning models for PPIs, with applications such as understanding cellular networks and advancing drug target discovery and development.
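The two ideas in this summary, pairing each measurement with all interacting partners and testing generalization across assays, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `BindingRecord` fields, the `chain:wild-type/position/mutant` mutation notation, the toy sequences and scores, and the leave-one-assay-out split itself. None of it is taken from the BindingGYM release, and the paper's actual schema and evaluation protocol may differ.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class BindingRecord:
    """One mutational measurement. All field names are hypothetical."""
    assay_id: str             # DMS assay the data point came from
    partner_sequences: tuple  # sequences of ALL interacting chains, not only the mutated one
    mutations: tuple          # e.g. ("A:K2A",): chain A, Lys at position 2 mutated to Ala
    binding_score: float      # binding measurement reported by the assay


def leave_one_assay_out(records):
    """Yield (held_out_assay, train, test) so that each assay is tested unseen."""
    by_assay = defaultdict(list)
    for rec in records:
        by_assay[rec.assay_id].append(rec)
    for held_out in sorted(by_assay):
        train = [r for a, recs in by_assay.items() if a != held_out for r in recs]
        yield held_out, train, by_assay[held_out]


# Toy data, invented purely to exercise the split.
records = [
    BindingRecord("assay_1", ("MKT", "GSH"), ("A:K2A",), -0.4),
    BindingRecord("assay_1", ("MKT", "GSH"), ("B:S2T",), 0.1),
    BindingRecord("assay_2", ("ACD", "EFG"), ("A:C2W",), -1.2),
]
splits = list(leave_one_assay_out(records))
for held_out, train, test in splits:
    print(held_out, len(train), len(test))
```

Holding out whole assays, rather than random rows, keeps every test measurement from an assay the model never saw during training, which is the kind of cross-assay generalization the summary says per-study fine-tuning leaves unassessed.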

