PINDER: The protein interaction dataset and evaluation resource
Daniel Kovtun, Mehmet Akdel, Alexander Goncearenco, Guoqing Zhou, Graham Holt, David Baugher, Dejun Lin, Yusuf Adeshina, Thomas Castiglione, Xiaoyun Wang, Celine Marquet, Matt McPartlon, Tomas Geffner, Emanuele Rossi, Gabriele Corso, Hannes Stark, Zachary Carpenter, Emine Kucukbenli, Michael Bronstein, Luca Naef
DOI: https://doi.org/10.1101/2024.07.17.603980
2024-08-13
Abstract: Protein-protein interactions (PPIs) are fundamental to understanding biological processes and play a key role in therapeutic advancements. As deep-learning docking methods for PPIs gain traction, benchmarking protocols and datasets tailored for effective training and evaluation of their generalization capabilities and performance across real-world scenarios become imperative. Aiming to overcome limitations of existing approaches, we introduce pinder, a comprehensive annotated dataset that uses structural clustering to derive non-redundant interface-based data splits and includes holo (bound), apo (unbound), and computationally predicted structures. pinder consists of 2,319,564 dimeric PPI systems (and up to 25 million augmented PPIs) and 1,955 high-quality test PPIs with interface data leakage removed. Additionally, pinder provides a test subset with 180 dimers for comparison to AlphaFold-Multimer without any interface leakage with respect to its training set. Unsurprisingly, the pinder benchmark reveals that the performance of existing docking models is highly overestimated when evaluated on leaky test sets. Most importantly, by retraining DiffDock-PP on pinder interface-clustered splits, we show that interface cluster-based sampling of the training split, along with the diverse and less leaky validation split, leads to strong generalization improvements.
Bioinformatics