PLINDER: The protein-ligand interactions dataset and evaluation resource
Janani Durairaj, Yusuf Adeshina, Zhonglin Cao, Xuejin Zhang, Vladas Oleinikovas, Thomas Duignan, Zachary McClure, Xavier Robin, Gabriel Studer, Daniel Kovtun, Emanuele Rossi, Guoqing Zhou, Srimukh Veccham, Clemens Isert, Yuxing Peng, Prabindh Sundareson, Mehmet Akdel, Gabriele Corso, Hannes Stärk, Gerardo Tauriello, Zachary Carpenter, Michael Bronstein, Emine Kucukbenli, Torsten Schwede, Luca Naef
DOI: https://doi.org/10.1101/2024.07.17.603955
2024-07-19
Abstract: Protein-ligand interactions (PLI) are foundational to small molecule drug design. With computational methods striving towards experimental accuracy, there is a critical demand for a well-curated and diverse PLI dataset. Existing datasets are often limited in size and diversity, and commonly used evaluation sets suffer from training information leakage, hindering the realistic assessment of method generalization capabilities.
To address these shortcomings, we present PLINDER, the largest and most extensively annotated PLI dataset to date, comprising 449,383 PLI systems, each with over 500 annotations, similarity metrics at protein, pocket, interaction and ligand levels, and paired unbound (apo) and predicted structures.
We propose an approach to generate training and evaluation splits that minimizes task-specific leakage and maximizes test set quality, and compare the resulting performance of DiffDock when re-trained with different kinds of splits.
Biochemistry