Abstract: Accurate blind docking has the potential to lead to new biological breakthroughs, but for this promise to be realized, docking methods must generalize well across the proteome. Existing benchmarks, however, fail to rigorously assess generalizability. Therefore, we develop DockGen, a new benchmark based on the ligand-binding domains of proteins, and we show that existing machine learning-based docking models have very weak generalization abilities. We carefully analyze the scaling laws of ML-based docking and show that, by scaling data and model size, as well as integrating synthetic data strategies, we are able to significantly increase the generalization capacity and set new state-of-the-art performance across benchmarks. Further, we propose Confidence Bootstrapping, a new training paradigm that solely relies on the interaction between diffusion and confidence models and exploits the multi-resolution generation process of diffusion models. We demonstrate that Confidence Bootstrapping significantly improves the ability of ML-based docking methods to dock to unseen protein classes, edging closer to accurate and generalizable blind docking methods.

What problem does this paper attempt to address?

This paper addresses the problem of generalization in molecular docking, a task that is central to drug discovery. Existing benchmarks do not rigorously test how well docking methods generalize across the proteome. The authors therefore develop a new benchmark, DockGen, which is based on the ligand-binding domains of proteins, and use it to show that existing machine learning-based docking models perform poorly when predicting binding poses for previously unseen binding pockets. The paper also analyzes the scaling laws of ML-based docking and shows that increasing data and model size, together with synthetic data strategies, significantly improves generalization and sets new state-of-the-art results across benchmarks. Results on DockGen nevertheless suggest that, with the data and computational resources currently available, scaling alone may not be enough to close the generalization gap.

To overcome this, the authors propose Confidence Bootstrapping, a self-training paradigm that relies only on the interaction between a diffusion model and a confidence model, and that exploits the multi-resolution generation process of the diffusion model to update the likelihood of early diffusion steps. It allows the model to be fine-tuned on unseen protein-ligand complexes without access to their structural data, gradually improving performance on those targets and narrowing the generalization gap. Experimental results show that Confidence Bootstrapping significantly improves the docking success rate on unseen protein classes, an important step towards accurate and generalizable blind docking methods.
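The self-training loop described above (sample candidate poses, score them with a confidence model, then fine-tune on the high-confidence samples) can be illustrated with a deliberately simplified sketch. Everything in it is an assumption made for illustration and is not the authors' implementation: the "diffusion model" is reduced to a one-dimensional Gaussian sampler, the "confidence model" is a hand-written score around a hidden optimum, and the names `sample_poses`, `confidence`, and `confidence_bootstrap` are hypothetical.

```python
import random
import statistics


def sample_poses(center, spread, n, rng):
    """Toy stand-in for the diffusion model: draw n one-dimensional 'poses'."""
    return [rng.gauss(center, spread) for _ in range(n)]


def confidence(pose, hidden_optimum=3.0):
    """Toy stand-in for the confidence model (assumption, not a learned model).

    Returns a score in (0, 1] that is highest at the hidden optimum. A real
    confidence model is trained separately and never sees the true pose.
    """
    return 1.0 / (1.0 + abs(pose - hidden_optimum))


def confidence_bootstrap(center=0.0, spread=1.0, rounds=20,
                         n_samples=64, keep=8, seed=0):
    """Repeatedly sample, keep the most confident poses, and re-centre on them.

    Re-centring the sampler plays the role of fine-tuning the generative model
    on its own high-confidence outputs; no ground-truth structure is used.
    """
    rng = random.Random(seed)
    for _ in range(rounds):
        poses = sample_poses(center, spread, n_samples, rng)
        confident = sorted(poses, key=confidence, reverse=True)[:keep]
        center = statistics.fmean(confident)
    return center


if __name__ == "__main__":
    print(confidence_bootstrap())
```

In this toy setting the sampler starts centred at 0.0 and drifts towards the region the confidence function prefers, which mirrors, in a very reduced form, how feedback from a confidence model could steer a generative model towards better poses on targets with no known structures.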

Deep Confident Steps to New Pockets: Strategies for Docking Generalization
