Deep Confident Steps to New Pockets: Strategies for Docking Generalization

Gabriele Corso, Arthur Deng, Benjamin Fry, Nicholas Polizzi, Regina Barzilay, Tommi Jaakkola
2024-02-28
Abstract: Accurate blind docking has the potential to lead to new biological breakthroughs, but for this promise to be realized, docking methods must generalize well across the proteome. Existing benchmarks, however, fail to rigorously assess generalizability. Therefore, we develop DockGen, a new benchmark based on the ligand-binding domains of proteins, and we show that existing machine learning-based docking models have very weak generalization abilities. We carefully analyze the scaling laws of ML-based docking and show that, by scaling data and model size, as well as integrating synthetic data strategies, we are able to significantly increase the generalization capacity and set new state-of-the-art performance across benchmarks. Further, we propose Confidence Bootstrapping, a new training paradigm that solely relies on the interaction between diffusion and confidence models and exploits the multi-resolution generation process of diffusion models. We demonstrate that Confidence Bootstrapping significantly improves the ability of ML-based docking methods to dock to unseen protein classes, edging closer to accurate and generalizable blind docking methods.
Biomolecules, Machine Learning
What problem does this paper attempt to address?
This paper addresses molecular docking, a task that is crucial for drug discovery, and in particular the question of how well docking methods generalize. Existing benchmarks do not rigorously test generalization across the proteome. The authors therefore develop DockGen, a new benchmark based on the ligand-binding domains of proteins, and use it to show that existing machine learning-based docking models perform poorly when predicting binding conformations in unseen binding pockets. The paper also analyzes the scaling laws of ML-based docking, showing that increasing data and model size, together with synthetic data strategies, significantly improves generalization and sets new state-of-the-art performance across benchmarks. On DockGen, however, the results indicate that with currently available data and computational resources, scaling alone may not be sufficient to close the generalization gap.
To overcome this, the authors propose Confidence Bootstrapping, a self-training paradigm that allows fine-tuning on unseen protein-ligand complexes without access to structural data. It relies only on the interaction between a diffusion model and a confidence model, and uses the multi-resolution generation process of the diffusion model to update the likelihood of early diffusion steps. Through this interaction, the model's performance on unseen targets improves gradually, narrowing the generalization gap.
Experimental results show that Confidence Bootstrapping significantly improves the docking success rate on unseen protein classes, an important step towards accurate and generalizable blind docking methods.
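The self-training idea summarized above can be illustrated with a deliberately tiny toy, which is not the authors' implementation. In this sketch a "pose" is a single number, `sample_poses` is a hypothetical stand-in for the diffusion model, and `confidence` is a hypothetical stand-in for the confidence model. All names, the top-k selection rule, and the update rule are assumptions made for illustration. The point it shows is only the loop structure: the generator is updated from its own samples, ranked by confidence, and no ground-truth pose enters the update.

```python
import random


def sample_poses(mean, n, rng, spread=1.0):
    """Hypothetical stand-in for the diffusion model: draw n candidate poses.

    A pose is just a float here; the generator's only parameter is `mean`.
    """
    return [rng.gauss(mean, spread) for _ in range(n)]


def confidence(pose, pocket):
    """Hypothetical stand-in for the confidence model.

    In this toy the score is computed from the hidden pocket position; a real
    confidence model would be a trained network scoring a predicted pose.
    """
    return 1.0 / (1.0 + abs(pose - pocket))


def bootstrap(mean, pocket, rounds=10, n=64, keep=8, lr=0.5, seed=0):
    """Toy self-training loop: sample, score, keep the best, update."""
    rng = random.Random(seed)
    history = [mean]
    for _ in range(rounds):
        poses = sample_poses(mean, n, rng)
        # Rank the generator's own samples by confidence (assumed top-k rule).
        ranked = sorted(poses, key=lambda p: confidence(p, pocket), reverse=True)
        top = ranked[:keep]
        # Move the generator towards its high-confidence samples.
        target = sum(top) / len(top)
        mean += lr * (target - mean)
        history.append(mean)
    return history


if __name__ == "__main__":
    trace = bootstrap(mean=0.0, pocket=5.0)
    print("start:", trace[0], "end:", round(trace[-1], 3))
```

The toy omits everything specific to diffusion, such as the multi-resolution generation process and the update of early diffusion steps; it only conveys why feedback from a confidence model can steer a generator towards a target it was never given structural data for.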