Improving generalisability of 3D binding affinity models in low data regimes

Julia Buhmann, Ward Haddadin, Lukáš Pravda, Alan Bilsland, Hagen Triendl
2024-09-19
Abstract: Predicting protein-ligand binding affinity is an essential part of computer-aided drug design. However, generalisable and performant global binding affinity models remain elusive, particularly in low data regimes. Despite the evolution of model architectures, current benchmarks are not well-suited to probe the generalisability of 3D binding affinity models. Furthermore, 3D global architectures such as GNNs have not lived up to performance expectations. To investigate these issues, we introduce a novel split of the PDBBind dataset, minimizing similarity leakage between train and test sets and allowing for a fair and direct comparison between various model architectures. On this low similarity split, we demonstrate that, in general, 3D global models are superior to protein-specific local models in low data regimes. We also demonstrate that the performance of GNNs benefits from three novel contributions: supervised pre-training via quantum mechanical data, unsupervised pre-training via small molecule diffusion, and explicitly modeling hydrogen atoms in the input graph. We believe that this work introduces promising new approaches to unlock the potential of GNN architectures for binding affinity modelling.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the poor generalization of protein-ligand binding affinity models, particularly when training data are limited. Specifically:

1. **Model Generalization**: Current binding affinity models generalize poorly in low data regimes, and existing benchmarks are not well suited to measuring this. The paper introduces a new split of the PDBBind dataset that minimizes similarity leakage between train and test sets, so that different architectures can be compared fairly.
2. **Model Architecture Comparison**: The study compares 3D global models (such as graph neural networks, GNNs) with protein-specific local models. On the low similarity split, 3D global models generally outperform local models in low data regimes.
3. **Novel Pre-training Strategies**: Two pre-training methods are proposed to improve GNN performance: supervised pre-training on quantum mechanical data and unsupervised pre-training via small molecule diffusion.
4. **Role of Hydrogen Atoms**: The paper examines the effect of explicitly modeling hydrogen atoms in the input graph and reports that GNN performance benefits from it.

In summary, the paper focuses on improving the generalization and performance of binding affinity models in low data regimes, contributing a low similarity benchmark split, two pre-training strategies, and explicit hydrogen modeling.
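To make the idea of a train/test split that minimizes similarity leakage more concrete, here is a minimal sketch. It does not reproduce the paper's procedure, which is not described here. It assumes each complex is represented by a hypothetical set of features (for example, fingerprint bits), uses Jaccard (Tanimoto) similarity as a stand-in similarity measure, and groups items by single-linkage clustering so that any two items at or above the threshold always land on the same side of the split. The names `jaccard`, `low_similarity_split`, `threshold`, and `test_fraction` are illustrative only.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard (Tanimoto) similarity between two feature sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def low_similarity_split(features: dict, threshold: float, test_fraction: float):
    """Split the keys of `features` so that no train/test pair has
    similarity >= threshold (assumed criterion, not the paper's)."""
    ids = sorted(features)
    parent = {i: i for i in ids}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Link every pair that is too similar to be separated by the split.
    for a, b in combinations(ids, 2):
        if jaccard(features[a], features[b]) >= threshold:
            parent[find(a)] = find(b)

    # Collect the resulting single-linkage clusters.
    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)

    # Fill the test set with the smallest clusters first,
    # never splitting a cluster across train and test.
    target = test_fraction * len(ids)
    train, test = [], []
    for members in sorted(clusters.values(), key=lambda m: (len(m), m)):
        if len(test) + len(members) <= target:
            test.extend(members)
        else:
            train.extend(members)
    return sorted(train), sorted(test)


# Toy data: c1/c2 and c3/c5 are similar pairs; c4 and c6 are unrelated.
toy = {
    "c1": {1, 2, 3, 4},
    "c2": {1, 2, 3, 5},
    "c3": {10, 11, 12},
    "c4": {20, 21},
    "c5": {10, 11, 13},
    "c6": {30, 31, 32},
}
train, test = low_similarity_split(toy, threshold=0.5, test_fraction=0.34)
print(train, test)
```

Because whole clusters are assigned to one side, the test set may end up smaller than `test_fraction` requests. The pairwise loop is quadratic in the number of items, so a dataset-scale version would likely need a faster similarity search.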