CompassDock: Comprehensive Accurate Assessment Approach for Deep Learning-Based Molecular Docking in Inference and Fine-Tuning

Ahmet Sarigun, Vedran Franke, Bora Uyar, Altuna Akalin
2024-10-01
Abstract: Datasets used for molecular docking, such as PDBBind, contain technical variability - they are noisy. Although the origins of the noise have been discussed, a comprehensive analysis of the physical, chemical, and bioactivity characteristics of the datasets is still lacking. To address this gap, we introduce the Comprehensive Accurate Assessment (Compass). Compass integrates two key components: PoseCheck, which examines ligand strain energy, protein-ligand steric clashes, and interactions, and AA-Score, a new empirical scoring function for calculating binding affinity energy. Together, these form a unified workflow that assesses both the physical/chemical properties and bioactivity favorability of ligands and protein-ligand interactions. Our analysis of the PDBBind dataset using Compass reveals substantial noise in the ground truth data. Additionally, we propose CompassDock, which incorporates the Compass module with DiffDock, the state-of-the-art deep learning-based molecular docking method, to enable accurate assessment of docked ligands during inference. Finally, we present a new paradigm for enhancing molecular docking model performance by fine-tuning with Compass Scores, which encompass binding affinity energy, strain energy, and the number of steric clashes identified by Compass. Our results show that, while fine-tuning without Compass improves the percentage of docked poses with RMSD < 2Å, it leads to a decrease in physical/chemical and bioactivity favorability. In contrast, fine-tuning with Compass shows a limited improvement in RMSD < 2Å but enhances the physical/chemical and bioactivity favorability of the ligand conformation. The source code is publicly available at https://github.com/BIMSBbioinfo/CompassDock.
Machine Learning, Quantitative Methods
What problem does this paper attempt to address?
The paper attempts to address the following issues:

1. **Dataset Noise Issue**: Existing molecular docking datasets (such as PDBBind) contain substantial technical variability, i.e., noise. This noise affects the training and prediction performance of models, especially with respect to the physical, chemical, and bioactivity properties of the predicted poses.
2. **Limitations of Existing Evaluation Metrics**: Molecular docking methods are typically evaluated with RMSD (Root Mean Square Deviation), but RMSD is based solely on distance and cannot fully reflect the physicochemical interactions and bioactivity of the ligand-protein complex. A low RMSD therefore does not guarantee that the bound pose is physically, chemically, or biologically favorable.
3. **Model Optimization Issue**: Existing deep learning-based molecular docking methods often focus only on improving distance metrics such as RMSD during fine-tuning and neglect physicochemical and bioactivity properties. As a result, RMSD may improve while the physicochemical and bioactivity favorability of the ligand pose actually decreases.

To address these issues, the paper proposes a comprehensive evaluation framework, **CompassDock**, which has two main modes:

- **Inference Mode**: During deep learning-based docking, the generated ligand conformations are evaluated for physicochemical and bioactivity properties by the integrated **Compass** module. The **Compass** module has two components: **PoseCheck**, which evaluates ligand strain energy, protein-ligand steric clashes, and interactions, and **AA-Score**, which calculates binding affinity energy.
- **Fine-Tuning Mode**: A new fine-tuning method introduces the **Compass Score** as part of the loss function so that the model is also optimized for physicochemical and bioactivity properties. The **Compass Score** combines binding affinity energy, strain energy, and the number of steric clashes, and is calculated using the **Log Absolute Normalized - Mean Square Error (LAN-MSE)** loss function to reduce the impact of outliers and improve the robustness of the model.

Through these two modes, **CompassDock** aims to provide a more comprehensive and accurate way to evaluate and optimize molecular docking, thereby improving the efficiency and accuracy of drug candidate screening in drug discovery.
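
Since the argument above turns on RMSD being a purely distance-based metric, a minimal Python sketch of how it and the "RMSD < 2Å" success rate are commonly computed may help. This is an illustration under stated assumptions, not the CompassDock implementation: the function names `ligand_rmsd` and `success_rate` are hypothetical, atoms are assumed to be listed in the same order in both poses, and no correction for molecular symmetry is applied.

```python
import numpy as np


def ligand_rmsd(pred_coords, ref_coords):
    """RMSD (in Å) between two poses of the same ligand.

    Assumes both arrays have shape (n_atoms, 3), share the same atom
    ordering, and are already in the protein's coordinate frame, so no
    superposition is performed. Symmetry-equivalent atoms are NOT handled.
    """
    pred = np.asarray(pred_coords, dtype=float)
    ref = np.asarray(ref_coords, dtype=float)
    if pred.shape != ref.shape or pred.ndim != 2 or pred.shape[1] != 3:
        raise ValueError("expected two arrays of identical shape (n_atoms, 3)")
    # Squared displacement of each atom, averaged over atoms, then rooted.
    squared_dist = np.sum((pred - ref) ** 2, axis=1)
    return float(np.sqrt(squared_dist.mean()))


def success_rate(rmsds, threshold=2.0):
    """Fraction of poses whose RMSD is strictly below `threshold` (Å)."""
    values = np.asarray(rmsds, dtype=float)
    return float(np.mean(values < threshold))


if __name__ == "__main__":
    # Toy two-atom "ligand"; coordinates are made up for illustration.
    reference = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]])
    near_pose = reference + np.array([0.0, 0.0, 1.0])  # shifted by 1 Å
    far_pose = reference + np.array([0.0, 0.0, 3.0])   # shifted by 3 Å

    rmsds = [ligand_rmsd(near_pose, reference), ligand_rmsd(far_pose, reference)]
    print("RMSDs:", rmsds)
    print("Fraction with RMSD < 2 Å:", success_rate(rmsds))
```

Nothing in this calculation looks at bond geometry, overlap with protein atoms, or binding energy, which is why a pose can pass the 2 Å threshold and still be strained or clash with the receptor.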