Lessons learned during the journey of data: from experiment to model for predicting kinase affinity, selectivity, polypharmacology, and resistance

Raquel López-Ríos de Castro, Jaime Rodríguez-Guerra, David Schaller, Talia B. Kimber, Corey Taylor, Jessica B. White, Michael Backenköhler, Alexander Payne, Ben Kaminow, Iván Pulido, Sukrit Singh, Paula Linh Kramer, Guillermo Pérez-Hernández, Andrea Volkamer, John D. Chodera
DOI: https://doi.org/10.1101/2024.09.10.612176
2024-09-10
Abstract: Recent advances in machine learning (ML) are reshaping drug discovery. Structure-based ML methods use physically-inspired models to predict binding affinities from protein:ligand complexes. These methods promise to enable the integration of data for many related targets, which addresses issues related to data scarcity for single targets and could enable generalizable predictions for a broad range of targets, including mutants. In this work, we report our experiences in building KinoML, a novel framework for ML in target-based small molecule drug discovery with an emphasis on structure-enabled methods. KinoML focuses currently on kinases as the relative structural conservation of this protein superfamily, particularly in the kinase domain, means it is possible to leverage data from the entire superfamily to make structure-informed predictions about binding affinities, selectivities, and drug resistance. Some key lessons learned in building KinoML include: the importance of reproducible data collection and deposition, the harmonization of molecular data and featurization, and the choice of the right data format to ensure reusability and reproducibility of ML models. As a result, KinoML allows users to easily achieve three tasks: accessing and curating molecular data; featurizing this data with representations suitable for ML applications; and running reproducible ML experiments that require access to ligand, protein, and assay information to predict ligand affinity. Despite KinoML focusing on kinases, this framework can be applied to other proteins. The lessons reported here can help guide the development of platforms for structure-enabled ML in other areas of drug discovery.
Biophysics
What problem does this paper attempt to address?
The paper addresses how to collect, process, and apply kinase-related data in structure-based machine learning (ML) methods so as to improve predictions of binding affinity, selectivity, polypharmacology, and resistance in drug discovery. Specifically:

1. **Constructing the KinoML Framework**: The paper introduces KinoML, a new ML framework for small molecule drug discovery that emphasizes structure-based methods and currently targets the kinase protein family. Because kinase structure is relatively conserved, particularly in the kinase domain, KinoML can integrate data from across the kinase superfamily to make structure-informed predictions of binding affinity, selectivity, and drug resistance.
2. **Overcoming Data Challenges**: Despite the large amount of kinase-related data available, organizing it and ensuring its accuracy and reproducibility remains a significant challenge. The paper discusses how to acquire and curate data from different sources and emphasizes adherence to the FAIR principles (Findable, Accessible, Interoperable, and Reusable).
3. **Comparison of Ligand-Based and Structure-Based Methods**: The paper compares ligand-based with structure-based methods in kinase drug discovery, noting that structure-based methods may generalize better because they can integrate information about related targets, especially when dealing with mutants.
4. **Lessons Learned**: The authors share key lessons from the development process, including reproducible data collection and deposition, harmonization of molecular data and featurization, and the choice of appropriate data formats. These lessons apply to kinase research and can be extended to other areas of drug discovery.
In summary, the paper addresses data processing and model training issues in structure-based kinase drug discovery by constructing a modular, extensible ML framework, and it offers practical solutions and lessons learned.
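
The three tasks named in the abstract (accessing and curating data, featurizing it, and running an ML experiment) can be pictured as a small modular pipeline. The sketch below is only an illustration under stated assumptions: it does not use the KinoML API, and every name in it (`Measurement`, `curate`, `featurize`, `predict_1nn`), the toy feature set, and the affinity values are hypothetical. A real workflow would use chemically and structurally meaningful featurizers and a proper model.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass(frozen=True)
class Measurement:
    """One assay record: a ligand, a kinase target, and an affinity value.

    All fields are hypothetical placeholders, not a KinoML data model.
    """
    ligand_smiles: str
    kinase: str
    affinity: Optional[float]  # None marks a missing measurement


def curate(records: List[Measurement]) -> List[Measurement]:
    """Drop records without a value and average replicates per (ligand, kinase)."""
    grouped = {}
    for r in records:
        if r.affinity is None:
            continue
        grouped.setdefault((r.ligand_smiles, r.kinase), []).append(r.affinity)
    # Sort the keys so the curated output is deterministic across runs.
    return [Measurement(s, k, mean(v)) for (s, k), v in sorted(grouped.items())]


def featurize(record: Measurement, kinases: List[str]) -> List[float]:
    """Toy features: one-hot kinase identity plus SMILES string length."""
    one_hot = [1.0 if record.kinase == k else 0.0 for k in kinases]
    return one_hot + [float(len(record.ligand_smiles))]


def predict_1nn(train_x, train_y, x):
    """Return the label of the training point closest to x (squared distance)."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    best = min(range(len(train_x)), key=lambda i: sq_dist(train_x[i], x))
    return train_y[best]


# Made-up toy data: one replicate pair, one missing value.
raw = [
    Measurement("CCO", "ABL1", 6.0),
    Measurement("CCO", "ABL1", 7.0),
    Measurement("c1ccccc1", "EGFR", 5.0),
    Measurement("CCN", "EGFR", None),
]

curated = curate(raw)
kinases = sorted({r.kinase for r in curated})
X = [featurize(r, kinases) for r in curated]
y = [r.affinity for r in curated]

query = Measurement("CCC", "ABL1", None)
prediction = predict_1nn(X, y, featurize(query, kinases))
```

Keeping curation, featurization, and modeling as separate functions means each stage can be swapped or rerun on its own, which is the kind of modularity the summary attributes to the framework.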