Enabling Open Machine Learning of DNA Encoded Library Selections to Accelerate the Discovery of Small Molecule Protein Binders

Rafael Couñago, James Wellnitz, Shabbir Ahmad, Nabin Begale, Jermiah Joseph, Hong Zeng, Albina Bolotokova, Aiping Dong, Shaghayegh Reza, Pegah Ghiabi, Gibson Elisa, Xuemin Cheng, Guiping Tu, Xianyang Li, Jian Liu, Dengfeng Dou, Rachel J. Harding, Aled M. Edwards, Benjamin Haibe-Kains, Levon Halabelian, Alexander Tropsha, Jin Li
DOI: https://doi.org/10.26434/chemrxiv-2024-xd385
2024-10-18
Abstract: Recent advances in DNA-encoded library (DEL) screening have created bioactivity datasets containing billions of molecules, unlocking new opportunities for machine learning (ML) in drug discovery. However, most ultra-large DEL libraries are proprietary, limiting the advancement of ML tools for big chemical data analytics and hindering the democratization of DEL-ML technology. We address this gap by developing an open, end-to-end DEL-ML framework using public datasets, where enriched binders are represented by common chemical fingerprints, ensuring proprietary data protection. We demonstrate that ML models can be built and validated on fingerprinted DEL data and then applied to virtual screening (VS) of billion-sized, publicly accessible chemical libraries. As a proof-of-concept, we screened the human protein WDR91 using the HitGen OpenDEL library (3 billion molecules) and trained ML models, which were used to screen the Enamine REAL Space library (37 billion molecules). Fifty potential binders were identified, 48 of which were tested, and seven were confirmed as novel binders with dissociation constants (KD) from 2.7 to 21 μM that were successfully co-crystallized with WDR91. This fully automated, open-source workflow demonstrates the potential of DEL-ML models in discovering novel binders and promotes the use of open chemical bioactivity datasets and ML to accelerate drug discovery.
Chemistry
What problem does this paper attempt to address?
This paper addresses a key challenge in small-molecule drug discovery: how to efficiently identify "hit" compounds suitable for further chemical optimization. It does so with an open, end-to-end framework that combines DNA-encoded library (DEL) screening with machine learning (ML) and is built on publicly available datasets. Because most ultra-large DEL datasets are proprietary, the development of ML tools for this kind of data has been limited; the framework is meant to remove that barrier and promote the democratization of DEL-ML technology. The main contributions of the paper are:
1. **An open DEL-ML framework**: The framework uses public datasets in which enriched binders are represented by common chemical fingerprints rather than explicit structures, which protects proprietary data while still allowing ML models to be trained and validated.
2. **ML models applied to virtual screening**: The paper shows that models built and validated on fingerprinted DEL data can be applied to virtual screening of publicly accessible chemical libraries containing billions of molecules.
3. **A practical case study**: As a proof of concept, the human protein WDR91 was screened against HitGen's OpenDEL library (3 billion molecules), ML models were trained on the results, and the models were used to screen the Enamine REAL Space library (37 billion molecules). Of the 50 potential binders identified, 48 were tested and 7 were confirmed as novel binders, with dissociation constants (\(K_D\)) ranging from 2.7 to 21 μM; these were successfully co-crystallized with WDR91.
Together, these results demonstrate the potential of DEL-ML models for discovering new binders and promote the use of open chemical bioactivity datasets and machine learning to accelerate drug discovery.
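The train-on-fingerprints, then rank-a-library idea summarized above can be illustrated with a small, self-contained sketch. Everything in it is an assumption made for illustration: the 8-bit toy fingerprints, the compound names, and the choice of a Bernoulli naive Bayes classifier are hypothetical and are not taken from the paper, whose actual models and fingerprint types are not described in this summary. Only the counts used in the final hit-rate calculation (48 tested, 7 confirmed) come from the abstract.

```python
import math
from typing import Dict, List, Sequence

Fingerprint = Sequence[int]  # fixed-length vector of 0/1 bits (toy stand-in)


def train_bernoulli_nb(fps: List[Fingerprint], labels: List[int]) -> Dict[str, dict]:
    """Fit a Bernoulli naive Bayes model with Laplace smoothing.

    labels: 1 = enriched in the DEL selection, 0 = not enriched.
    Both classes must be present in the training data.
    """
    n_bits = len(fps[0])
    model: Dict[str, dict] = {"log_prior": {}, "log_on": {}, "log_off": {}}
    for cls in (0, 1):
        rows = [fp for fp, y in zip(fps, labels) if y == cls]
        model["log_prior"][cls] = math.log(len(rows) / len(fps))
        # Smoothed probability that each bit is set within this class.
        p_on = [(sum(fp[i] for fp in rows) + 1) / (len(rows) + 2) for i in range(n_bits)]
        model["log_on"][cls] = [math.log(p) for p in p_on]
        model["log_off"][cls] = [math.log(1.0 - p) for p in p_on]
    return model


def binder_log_odds(model: Dict[str, dict], fp: Fingerprint) -> float:
    """Log-odds that a fingerprint belongs to the enriched class."""
    def joint(cls: int) -> float:
        total = model["log_prior"][cls]
        for bit, on, off in zip(fp, model["log_on"][cls], model["log_off"][cls]):
            total += on if bit else off
        return total
    return joint(1) - joint(0)


def screen(model: Dict[str, dict], library: Dict[str, Fingerprint], top_k: int) -> List[str]:
    """Rank library compounds by predicted log-odds and return the top_k names."""
    ranked = sorted(library, key=lambda name: binder_log_odds(model, library[name]), reverse=True)
    return ranked[:top_k]


# Hypothetical training data: only fingerprints and labels, no structures.
train_fps = [
    [1, 1, 0, 0, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0, 0, 0], [1, 0, 0, 0, 1, 0, 0, 0],  # enriched
    [0, 0, 1, 0, 0, 1, 1, 0], [0, 0, 1, 0, 0, 1, 0, 1], [0, 1, 1, 0, 0, 0, 1, 0],  # not enriched
]
train_labels = [1, 1, 1, 0, 0, 0]

# Hypothetical screening library (stand-in for a large purchasable library).
library = {
    "cmpd_A": [1, 1, 0, 0, 0, 0, 0, 0],
    "cmpd_B": [0, 0, 1, 0, 0, 1, 1, 0],
    "cmpd_C": [1, 0, 0, 0, 1, 0, 0, 0],
}

model = train_bernoulli_nb(train_fps, train_labels)
top_hits = screen(model, library, top_k=2)

# Experimental hit rate implied by the counts reported in the abstract.
hit_rate = 7 / 48

if __name__ == "__main__":
    print("Top-ranked compounds:", top_hits)
    print(f"Confirmed hit rate: {hit_rate:.1%}")
```

The sketch never touches a chemical structure: the model sees only bit vectors and labels, which is the property that lets fingerprinted DEL data be shared without exposing proprietary library members. A real pipeline would replace the toy vectors with fingerprints computed by a cheminformatics toolkit and would score the library in batches rather than in memory.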