Abstract: Recent advances in DNA-encoded library (DEL) screening have created bioactivity datasets containing billions of molecules, unlocking new opportunities for machine learning (ML) in drug discovery. However, most ultra-large DEL libraries are proprietary, limiting the advancement of ML tools for big chemical data analytics and hindering the democratization of DEL-ML technology. We address this gap by developing an open, end-to-end DEL-ML framework using public datasets, where enriched binders are represented by common chemical fingerprints, ensuring proprietary data protection. We demonstrate that ML models can be built and validated on fingerprinted DEL data and then applied to virtual screening (VS) of billion-sized, publicly accessible chemical libraries. As a proof-of-concept, we screened the human protein WDR91 using the HitGen OpenDEL library (3 billion molecules) and trained ML models, which were used to screen the Enamine REAL Space library (37 billion molecules). Fifty potential binders were identified, 48 of which were tested, and seven were confirmed as novel binders with dissociation constants (KD) from 2.7 to 21 μM that were successfully co-crystallized with WDR91. This fully automated, open-source workflow demonstrates the potential of DEL-ML models in discovering novel binders and promotes the use of open chemical bioactivity datasets and ML to accelerate drug discovery.

What problem does this paper attempt to address?

This paper addresses a key challenge in small-molecule drug discovery: how to efficiently identify "hit" compounds suitable for further chemical optimization. Specifically, it aims to accelerate the discovery of small-molecule protein binders with an open, end-to-end framework that combines DNA-encoded library (DEL) screening with machine learning (ML) and relies on publicly available datasets. Most ultra-large DEL datasets are proprietary, which has limited the development of ML tools for this kind of data; the framework is intended to remove that barrier and to promote the democratization of DEL-ML technology.

The main contributions of the paper include:

1. **Developed an open DEL-ML framework**: The framework uses public datasets in which enriched binders are represented by common chemical fingerprints, which protects proprietary data while still allowing ML models to be trained and validated.
2. **Demonstrated the application of ML models in virtual screening**: The paper shows how to build and validate ML models on fingerprinted DEL data and apply them to virtual screening of publicly accessible chemical libraries containing billions of molecules.
3. **Practical case study**: As a proof-of-concept, the paper screens the human protein WDR91 with HitGen's OpenDEL library (approximately 3 billion molecules), trains ML models on the results, and uses them for virtual screening of the Enamine REAL Space library (approximately 37 billion molecules). Of the 50 potential binders identified, 48 were tested and 7 were confirmed as novel binders, with dissociation constants (\(K_D\)) ranging from 2.7 to 21 μM, and were successfully co-crystallized with WDR91.

Through these efforts, the paper demonstrates the potential of DEL-ML models in discovering new binders and promotes the use of open chemical bioactivity datasets and machine learning to accelerate drug discovery.
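The train-on-fingerprints, then screen-a-larger-library pattern described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the text here does not say which fingerprint or model the authors used, so `toy_fingerprint` (hashed character k-grams of a SMILES string) stands in for a real chemical fingerprint, a small Bernoulli naive Bayes classifier stands in for the ML model, and the SMILES strings and labels are invented toy data, not results from the paper. The point it shows is that training consumes only bit sets and enrichment labels, never the underlying structures.

```python
import hashlib
import math

N_BITS = 2048  # assumed fingerprint length, chosen for illustration


def toy_fingerprint(smiles, n_bits=N_BITS, k=3):
    """Hash overlapping character k-grams of a SMILES string to bit positions.

    Hypothetical stand-in for a real chemical fingerprint; it is not
    chemically meaningful and is used only to keep the sketch dependency-free.
    """
    bits = set()
    for i in range(max(1, len(smiles) - k + 1)):
        digest = hashlib.sha256(smiles[i:i + k].encode()).digest()
        bits.add(int.from_bytes(digest[:4], "big") % n_bits)
    return frozenset(bits)


def train_bernoulli_nb(fingerprints, labels, n_bits=N_BITS):
    """Fit a Bernoulli naive Bayes model from bit sets and 0/1 labels only."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_counts = [0] * n_bits
    neg_counts = [0] * n_bits
    for fp, label in zip(fingerprints, labels):
        counts = pos_counts if label else neg_counts
        for bit in fp:
            counts[bit] += 1
    # Laplace smoothing keeps every probability strictly between 0 and 1.
    p_pos = [(c + 1) / (n_pos + 2) for c in pos_counts]
    p_neg = [(c + 1) / (n_neg + 2) for c in neg_counts]
    log_prior = math.log((n_pos + 1) / (n_neg + 1))
    return p_pos, p_neg, log_prior


def score(fp, model):
    """Log-odds that a fingerprint belongs to the enriched (binder) class."""
    p_pos, p_neg, log_prior = model
    total = log_prior
    for bit in range(len(p_pos)):
        if bit in fp:
            total += math.log(p_pos[bit] / p_neg[bit])
        else:
            total += math.log((1 - p_pos[bit]) / (1 - p_neg[bit]))
    return total


def virtual_screen(library_smiles, model, top_n):
    """Fingerprint each library molecule, score it, and return the top_n."""
    scored = [(score(toy_fingerprint(s), model), s) for s in library_smiles]
    scored.sort(reverse=True)
    return [s for _, s in scored[:top_n]]


# Invented toy data: "enriched" molecules share an anilide-like motif.
binders = ["CC(=O)Nc1ccccc1", "O=C(Nc1ccccc1)C2CC2", "CN(C)C(=O)Nc1ccccc1"]
non_binders = ["CCCCCCCC", "CCOCCOCC", "CCCCO"]

# Only fingerprints and labels reach the training step.
train_fps = [toy_fingerprint(s) for s in binders + non_binders]
train_labels = [1] * len(binders) + [0] * len(non_binders)
model = train_bernoulli_nb(train_fps, train_labels)

# A stand-in "screening library" of molecules not seen during training.
library = ["CCCCCCO", "O=C(Nc1ccccc1)CC", "CCOCC"]
top_hits = virtual_screen(library, model, top_n=1)
print(top_hits)
```

In a real setting the fingerprint function would be replaced by a standard chemical fingerprint and the classifier by whatever model validates best on held-out DEL data; the structure of the loop (fingerprint, train, score, rank, pick the top candidates for testing) is the part this sketch is meant to convey.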

Enabling Open Machine Learning of DNA Encoded Library Selections to Accelerate the Discovery of Small Molecule Protein Binders
