Abstract: DNA-encoded library (DEL) screening has revolutionized the detection of protein-ligand interactions through read counts, enabling rapid exploration of vast chemical spaces. However, noise in read counts, stemming from nonspecific interactions, can mislead this exploration process. We present DEL-Ranking, a novel distribution-correction denoising framework that addresses these challenges. Our approach introduces two key innovations: (1) a novel ranking loss that rectifies relative magnitude relationships between read counts, enabling the learning of causal features determining activity levels, and (2) an iterative algorithm employing self-training and consistency loss to establish model coherence between activity label and read count predictions. Furthermore, we contribute three new DEL screening datasets, the first to comprehensively include multi-dimensional molecular representations, protein-ligand enrichment values, and their activity labels. These datasets mitigate data scarcity issues in AI-driven DEL screening research. Rigorous evaluation on diverse DEL datasets demonstrates DEL-Ranking's superior performance across multiple correlation metrics, with significant improvements in binding affinity prediction accuracy. Our model exhibits zero-shot generalization ability across different protein targets and successfully identifies potential motifs determining compound binding affinity. This work advances DEL screening analysis and provides valuable resources for future research in this area.

What problem does this paper attempt to address?

This paper addresses noise in read counts from DNA-encoded library (DEL) screening, particularly noise caused by nonspecific interactions. Such noise can lead to misjudged binding affinity between compounds and proteins, which in turn hampers lead compound identification in drug discovery. Specifically, existing methods for DEL data have two key deficiencies:

1. **Ignoring the inherent ordering of read counts**: Read counts carry information not only in their absolute values but also in their relative magnitudes.
2. **Failing to fully use ground-truth activity labels to correct systematic biases**: Incorporating activity labels allows systematic errors in read counts to be adjusted more effectively.

To address these problems, the authors propose a framework named DEL-Ranking, with the following contributions:

1. **A new ranking loss**: By correcting the relative magnitude relationships between read counts, the model can learn the causal features that determine activity level.
2. **An iterative algorithm**: Self-training is combined with a consistency loss to keep activity label predictions and read count predictions consistent with each other.
3. **New datasets with multi-dimensional molecular representations**: Three new DEL screening datasets are provided. They include two-dimensional and three-dimensional molecular structures as well as protein-ligand enrichment values and their activity labels, addressing the scarcity and incompleteness of existing datasets.

With these improvements, DEL-Ranking performs well across multiple correlation metrics and shows a significant improvement in binding affinity prediction accuracy. The model also demonstrates zero-shot generalization across different protein targets and successfully identifies potential motifs that determine compound binding affinity.

In summary, this paper aims to improve the accuracy of binding affinity prediction from DEL screening data by proposing the DEL-Ranking framework, thereby accelerating drug discovery and improving lead compound identification.
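To make the idea of learning from the relative order of read counts concrete, the following is a minimal sketch of a generic pairwise logistic ranking loss. It is an illustration only: the summary above does not give the formula for the DEL-Ranking loss, so the function name `pairwise_ranking_loss`, its arguments, and the choice of a logistic penalty are assumptions, not the authors' formulation.

```python
import math


def pairwise_ranking_loss(pred, read_counts):
    """Mean logistic loss over all pairs whose observed read counts differ.

    Hypothetical illustration, not the loss from the paper. For each pair
    (i, j) with read_counts[i] > read_counts[j], the term
    log(1 + exp(-(pred[i] - pred[j]))) is small when the model also
    scores compound i above compound j.
    """
    if len(pred) != len(read_counts):
        raise ValueError("pred and read_counts must have the same length")

    total, n_pairs = 0.0, 0
    for i in range(len(pred)):
        for j in range(len(pred)):
            if read_counts[i] > read_counts[j]:
                margin = pred[i] - pred[j]
                # Numerically stable form of log(1 + exp(-margin)).
                total += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
                n_pairs += 1

    # Tied read counts contribute no pairs; return 0.0 if nothing is comparable.
    return total / n_pairs if n_pairs else 0.0


# Toy example: three compounds with made-up read counts.
counts = [120, 15, 3]
aligned = pairwise_ranking_loss([2.0, 0.5, -1.0], counts)   # same order as counts
reversed_ = pairwise_ranking_loss([-1.0, 0.5, 2.0], counts)  # opposite order
print(aligned < reversed_)
```

Because the loss depends only on which compound has the larger count in each pair, it is insensitive to the absolute scale of the read counts, which is the property the summary attributes to a ranking-based objective.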

DEL-Ranking: Ranking-Correction Denoising Framework for Elucidating Molecular Affinities in DNA-Encoded Libraries
