DEL-Ranking: Ranking-Correction Denoising Framework for Elucidating Molecular Affinities in DNA-Encoded Libraries

Hanqun Cao, Chunbin Gu, Mutian He, Ning Ma, Chang-yu Hsieh, Pheng-Ann Heng
2024-10-19
Abstract: DNA-encoded library (DEL) screening has revolutionized the detection of protein-ligand interactions through read counts, enabling rapid exploration of vast chemical spaces. However, noise in read counts, stemming from nonspecific interactions, can mislead this exploration process. We present DEL-Ranking, a novel distribution-correction denoising framework that addresses these challenges. Our approach introduces two key innovations: (1) a novel ranking loss that rectifies relative magnitude relationships between read counts, enabling the learning of causal features determining activity levels, and (2) an iterative algorithm employing self-training and consistency loss to establish model coherence between activity label and read count predictions. Furthermore, we contribute three new DEL screening datasets, the first to comprehensively include multi-dimensional molecular representations, protein-ligand enrichment values, and their activity labels. These datasets mitigate data scarcity issues in AI-driven DEL screening research. Rigorous evaluation on diverse DEL datasets demonstrates DEL-Ranking's superior performance across multiple correlation metrics, with significant improvements in binding affinity prediction accuracy. Our model exhibits zero-shot generalization ability across different protein targets and successfully identifies potential motifs determining compound binding affinity. This work advances DEL screening analysis and provides valuable resources for future research in this area.
Machine Learning, Artificial Intelligence, Biomolecules
What problem does this paper attempt to address?
This paper addresses noise in read counts from DNA-encoded library (DEL) screening, particularly noise caused by nonspecific interactions. Such noise can lead to misjudging the binding affinity between compounds and proteins, which in turn hampers lead compound identification in drug discovery. Specifically, existing methods have two key deficiencies when handling DEL data:

1. **Ignoring the inherent ordering of read counts**: Read counts matter not only as absolute values; their relative magnitudes also carry important information.
2. **Failing to fully use true activity labels to correct systematic biases**: Introducing activity labels allows systematic errors in read counts to be better adjusted.

To address these problems, the authors propose a framework named DEL-Ranking, with the following contributions:

1. **A new ranking loss**: By correcting the relative magnitude relationships between read counts, the model can learn the causal features that determine activity level.
2. **An iterative algorithm**: It combines self-training with a consistency loss to keep activity label predictions and read count predictions coherent.
3. **Datasets with multi-dimensional molecular representations**: Three new DEL screening datasets are provided. They include two-dimensional and three-dimensional molecular structures, protein-ligand enrichment values, and activity labels, which addresses data scarcity and incompleteness in existing datasets.

With these improvements, DEL-Ranking performs well on multiple correlation metrics, with a significant improvement in binding affinity prediction accuracy. In addition, the model demonstrates zero-shot generalization across different protein targets and identifies potential motifs that determine compound binding affinity.

In summary, this paper aims to improve the accuracy of binding affinity prediction from DEL screening data through the DEL-Ranking framework, thereby accelerating drug discovery and improving lead compound identification.
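
To make the idea of a ranking loss on read counts more concrete, here is a minimal sketch of a generic pairwise logistic ranking loss in plain Python. It is not the loss defined in the DEL-Ranking paper, whose exact form is not given in this summary. The function name `pairwise_ranking_loss`, the `margin` parameter, and the toy values are all assumptions made for illustration. The sketch only shows how a loss can depend on the relative order of read counts rather than on their absolute values.

```python
import math
from itertools import combinations


def pairwise_ranking_loss(predicted, observed, margin=0.0):
    """Mean logistic loss over pairs whose observed read counts differ.

    Hypothetical illustration, not the paper's loss. For each pair where one
    compound has a larger observed read count, the loss penalizes predictions
    that fail to score that compound higher by at least `margin`.
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")

    total, n_pairs = 0.0, 0
    for i, j in combinations(range(len(observed)), 2):
        if observed[i] == observed[j]:
            continue  # ties carry no ordering information
        # Orient the pair so that `hi` has the larger observed read count.
        hi, lo = (i, j) if observed[i] > observed[j] else (j, i)
        diff = predicted[hi] - predicted[lo] - margin
        # softplus(-diff) = log(1 + exp(-diff)), written in a stable form.
        total += max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))
        n_pairs += 1
    # With no comparable pairs there is nothing to rank, so return 0.0.
    return total / n_pairs if n_pairs else 0.0


# Toy read counts for three compounds (made-up values).
observed_counts = [10, 3, 0]
loss_same_order = pairwise_ranking_loss([2.0, 1.0, 0.0], observed_counts)
loss_reversed = pairwise_ranking_loss([0.0, 1.0, 2.0], observed_counts)
print(loss_same_order < loss_reversed)
```

Predictions that preserve the observed order receive a lower loss than predictions that reverse it, regardless of how far the predicted values are from the raw counts. This is the property that makes ranking-style objectives less sensitive to noise in absolute read counts, under the assumption that the relative order is more reliable than the magnitudes.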