A compound-target pairs dataset: differences between drugs, clinical candidates and other bioactive compounds

A. Lina Heinzke, Barbara Zdrazil, Paul D. Leeson, Robert J. Young, Axel Pahl, Herbert Waldmann, Andrew R. Leach
DOI: https://doi.org/10.26434/chemrxiv-2024-vj70m-v2
2024-03-11
Abstract: Providing a better understanding of what makes a compound a successful drug candidate is crucial for reducing the high attrition rates in drug discovery. Analyses of the differences between active compounds, clinical candidates and drugs require high-quality datasets. However, most datasets of drug discovery programs are not openly available. This work introduces a dataset of compound-target pairs extracted from the open-source bioactivity database ChEMBL (release 32). Compound-target pairs in the dataset either have at least one measured activity or are part of the manually curated set of known interactions in ChEMBL. Known interactions between drugs or clinical candidates and targets are specifically annotated to facilitate analyses on differences between drugs, clinical candidates, and other active compounds. In total, the dataset comprises 614,594 compound-target pairs, 5,109 (3,932) of which are known interactions between drugs (clinical candidates) and targets. The extraction is performed in an automated manner and fully reproducible. We are providing not only the datasets but also the code to rerun the analyses with other ChEMBL releases.
Chemistry
What problem does this paper attempt to address?
This paper addresses the difficulty of understanding what distinguishes compounds that become successful drug candidates from other bioactive compounds, a question that matters for reducing attrition in drug discovery. The authors extract a dataset of compound-target pairs from release 32 of the open-source bioactivity database ChEMBL. Each pair in the dataset either has at least one measured activity or belongs to the manually curated, disease-related set of known interactions in ChEMBL. Known interactions between drugs or clinical candidates and their targets are specifically annotated to facilitate analyses of the differences between drugs, clinical candidates, and other active compounds. The dataset comprises 614,594 compound-target pairs in total, of which 5,109 are known drug-target interactions and 3,932 are known clinical candidate-target interactions. The extraction is automated and reproducible, so the dataset can be regenerated with each new ChEMBL release. The paper also discusses limitations arising from incomplete data and from biases introduced by scientific interests. Even so, the automatically generated dataset reflects the current state of knowledge and supports the exploration of relevant questions in drug discovery. The paper further analyzes the distribution of target classes and reports that about half of the targets are enzymes, kinases in particular. A subset of the dataset, BF_100_c_dt_d_dt, includes only targets with at least 100 active compounds and at least one known drug or clinical candidate interaction, enabling more focused analyses.
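The subset criteria described above (at least 100 active compounds per target and at least one known drug or clinical candidate interaction) can be illustrated with a minimal sketch. This is not the authors' extraction code: the record layout, the field names `compound_id`, `target_id` and `known_interaction`, and the function `filter_targets` are all hypothetical, and the sketch assumes that every pair in the input already counts as an active compound for its target.

```python
from collections import defaultdict


def filter_targets(pairs, min_compounds=100):
    """Keep pairs whose target has at least `min_compounds` distinct compounds
    and at least one pair flagged as a known drug or clinical candidate
    interaction. Field names are hypothetical, not the dataset's columns."""
    compounds_per_target = defaultdict(set)
    targets_with_known = set()
    for pair in pairs:
        compounds_per_target[pair["target_id"]].add(pair["compound_id"])
        if pair["known_interaction"] in ("drug", "clinical_candidate"):
            targets_with_known.add(pair["target_id"])
    keep = {
        target
        for target, compounds in compounds_per_target.items()
        if len(compounds) >= min_compounds and target in targets_with_known
    }
    return [pair for pair in pairs if pair["target_id"] in keep]


# Toy records (invented for illustration), with the threshold lowered to 2.
pairs = [
    {"compound_id": "C1", "target_id": "T1", "known_interaction": "drug"},
    {"compound_id": "C2", "target_id": "T1", "known_interaction": None},
    {"compound_id": "C3", "target_id": "T2", "known_interaction": None},
    {"compound_id": "C4", "target_id": "T2", "known_interaction": None},
    {"compound_id": "C5", "target_id": "T3", "known_interaction": "clinical_candidate"},
]
subset = filter_targets(pairs, min_compounds=2)
print(sorted({pair["target_id"] for pair in subset}))  # ['T1']
```

In the toy data, T2 is dropped because it has no known drug or clinical candidate interaction, and T3 is dropped because it has too few compounds. Only T1 meets both conditions.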