Testing the predictive power of reverse screening to infer drug targets, with the help of machine learning

Antoine Daina, Vincent Zoete
DOI: https://doi.org/10.1038/s42004-024-01179-2
IF: 7.211
2024-05-10
Communications Chemistry
Abstract: Estimating protein targets of compounds based on the similarity principle (similar molecules are likely to show comparable bioactivity) is a long-standing strategy in drug research. Having previously quantified this principle, we present here a large-scale evaluation of its predictive power for inferring macromolecular targets by reverse screening an unprecedentedly vast external test set of more than 300,000 active small molecules against another bioactivity set of more than 500,000 compounds. We show that machine learning can predict the correct targets, with the highest probability among 2069 proteins, for more than 51% of the external molecules. The strong enrichment thus obtained demonstrates its usefulness in supporting phenotypic screens, polypharmacology, or repurposing. Moreover, we quantified the impact of the bioactivity knowledge available for proteins in terms of number and diversity of actives. Finally, we advise that developers of such approaches follow an application-oriented benchmarking strategy and use large, high-quality, non-overlapping datasets as provided here.
chemistry, multidisciplinary
What problem does this paper attempt to address?
The paper addresses drug target prediction: using reverse screening combined with machine learning to infer the protein targets of small molecules. The authors ran a large-scale evaluation in which an external test set of more than 300,000 active small molecules was reverse screened against a separate bioactivity set of more than 500,000 compounds. For more than 51% of the external molecules, the machine-learning model assigned the highest probability, among 2,069 proteins, to a correct target. This strong enrichment makes the approach useful for supporting phenotypic screens, polypharmacology, and drug repurposing. The study also quantified how the bioactivity knowledge available for each protein, in terms of the number and diversity of its known actives, affects prediction performance. Finally, the authors advise developers of such approaches to follow an application-oriented benchmarking strategy and to use large, high-quality, non-overlapping datasets such as those provided with the paper. The paper concludes by highlighting the broad potential applications of such approaches in drug research, biology, and chemistry.
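To make the benchmarking idea concrete, here is a minimal sketch of similarity-based reverse screening and of the "correct target ranked first" metric. It is an illustration under stated assumptions, not the authors' method: the score is a plain maximum Tanimoto similarity rather than the paper's machine-learning model, molecules are represented as hypothetical sets of feature indices, and all names (`tanimoto`, `rank_targets`, `top1_hit_rate`, the toy targets) are invented for this example.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

# Assumption: a molecule is a set of "on" feature indices (a toy fingerprint).
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto (Jaccard) similarity between two feature sets."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def rank_targets(
    query: Fingerprint,
    actives_by_target: Dict[str, List[Fingerprint]],
) -> List[Tuple[str, float]]:
    """Score each target by its most similar known active, best first.

    Stand-in for a learned probability: the paper's model is not reproduced.
    """
    scores = {
        target: max((tanimoto(query, fp) for fp in actives), default=0.0)
        for target, actives in actives_by_target.items()
    }
    # Sort by descending score; break ties by target name for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def top1_hit_rate(
    test_set: List[Tuple[Fingerprint, Set[str]]],
    actives_by_target: Dict[str, List[Fingerprint]],
) -> float:
    """Fraction of test molecules whose top-ranked target is a known one."""
    hits = 0
    for query, true_targets in test_set:
        ranked = rank_targets(query, actives_by_target)
        if ranked and ranked[0][0] in true_targets:
            hits += 1
    return hits / len(test_set) if test_set else 0.0


# Toy reference set: known actives per (hypothetical) target.
reference = {
    "kinase_A": [frozenset({1, 2, 3, 4}), frozenset({2, 3, 5})],
    "gpcr_B": [frozenset({10, 11, 12})],
    "protease_C": [frozenset({20, 21}), frozenset({1, 20, 22})],
}

# Toy external test set: (molecule, set of experimentally known targets).
# No test molecule is annotated in the reference set for its own target.
test_set = [
    (frozenset({1, 2, 3}), {"kinase_A"}),
    (frozenset({10, 11}), {"gpcr_B"}),
    (frozenset({1, 20, 22}), {"kinase_A"}),  # looks like a protease_C active
]

rate = top1_hit_rate(test_set, reference)
print(f"top-1 hit rate: {rate:.2f}")  # 2 of the 3 toy queries are recovered
```

In a realistic benchmark the test molecules would be kept strictly separate from the reference actives, which is the non-overlap requirement the summary emphasizes; scoring against a reference set that already contains the query would inflate the hit rate.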