DrugGym: A testbed for the economics of autonomous drug discovery

Michael Retchin, Yuanqing Wang, Kenichiro Takaba, John D Chodera
DOI: https://doi.org/10.1101/2024.05.28.596296
2024-06-02
Abstract: Drug discovery is stochastic. The effectiveness of candidate compounds in satisfying design objectives is unknown ahead of time, and the tools used for prioritization (predictive models and assays) are inaccurate and noisy. In a typical discovery campaign, thousands of compounds may be synthesized and tested before design objectives are achieved, with many others ideated but deprioritized. These challenges are well-documented, but assessing potential remedies has been difficult. We introduce DrugGym, a framework for modeling the stochastic process of drug discovery. Emulating biochemical assays with realistic surrogate models, we simulate the progression from weak hits to sub-micromolar leads with viable ADME. We use this testbed to examine how different ideation, scoring, and decision-making strategies impact statistical measures of utility, such as the probability of program success within predefined budgets and the expected costs to achieve target candidate profile (TCP) goals. We also assess the influence of affinity model inaccuracy, chemical creativity, batch size, and multi-step reasoning. Our findings suggest that reducing affinity model inaccuracy from 2 to 0.5 pIC50 units improves budget-constrained success rates tenfold. DrugGym represents a realistic testbed for machine learning methods applied to the hit-to-lead phase. Source code is available at www.drug-gym.org.
Biophysics
What problem does this paper attempt to address?
This paper asks how the economics of drug discovery can be improved, in particular how to raise success rates and lower costs under fixed budget and time constraints. Drug discovery is a complex, uncertain process in which many compounds are designed, synthesized, and assayed before one satisfies the target candidate profile (TCP), which makes it inefficient and costly. DrugGym is a framework that models this stochastic process: it emulates biochemical assays with realistic surrogate models and simulates the progression from weak hits to sub-micromolar leads with viable ADME properties. The framework makes it possible to study how different ideation, scoring, and decision-making strategies affect statistical measures of utility, such as the probability of program success within a predefined budget and the expected cost of achieving TCP goals. The paper also uses DrugGym to assess the influence of affinity model inaccuracy, chemical creativity, batch size, and multi-step reasoning. For example, its findings suggest that reducing affinity model inaccuracy from 2 to 0.5 pIC50 units improves budget-constrained success rates tenfold. DrugGym thus offers a realistic testbed for machine learning methods applied to the hit-to-lead phase of drug discovery, letting researchers compare decision-making strategies and look for ways to improve efficiency and reduce cost.
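To make the two utility measures concrete, here is a minimal toy Monte Carlo sketch in Python. It is not DrugGym's code or API: the function names (`run_campaign`, `estimate_utility`), the Gaussian noise model, and every numeric default are assumptions chosen only for illustration. Each simulated campaign repeats an ideate, score, select, assay loop and stops when a measured pIC50 reaches the target or the budget runs out.

```python
import random
import statistics


def run_campaign(model_sigma, rng, budget=200, batch_size=10, pool_size=100,
                 start_pic50=5.0, target_pic50=7.0,
                 analog_sigma=0.5, assay_sigma=0.3):
    """Toy campaign (hypothetical, not DrugGym's implementation).

    Returns the number of compounds assayed when the target is first
    reached, or None if the budget is exhausted before that happens.
    """
    lead_true = start_pic50      # hidden true potency of the current lead
    lead_measured = start_pic50  # best assay readout seen so far
    spent = 0
    while spent + batch_size <= budget:
        # Ideate: analogs whose true potency scatters around the lead.
        true_values = [lead_true + rng.gauss(0.0, analog_sigma)
                       for _ in range(pool_size)]
        # Score: the affinity model sees the truth plus Gaussian error.
        predicted = [(t + rng.gauss(0.0, model_sigma), t) for t in true_values]
        # Decide: send the top-predicted compounds to the assay.
        predicted.sort(reverse=True)
        batch = [t for _, t in predicted[:batch_size]]
        spent += batch_size
        # Assay: a noisy measurement of each selected compound.
        measured = [(t + rng.gauss(0.0, assay_sigma), t) for t in batch]
        best_measured, best_true = max(measured)
        if best_measured >= target_pic50:
            return spent
        # Promote the best measured compound if it beats the current lead.
        if best_measured > lead_measured:
            lead_measured, lead_true = best_measured, best_true
    return None


def estimate_utility(model_sigma, n_trials=500, seed=0, **campaign_kwargs):
    """Estimate (probability of success within budget,
    mean cost in compounds among successful campaigns)."""
    rng = random.Random(seed)
    costs = [run_campaign(model_sigma, rng, **campaign_kwargs)
             for _ in range(n_trials)]
    successes = [c for c in costs if c is not None]
    p_success = len(successes) / n_trials
    mean_cost = statistics.mean(successes) if successes else None
    return p_success, mean_cost
```

Calling `estimate_utility(0.5)` and `estimate_utility(2.0)` compares two levels of model error under the same budget. The numbers this toy produces depend entirely on its assumed defaults and are not expected to reproduce the paper's tenfold result. Note also that the mean cost is conditional on success, which is only one of several ways to define expected cost.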