Predicting Three-Component Reaction Outcomes from 40k Miniaturized Reactant Combinations

Jeffrey Bode, Julian Götz, Euan Richards, Yu Takahashi, Yi-Lin Huang, Louis Bertschi, Bertran Rubi, Iain Stepek
DOI: https://doi.org/10.26434/chemrxiv-2024-5328b
2024-05-13
Abstract: Efficient drug discovery relies on accessing diverse small molecules expediently and reliably. Improvements to reliability through machine learning predictions are hampered by poor availability of high-quality reaction data. Here, we introduce an on-demand synthesis platform based on a three-component reaction that delivers drug-like molecules overnight. Miniaturization and automation enable the execution and analysis of 50,000 reactions on a 3 microliter scale with distinct substrates, producing the largest public reaction outcome dataset. With machine learning, we accurately predict the result of unknown reactions and analyze the impact of data set size on model training. This study advances the on-demand synthesis of drug-like molecules through concatenating chemoselective reactions and provides a sufficiently large data set to critically evaluate emerging machine learning approaches to predicting chemical reactivity.
Chemistry
What problem does this paper attempt to address?
The paper addresses the still-unresolved problem of predicting the outcomes of organic reactions with data-driven methods. Machine learning holds promise for making synthesis in drug discovery more reliable, but the scarcity of high-quality reaction data limits its development.
To close this gap, the research team designed an on-demand synthesis platform based on a three-component reaction that yields drug-like molecules. Through miniaturization and automation they performed 50,000 reactions and used liquid chromatography-high-resolution mass spectrometry (LC-HRMS) to track the formation of eight different products for each combination. The result is the largest publicly available reaction outcome dataset to date, covering approximately 40,000 reactant combinations. The study also covers reaction development and automation, including the selection of specific reactants and the optimization of reaction conditions for high-throughput synthesis.
These data were used to train machine learning models that predict the outcomes of unknown reactions and to evaluate how prediction quality depends on model type and dataset size. Experimental validation showed that the models performed well, particularly for the major product A, where accuracy reached up to 99%. Predictive models could still be trained on a very sparse dataset (approximately 1% of the combination space). As the dataset grew, however, chemistry-aware models such as XGB/FP (gradient-boosted trees on molecular fingerprints) gradually outperformed simple models such as FFN/OHE (a feed-forward network on one-hot-encoded reactant identities) in complex prediction tasks.
Overall, the study advances the on-demand synthesis of drug-like molecules and provides a large experimental dataset that can serve as a benchmark for evaluating emerging machine learning approaches to predicting chemical reactivity.
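To make the "sparse dataset" and "OHE" ideas concrete, the following is a minimal sketch, not the authors' pipeline. It assumes a hypothetical library of 10 building blocks per component (the real library sizes are not stated here), draws roughly 1% of the resulting combination space as a training set, and encodes each combination by concatenating three one-hot vectors. All names (`SIZES`, `encode_combination`, `sample_sparse_training_set`) are illustrative.

```python
import itertools
import random

# Hypothetical number of building blocks for each of the three components.
# These values are assumptions for illustration, not the paper's library sizes.
SIZES = (10, 10, 10)


def one_hot(index, size):
    """Return a list of length `size` with a single 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec


def encode_combination(combo):
    """Concatenate one one-hot vector per component of the combination."""
    features = []
    for index, size in zip(combo, SIZES):
        features.extend(one_hot(index, size))
    return features


def sample_sparse_training_set(fraction, seed=0):
    """Draw a random subset covering `fraction` of all combinations."""
    space = list(itertools.product(*(range(size) for size in SIZES)))
    rng = random.Random(seed)
    k = max(1, round(fraction * len(space)))
    return rng.sample(space, k)


# About 1% of the 10 x 10 x 10 = 1000 possible combinations.
train = sample_sparse_training_set(0.01)
features = [encode_combination(combo) for combo in train]
print(len(train), len(features[0]))  # prints "10 30"
```

A one-hot encoding like this tells a model only which building blocks were used, not what they look like chemically. A fingerprint-based representation would instead describe each reactant's structure, which typically requires a cheminformatics toolkit and is not shown here.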