Abstract: Numerous disciplines, such as image recognition and language translation, have been revolutionized by using machine learning (ML) to leverage big data. In organic synthesis, providing accurate chemical reactivity predictions with supervised ML could assist chemists with reaction prediction, optimization, and mechanistic interrogation.

To apply supervised ML to chemical reactions, one needs to define the object of prediction (e.g., yield, enantioselectivity, solubility, or a recommendation) and represent reactions with descriptive data. Our group's effort has focused on representing chemical reactions using DFT-derived physical features of the reacting molecules and conditions, which serve as features for building supervised ML models.

In this Account, we present a review and perspective on three studies conducted by our group where ML models have been employed to predict reaction yield. First, we focus on a small reaction data set where 16 phosphine ligands were evaluated in a single Ni-catalyzed Suzuki–Miyaura cross-coupling reaction, and the reaction yield was modeled with linear regression. In this setting, where the regression complexity is strongly limited by the amount of available data, we emphasize the importance of identifying single features that are directly relevant to reactivity. Next, we focus on models trained on two larger data sets obtained with high-throughput experimentation (HTE). With hundreds to thousands of reactions available, more complex models can be explored, for example, models that algorithmically perform feature selection from a broad set of candidate features. We examine how a variety of ML algorithms model these data sets and how well these models generalize to out-of-sample substrates.
Specifically, we compare the ML models that use DFT-based featurization to a baseline model that is obtained with features that carry no physical information, that is, random features, and to a naive non-ML model that averages yields of reactions that share the same conditions and substrate combinations. We find that for only one of the two data sets, DFT-based featurization leads to a significant, although moderate, out-of-sample prediction improvement. The source of this improvement was further isolated to specific features, which allowed us to formulate a testable mechanistic hypothesis that was validated experimentally. Finally, we offer remarks on supervised ML model building on HTE data sets, focusing on algorithmic improvements in model training.

Statistical methods in chemistry have a rich history, but only recently has ML gained widespread attention in reaction development. As the untapped potential of ML is explored, novel tools are likely to arise from future research. Our studies suggest that supervised ML can lead to improved predictions of reaction yield over simpler modeling methods and facilitate mechanistic understanding of reaction dynamics. However, further research and development are required to establish ML as an indispensable tool in reactivity modeling.

The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.accounts.0c00770.

Optimized hyperparameter table (PDF)
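
The two baselines described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy records, the component names, the choice of additive as the held-out component, and every function name below are assumptions made only for illustration. The sketch holds out all reactions containing one additive, predicts their yields by averaging training reactions that share the remaining components, and builds random feature vectors that encode component identity but no physical information.

```python
import random
from collections import defaultdict
from statistics import mean

# Hypothetical toy records: (aryl_halide, ligand, additive, yield_percent).
# All names and yields are invented for illustration only.
REACTIONS = [
    ("ArCl", "L1", "A1", 10.0), ("ArCl", "L1", "A2", 20.0),
    ("ArCl", "L2", "A1", 30.0), ("ArCl", "L2", "A2", 40.0),
    ("ArBr", "L1", "A1", 50.0), ("ArBr", "L1", "A2", 60.0),
    ("ArBr", "L2", "A1", 70.0), ("ArBr", "L2", "A2", 80.0),
    ("ArCl", "L1", "A3", 25.0), ("ArBr", "L2", "A3", 85.0),
]


def split_out_of_sample(reactions, held_out_additives):
    """Put every reaction with a held-out additive into the test set."""
    train = [r for r in reactions if r[2] not in held_out_additives]
    test = [r for r in reactions if r[2] in held_out_additives]
    return train, test


def fit_naive_average(train):
    """Naive non-ML baseline: average training yields of reactions that
    share the same halide and ligand (assumed grouping for this sketch)."""
    groups = defaultdict(list)
    for halide, ligand, _additive, y in train:
        groups[(halide, ligand)].append(y)
    table = {key: mean(ys) for key, ys in groups.items()}
    global_mean = mean(r[3] for r in train)  # fallback for unseen groups

    def predict(reaction):
        return table.get((reaction[0], reaction[1]), global_mean)

    return predict


def random_featurizer(reactions, n_features=4, seed=0):
    """Assign each component name a fixed random vector, so the features
    identify components without carrying physical information."""
    rng = random.Random(seed)
    names = sorted({name for r in reactions for name in r[:3]})
    vectors = {n: [rng.random() for _ in range(n_features)] for n in names}

    def featurize(reaction):
        # Concatenate the vectors of halide, ligand, and additive.
        return [x for name in reaction[:3] for x in vectors[name]]

    return featurize


def rmse(predict, test):
    """Root-mean-square error of a predictor on held-out reactions."""
    return mean((predict(r) - r[3]) ** 2 for r in test) ** 0.5


train, test = split_out_of_sample(REACTIONS, {"A3"})
naive = fit_naive_average(train)
featurize = random_featurizer(REACTIONS)
naive_rmse = rmse(naive, test)
print(f"naive baseline RMSE: {naive_rmse:.1f}")
```

In a comparison of this kind, the same regression algorithm would be trained once on DFT-derived descriptors and once on the output of `featurize`, and both would be scored on the held-out reactions alongside the naive average. Any improvement of the DFT-based model over both baselines is then attributable to the physical content of the features rather than to component identity alone.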
