Abstract: Numerous disciplines, such as image recognition and language translation, have been revolutionized by using machine learning (ML) to leverage big data. In organic synthesis, providing accurate chemical reactivity predictions with supervised ML could assist chemists with reaction prediction, optimization, and mechanistic interrogation.

To apply supervised ML to chemical reactions, one needs to define the object of prediction (e.g., yield, enantioselectivity, solubility, or a recommendation) and represent reactions with descriptive data. Our group's effort has focused on representing chemical reactions using DFT-derived physical features of the reacting molecules and conditions, which serve as features for building supervised ML models.

In this Account, we present a review and perspective on three studies conducted by our group where ML models have been employed to predict reaction yield. First, we focus on a small reaction data set where 16 phosphine ligands were evaluated in a single Ni-catalyzed Suzuki–Miyaura cross-coupling reaction, and the reaction yield was modeled with linear regression. In this setting, where the regression complexity is strongly limited by the amount of available data, we emphasize the importance of identifying single features that are directly relevant to reactivity. Next, we focus on models trained on two larger data sets obtained with high-throughput experimentation (HTE). With hundreds to thousands of reactions available, more complex models can be explored, for example, models that algorithmically perform feature selection from a broad set of candidate features. We examine how a variety of ML algorithms model these data sets and how well these models generalize to out-of-sample substrates.
Specifically, we compare the ML models that use DFT-based featurization to a baseline model that is obtained with features that carry no physical information, that is, random features, and to a naive non-ML model that averages yields of reactions that share the same conditions and substrate combinations. We find that for only one of the two data sets, DFT-based featurization leads to a significant, although moderate, out-of-sample prediction improvement. The source of this improvement was further isolated to specific features which allowed us to formulate a testable mechanistic hypothesis that was validated experimentally. Finally, we offer remarks on supervised ML model building on HTE data sets focusing on algorithmic improvements in model training.

Statistical methods in chemistry have a rich history, but only recently has ML gained widespread attention in reaction development. As the untapped potential of ML is explored, novel tools are likely to arise from future research. Our studies suggest that supervised ML can lead to improved predictions of reaction yield over simpler modeling methods and facilitate mechanistic understanding of reaction dynamics. However, further research and development is required to establish ML as an indispensable tool in reactivity modeling.

The Supporting Information (optimized hyperparameter table, PDF) is available free of charge at https://pubs.acs.org/doi/10.1021/acs.accounts.0c00770.
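
The comparison described in the abstract (physically meaningful features versus random features versus a naive averaging model, all scored on held-out substrates) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are synthetic, the descriptor tables are random stand-ins for DFT-derived features, closed-form ridge regression stands in for the ML algorithms the authors evaluated, and the averaging baseline is simplified to the mean training yield under the same conditions. The names `ridge_fit_predict`, `rmse`, and `compare_featurizations` are hypothetical.

```python
import numpy as np


def ridge_fit_predict(X_train, y_train, X_test, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    mu_x, mu_y = X_train.mean(axis=0), y_train.mean()
    Xc = X_train - mu_x
    gram = Xc.T @ Xc + alpha * np.eye(Xc.shape[1])
    w = np.linalg.solve(gram, Xc.T @ (y_train - mu_y))
    return (X_test - mu_x) @ w + mu_y


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def compare_featurizations(n_substrates=12, n_conditions=8, n_desc=5, seed=0):
    rng = np.random.default_rng(seed)

    # Stand-ins for DFT-derived descriptors (synthetic, NOT real chemistry).
    sub_desc = rng.normal(size=(n_substrates, n_desc))
    cond_desc = rng.normal(size=(n_conditions, n_desc))
    # Random features of the same shape that carry no physical information.
    sub_rand = rng.normal(size=(n_substrates, n_desc))
    cond_rand = rng.normal(size=(n_conditions, n_desc))

    # Full substrate x condition grid, one row per synthetic "reaction".
    grid = np.meshgrid(np.arange(n_substrates), np.arange(n_conditions),
                       indexing="ij")
    s_idx, c_idx = grid[0].ravel(), grid[1].ravel()

    # Synthetic yield that depends on the descriptors plus noise.
    w_s, w_c = rng.normal(size=n_desc), rng.normal(size=n_desc)
    y = (50.0 + 10.0 * (sub_desc[s_idx] @ w_s)
         + 10.0 * (cond_desc[c_idx] @ w_c)
         + rng.normal(scale=5.0, size=s_idx.size))
    y = np.clip(y, 0.0, 100.0)

    X_desc = np.hstack([sub_desc[s_idx], cond_desc[c_idx]])
    X_rand = np.hstack([sub_rand[s_idx], cond_rand[c_idx]])

    # Out-of-sample split: hold out every reaction of a few substrates.
    held_out = rng.choice(n_substrates, size=n_substrates // 4, replace=False)
    test = np.isin(s_idx, held_out)
    train = ~test

    # Simplified naive baseline: mean training yield under the same conditions.
    cond_mean = np.array([y[train & (c_idx == c)].mean()
                          for c in range(n_conditions)])

    return {
        "descriptors": rmse(y[test], ridge_fit_predict(
            X_desc[train], y[train], X_desc[test])),
        "random": rmse(y[test], ridge_fit_predict(
            X_rand[train], y[train], X_rand[test])),
        "condition_mean": rmse(y[test], cond_mean[c_idx[test]]),
    }


if __name__ == "__main__":
    for name, err in compare_featurizations().items():
        print(f"{name}: held-out RMSE = {err:.2f}")
```

Holding out whole substrates, rather than random reactions, is what makes the test out-of-sample in the sense used above. Because the synthetic yields are generated from the descriptors, any advantage the descriptor model shows here is built in by construction and says nothing about real data sets.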

Predicting Reaction Yields via Supervised Learning