Integrating Machine Learning and Large Language Models to Advance Exploration of Electrochemical Reactions

Klavs F. Jensen, Zhiling Zheng, Federico Florit, Brooke Jin, Haoyang Wu, Shih-Cheng Li, Kakasaheb Y. Nandiwale, Chase A. Salazar, Jason G. Mustakis, William H. Green

DOI: https://doi.org/10.26434/chemrxiv-2024-pk105-v2

2024-08-29

Abstract: Electrochemical C-H oxidation reactions offer a sustainable route to functionalize hydrocarbons, yet the identification of competent substrates and their synthesis optimization remains challenging. Here, we report an integrated approach combining machine learning (ML) and large language models (LLMs) to streamline the exploration of electrochemical C-H oxidation reactions. Utilizing a batch rapid screening electrochemical platform, we evaluated a wide range of reactions, initially classifying substrates by their reactivity, while LLMs text-mined literature data to augment the training set. The resulting ML models, one for reactivity prediction and the other for site selectivity, both achieved high accuracy (>90%) and enabled virtual screening of a large set of commercially available molecules. To optimize reaction conditions for substrates of interest identified by the screening, LLMs were prompted to generate code to iteratively improve yield, lowering the barrier for scientists to access ML programs, and this strategy efficiently identified high-yield conditions for eight drug-like substances or intermediates. Notably, we benchmarked the accuracy and reliability of 10 different LLMs, including Llama, Claude, and GPT-4, on generating and executing ML-related code based on natural language prompts given by chemists, to showcase their tool-making and tool-using capabilities and their potential for accelerating research across four diverse tasks. In addition, we collected an experimental benchmark dataset comprising 1071 reaction conditions and yields for electrochemical C-H oxidation reactions, and our findings revealed that integrating LLMs and ML outperformed using either method alone. We envision that this combined approach offers a robust and generalizable pathway for advancing synthetic chemistry research.

Chemistry

What problem does this paper attempt to address?

This paper attempts to solve two main problems in electrochemical C-H oxidation reactions:

1. **Identifying suitable substrates**: Electrochemical C-H oxidation reactions provide a sustainable route for functionalizing hydrocarbons, but selecting substrates that can effectively participate in the reaction remains a challenge. Specifically, researchers need to determine which compounds are suitable for electrochemical oxidation.
2. **Optimizing synthesis conditions**: Even if suitable substrates are found, optimizing their synthesis conditions to obtain the best yield is also a complex problem. The traditional trial-and-error method is not only time-consuming but also resource-intensive, so a more intelligent workflow is required to accelerate this process.

To address these challenges, the authors propose a method that combines machine learning (ML) and large language models (LLMs), aiming to improve research on electrochemical C-H oxidation reactions in the following ways:

- **Data collection and screening**: Use a rapid-screening electrochemical platform to evaluate a series of reactions and supplement the training set through literature mining to ensure that the data set contains both positive and negative samples, thereby improving the accuracy of the prediction model.
- **Constructing prediction models**: Develop two ML models, one for predicting the reactivity of substrates and the other for predicting the selectivity of specific carbon atoms. Both models achieve an accuracy of over 90%, making it possible to virtually screen a large number of commercial molecules.
- **Optimizing reaction conditions**: For the screened substrates of interest, use an active-learning protocol to iteratively optimize the reaction conditions and quickly find high-yield synthesis schemes.
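The iterative optimization of reaction conditions described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the condition grid (`CURRENTS_MA`, `TEMPS_C`), the synthetic `toy_yield` function standing in for a real experiment, and the simple nearest-neighbor acquisition rule, which is swapped in here in place of whatever model the paper actually uses.

```python
import itertools
import math
import random

# Hypothetical discrete condition grid (names and values are illustrative only).
CURRENTS_MA = [2.0, 4.0, 6.0, 8.0, 10.0]
TEMPS_C = [20.0, 30.0, 40.0, 50.0]
GRID = list(itertools.product(CURRENTS_MA, TEMPS_C))


def toy_yield(cond):
    """Stand-in for a real experiment: a smooth surface peaking at (6 mA, 40 C)."""
    current, temp = cond
    return 90.0 * math.exp(-((current - 6.0) / 3.0) ** 2 - ((temp - 40.0) / 15.0) ** 2)


def scaled_distance(a, b):
    """Euclidean distance after dividing each variable by its grid range."""
    return math.hypot((a[0] - b[0]) / 8.0, (a[1] - b[1]) / 30.0)


def acquisition(cond, observed, explore_weight=20.0):
    """Yield of the nearest measured condition plus a bonus for being far from it."""
    nearest = min(observed, key=lambda c: scaled_distance(cond, c))
    return observed[nearest] + explore_weight * scaled_distance(cond, nearest)


def optimize(run_experiment, n_init=3, n_rounds=8, seed=0):
    """Measure a few random conditions, then repeatedly choose the next one to run."""
    rng = random.Random(seed)
    observed = {c: run_experiment(c) for c in rng.sample(GRID, n_init)}
    for _ in range(n_rounds):
        untested = [c for c in GRID if c not in observed]
        if not untested:
            break
        best_next = max(untested, key=lambda c: acquisition(c, observed))
        observed[best_next] = run_experiment(best_next)
    best = max(observed, key=observed.get)
    return best, observed[best], observed


best_cond, best_yield, history = optimize(toy_yield)
print(best_cond, round(best_yield, 1), len(history))
```

In a real campaign, `run_experiment` would be replaced by an actual measurement, and the acquisition rule by a properly validated surrogate model; the loop structure (measure, update, pick the next condition) is the only part this sketch is meant to convey.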
In addition, the authors also evaluated the performance of different LLMs in generating and executing ML-related code, demonstrating their versatility as tool creators, users, and tools themselves, thereby accelerating the reaction discovery and optimization processes in chemical research. Through this method, the research team successfully combined the advantages of ML and LLMs, providing a powerful and general-purpose path for advancing synthetic chemistry research.
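One way to picture the evaluation of LLM-generated code is a small harness that executes each candidate snippet and checks its outputs against known answers. The sketch below is hypothetical: the `evaluate_snippet` helper, the hard-coded `CANDIDATES` strings (standing in for real LLM outputs), and the `mean_yield` task are all illustrative assumptions and are not taken from the paper's benchmark.

```python
def evaluate_snippet(source, func_name, test_cases):
    """Run generated source in a fresh namespace and score it on test cases."""
    namespace = {}
    try:
        exec(source, namespace)  # only run untrusted code inside a sandbox
    except Exception as err:
        return {"executed": False, "passed": 0, "error": type(err).__name__}
    func = namespace.get(func_name)
    if not callable(func):
        return {"executed": True, "passed": 0, "error": "missing function"}
    passed = 0
    for args, expected in test_cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a runtime failure counts as a failed test case
    return {"executed": True, "passed": passed, "error": None}


# Hard-coded strings standing in for code returned by three hypothetical models.
CANDIDATES = {
    "model_a": "def mean_yield(ys):\n    return sum(ys) / len(ys)\n",
    "model_b": "def mean_yield(ys):\n    return sum(ys) / (len(ys) - 1)\n",
    "model_c": "def mean_yield(ys)\n    return sum(ys) / len(ys)\n",  # missing colon
}

# Each test case is (positional arguments, expected return value).
TESTS = [(([10.0, 20.0, 30.0],), 20.0), (([50.0],), 50.0)]

results = {
    name: evaluate_snippet(src, "mean_yield", TESTS)
    for name, src in CANDIDATES.items()
}
for name, outcome in results.items():
    print(name, outcome)
```

Separating "did the code execute" from "did it give the right answer" mirrors the distinction between reliability and accuracy; a real benchmark would also need sandboxing, timeouts, and repeated sampling per prompt.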

Machine Learning-Guided Yield Optimization for Palladaelectro-Catalyzed Annulation Reaction

Machine-Learning-Guided Discovery of Electrochemical Reactions

Automated electrosynthesis reaction mining with multimodal large language models (MLLMs)

Theoretical Calculation Assisted by Machine Learning Accelerate Optimal Electrocatalyst Finding for Hydrogen Evolution Reaction

Machine Learning Enables a Top-Down Approach to Mechanistic Elucidation

Automation and Machine Learning Augmented by Large Language Models in Catalysis Study

How Machine Learning Can Accelerate Electrocatalysis Discovery and Optimization

Developing General Reactive Element-Based Machine Learning Potentials as the Main Computational Engine for Heterogeneous Catalysis

When machine learning meets molecular synthesis

Unlocking the potential: machine learning applications in electrocatalyst design for electrochemical hydrogen energy transformation

ReLM: Leveraging Language Models for Enhanced Chemical Reaction Prediction

Fine-tuning Large Language Models for Chemical Text Mining

Harnessing Electro-Descriptors for Mechanistic and Machine Learning Analysis of Photocatalytic Organic Reactions

Combination of Rapid Intrinsic Activity Measurements and Machine Learning as a Screening Approach for Multicomponent Electrocatalysts

Toward Excellence of Electrocatalyst Design by Emerging Descriptor‐Oriented Machine Learning

An Automatic End-to-end Chemical Synthesis Development Platform Powered by Large Language Models

Predicting Reaction Yields via Supervised Learning