Integrating Machine Learning and Large Language Models to Advance Exploration of Electrochemical Reactions

Klavs F Jensen, Zhiling Zheng, Federico Florit, Brooke Jin, Haoyang Wu, Shih-Cheng Li, Kakasaheb Y. Nandiwale, Chase A. Salazar, Jason G. Mustakis, William H. Green
DOI: https://doi.org/10.26434/chemrxiv-2024-pk105-v2
2024-08-29
Abstract: Electrochemical C-H oxidation reactions offer a sustainable route to functionalize hydrocarbons, yet the identification of competent substrates and their synthesis optimization remains challenging. Here, we report an integrated approach combining machine learning (ML) and large language models (LLMs) to streamline the exploration of electrochemical C-H oxidation reactions. Utilizing a batch rapid screening electrochemical platform, we evaluated a wide range of reactions, initially classifying substrates by their reactivity, while LLMs text-mined literature data to augment the training set. The resulting ML models, one for reactivity prediction and the other for site selectivity, both achieved high accuracy (>90%) and enabled virtual screening of a large set of commercially available molecules. To optimize reaction conditions for substrates of interest identified in the screening, LLMs were prompted to generate code to iteratively improve yield, lowering the barrier for scientists to access ML programs, and this strategy efficiently identified high-yield conditions for eight drug-like substances or intermediates. Notably, we benchmarked the accuracy and reliability of 10 different LLMs, including Llama, Claude, and GPT-4, on generating and executing ML-related code from natural language prompts given by chemists, to showcase their tool-making and tool-using capabilities and their potential for accelerating research across four diverse tasks. In addition, we collected an experimental benchmark dataset comprising 1071 reaction conditions and yields for electrochemical C-H oxidation reactions, and our findings revealed that integrating LLMs and ML outperformed using either method alone. We envision that this combined approach offers a robust and generalizable pathway for advancing synthetic chemistry research.
Chemistry
What problem does this paper attempt to address?
This paper attempts to solve two main problems in electrochemical C-H oxidation reactions:

1. **Identifying suitable substrates**: Electrochemical C-H oxidation reactions provide a sustainable route for functionalizing hydrocarbons, but selecting substrates that can effectively participate in the reaction remains a challenge. Specifically, researchers need to determine which compounds are suitable for electrochemical oxidation.
2. **Optimizing synthesis conditions**: Even if suitable substrates are found, how to optimize their synthesis conditions to obtain the best yield is also a complex problem. The traditional trial-and-error method is not only time-consuming but also resource-intensive, so a more intelligent workflow is required to accelerate this process.

To address these challenges, the authors propose a method that combines machine learning (ML) and large language models (LLMs), aiming to improve the research on electrochemical C-H oxidation reactions in the following ways:

- **Data collection and screening**: Use a rapid-screening electrochemical platform to evaluate a series of reactions and supplement the training set through literature mining to ensure that the data set contains both positive and negative samples, thereby improving the accuracy of the prediction model.
- **Constructing prediction models**: Develop two ML models, one for predicting the reactivity of substrates and the other for predicting the selectivity of specific carbon atoms. Both models achieve an accuracy rate of over 90%, making it possible to virtually screen a large number of commercial molecules.
- **Optimizing reaction conditions**: For the screened substrates of interest, use an active-learning protocol to iteratively optimize the reaction conditions and quickly find high-yield synthesis schemes.
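The iterative optimization step can be illustrated with a minimal sketch. The summary does not specify the surrogate model, the acquisition rule, or the condition variables, so everything below is an assumption made for illustration and not the authors' method: the two variables (current and passed charge), their levels, the nearest-neighbour yield estimate with a distance-based exploration bonus, and the `run_experiment` stand-in that returns a made-up yield.

```python
import random

# Hypothetical discrete search space; variables and levels are illustrative,
# not the conditions studied in the paper.
CURRENTS_MA = [2.0, 4.0, 6.0, 8.0, 10.0]
CHARGES_F_PER_MOL = [2.0, 3.0, 4.0, 5.0]
CANDIDATES = [(i, q) for i in CURRENTS_MA for q in CHARGES_F_PER_MOL]


def run_experiment(condition):
    """Stand-in for a real electrolysis run; returns a made-up yield in percent."""
    current, charge = condition
    return max(0.0, 90.0 - 3.0 * (current - 6.0) ** 2 - 8.0 * (charge - 4.0) ** 2)


def distance(a, b):
    """Euclidean distance between two condition tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def acquisition(candidate, observed, explore_weight=5.0):
    """Yield of the nearest tested condition plus a bonus for being far from it."""
    nearest = min(observed, key=lambda c: distance(candidate, c))
    return observed[nearest] + explore_weight * distance(candidate, nearest)


def optimize(n_initial=3, n_rounds=8, seed=0):
    """Test a few random conditions, then pick one new condition per round."""
    rng = random.Random(seed)
    observed = {c: run_experiment(c) for c in rng.sample(CANDIDATES, n_initial)}
    for _ in range(n_rounds):
        untested = [c for c in CANDIDATES if c not in observed]
        if not untested:
            break
        next_condition = max(untested, key=lambda c: acquisition(c, observed))
        observed[next_condition] = run_experiment(next_condition)
    best = max(observed, key=observed.get)
    return best, observed[best], observed
```

Calling `optimize()` returns the best tested condition, its simulated yield, and the full record of tested conditions. In a laboratory setting, `run_experiment` would be replaced by a measured yield, and the simple acquisition rule by whatever model the chemist (or an LLM-generated script) chooses.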
In addition, the authors also evaluated the performance of different LLMs in generating and executing ML-related code, demonstrating their versatility as tool creators, users, and tools themselves, thereby accelerating the reaction discovery and optimization processes in chemical research. Through this method, the research team successfully combined the advantages of ML and LLMs, providing a powerful and general-purpose path for advancing synthetic chemistry research.