Integrating Machine Learning and Large Language Models to Advance Exploration of Electrochemical Reactions

Klavs F Jensen, Zhiling Zheng, Federico Florit, Brooke Jin, Haoyang Wu, Shih-Cheng Li, Kakasaheb Y. Nandiwale, Chase A. Salazar, Jason G. Mustakis, William H. Green
DOI: https://doi.org/10.26434/chemrxiv-2024-pk105
2024-08-28
Abstract: Electrochemical C-H oxidation reactions offer a sustainable route to functionalize hydrocarbons, yet the identification of competent substrates and their synthesis optimization remain challenging. Here, we report an integrated approach combining machine learning (ML) and large language models (LLMs) to streamline the exploration of electrochemical C-H oxidation reactions. Utilizing a batch rapid screening electrochemical platform, we evaluated a wide range of reactions, initially classifying substrates by their reactivity, while LLMs text-mined literature data to augment the training set. The resulting ML models, one for reactivity prediction and the other one for site selectivity, both achieved high accuracy (>90%) and enabled virtual screening of a large set of commercially available molecules. To optimize reaction conditions of substrates of interest upon the screening, LLMs were prompted to generate code to iteratively improve yield, lowering the barrier for scientists to access ML programs, and this strategy efficiently identified high-yield conditions for eight drug-like substances or intermediates. Notably, we benchmarked the accuracy and reliability of 10 different LLMs, including Llama, Claude, and GPT-4, on generating and executing code related to ML based on natural language prompts given by chemists to showcase their tool-making and tool-using capabilities and potential for accelerating research across four diverse tasks. In addition, we collected an experimental benchmark dataset comprising 1071 reaction conditions and yields for electrochemical C-H oxidation reactions, and our findings revealed that integrating LLMs and ML outperformed using either method alone. We envision that this combined approach offers a robust and generalizable pathway for advancing synthetic chemistry research.
Chemistry
What problem does this paper attempt to address?
This paper addresses two main problems in electrochemical C-H oxidation reactions:

1. **Identifying suitable substrates**: How to pick out, from a large pool of candidate compounds, those that undergo electrochemical C-H oxidation. Traditional experimental screening relies on extensive trial and error, which is time-consuming and resource-intensive. The researchers therefore use machine learning (ML) and large language models (LLMs) to predict which substrates are suitable for this reaction.
2. **Optimizing reaction conditions**: How to quickly find, for the selected substrates, the reaction conditions that maximize yield. Traditional methods also depend on many experimental attempts and are inefficient. The researchers combine ML and LLMs with an active-learning strategy to optimize reaction conditions iteratively, reducing the number of experiments required.

### Specific methods

- **Data collection and annotation**:
  - A 24-well-plate rapid-screening electrochemical platform was developed to evaluate the reactivity of multiple substrates, and the products were confirmed by NMR spectroscopy.
  - LLMs were used to mine relevant reaction data from the literature to supplement the experimental data set, ensuring the diversity and balance of the data.
- **Machine-learning model training**:
  - Two ML models were constructed: one predicts substrate reactivity (whether electrochemical C-H oxidation can occur), and the other predicts site selectivity (which carbon atom will be oxidized).
  - The models were trained on both experimental and literature data. After optimization, the accuracy of both models exceeded 90%, with AUC values of 97.2% and 98.1%, respectively.
- **Code generation and automation**:
  - LLMs were used to generate code automatically, helping synthetic chemists process data, optimize reaction conditions, and directly control laboratory equipment (such as liquid-handling robots).
  - Ten different LLMs were benchmarked across different tasks; the results showed that LLMs generate code with high accuracy and reliability.
- **Active-learning optimization**:
  - An active-learning strategy was applied to optimize reaction conditions step by step through iterative experiments, ultimately achieving a significant increase in yield. For example, in the electrochemical oxidation of α-pinene, random selection of conditions reached a yield of only about 20%, while the method combining LLMs and ML achieved a yield of over 60%.

### Summary

This study demonstrates the synergistic potential of ML and LLMs in electrochemical C-H oxidation reactions. The combined approach improves the efficiency of both substrate screening and reaction-condition optimization, and it offers a new paradigm for future synthetic-chemistry research. It can accelerate the discovery of new reactions and reduce dependence on the experience of individual experimentalists.
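
The iterative loop described under "Active-learning optimization" can be illustrated with a minimal Python sketch. This summary does not state which surrogate model, acquisition rule, or condition variables the paper used, so everything below is an assumption made for illustration: the condition grid (`CURRENTS_MA`, `TIMES_H`, `ELECTROLYTE_M`), the synthetic `run_experiment` function that stands in for a real measurement, and the nearest-neighbour surrogate with a distance-based exploration bonus.

```python
import itertools
import math
import random

# Hypothetical condition grid; variable names and values are illustrative only.
CURRENTS_MA = [2.0, 4.0, 6.0, 8.0, 10.0]
TIMES_H = [1.0, 2.0, 4.0, 8.0]
ELECTROLYTE_M = [0.05, 0.1, 0.2]
GRID = list(itertools.product(CURRENTS_MA, TIMES_H, ELECTROLYTE_M))


def run_experiment(cond):
    """Stand-in for a real measurement: a made-up smooth yield surface in percent."""
    current, time_h, conc = cond
    peak = math.exp(-((current - 6.0) / 3.0) ** 2) * math.exp(-((time_h - 4.0) / 3.0) ** 2)
    return 100.0 * peak * (0.6 + 2.0 * conc)


def scale(cond):
    """Map each variable to [0, 1] so distances are comparable across units."""
    axes = (CURRENTS_MA, TIMES_H, ELECTROLYTE_M)
    return tuple((v - min(a)) / (max(a) - min(a)) for v, a in zip(cond, axes))


def predict(cond, observed, k=3):
    """Surrogate: mean yield of the k nearest tested conditions, and distance to the nearest."""
    x = scale(cond)
    ranked = sorted((math.dist(x, scale(c)), y) for c, y in observed.items())
    nearest = ranked[:k]
    mean = sum(y for _, y in nearest) / len(nearest)
    return mean, nearest[0][0]


def active_learning(n_init=5, n_rounds=15, explore=30.0, seed=0):
    """Start from random conditions, then repeatedly test the most promising untested one."""
    rng = random.Random(seed)
    observed = {c: run_experiment(c) for c in rng.sample(GRID, n_init)}
    history = [max(observed.values())]  # best yield seen after each step

    def acquisition(cond):
        # Predicted yield plus a bonus for conditions far from anything tested so far.
        mean, novelty = predict(cond, observed)
        return mean + explore * novelty

    for _ in range(n_rounds):
        untested = [c for c in GRID if c not in observed]
        nxt = max(untested, key=acquisition)
        observed[nxt] = run_experiment(nxt)
        history.append(max(observed.values()))

    best = max(observed, key=observed.get)
    return best, observed[best], history


if __name__ == "__main__":
    best, best_yield, history = active_learning()
    print("best condition (mA, h, M):", best)
    print("best yield so far after each step:", [round(y, 1) for y in history])
```

In a real campaign, `run_experiment` would be replaced by the measured yield for each tested condition, and the surrogate by whatever regression model suits the data. The `explore` weight trades off testing conditions predicted to give high yield against testing conditions in unexplored regions of the grid.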