Large language model enhanced corpus of CO2 reduction electrocatalysts and synthesis procedures

Xueqing Chen, Yang Gao, Ludi Wang, Wenjuan Cui, Jiamin Huang, Yi Du, Bin Wang
DOI: https://doi.org/10.1038/s41597-024-03180-9
2024-04-07
Scientific Data
Abstract: CO2 electroreduction has garnered significant attention from both the academic and industrial communities. Extracting crucial catalyst-related information from the domain literature can help scientists find new and effective electrocatalysts. Herein, we used advanced machine learning and natural language processing (NLP) techniques, together with large language models (LLMs), to extract relevant information about the CO2 electrocatalytic reduction process from the scientific literature. By applying the extraction pipeline, we present an open-source corpus for electrocatalytic CO2 reduction. The database contains two types of corpora: (1) the benchmark corpus, a collection of 6,985 records extracted from 1,081 publications by catalysis postgraduates; and (2) the extended corpus, which consists of content extracted from 5,941 documents using traditional NLP techniques and LLM techniques. Extended Corpus I and Extended Corpus II contain 77,016 and 30,283 records, respectively. Furthermore, several LLMs fine-tuned on domain literature were developed. Overall, this work will contribute to the exploration of new and effective electrocatalysts by leveraging information from the domain literature using cutting-edge computational techniques.
multidisciplinary sciences
What problem does this paper attempt to address?
The problem this paper attempts to address is: How to utilize advanced machine learning, natural language processing techniques, and large language models (LLMs) to extract key information related to the electrocatalytic reduction of carbon dioxide from scientific literature, in order to accelerate the development of efficient electrocatalysts. Specifically, this study aims to:

1. **Construct an open-source corpus for electrocatalytic carbon dioxide reduction**: By extracting relevant information from scientific literature, establish a comprehensive database containing information on catalyst composition, synthesis methods, regulation means, and performance.
2. **Improve the accuracy and efficiency of information extraction**: Traditional methods such as named entity recognition (NER) have limitations in extracting complex and heterogeneous information. This study attempts to use large language models and prompt engineering to improve the accuracy and efficiency of information extraction.
3. **Provide structured data resources**: The generated corpus can provide detailed guidance for materials scientists, helping them develop new electrocatalysts, reducing the time and resources spent on literature review and data collection, and allowing scientists to focus more on innovation and experimentation.

Through these efforts, this study hopes to significantly accelerate research progress in the field of carbon dioxide electrocatalytic reduction, providing support for effectively mitigating greenhouse gas emissions and producing fuels and chemicals.
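To make the idea of prompt-based extraction into a structured record more concrete, here is a minimal Python sketch. It is not the authors' pipeline. The field names (`composition`, `synthesis_method`, `regulation_method`, `performance`) are hypothetical labels derived from the four kinds of information listed above, and the prompt wording is an assumption. No real LLM API is called; the model reply is represented by a plain JSON string.

```python
import json
from dataclasses import dataclass

# Hypothetical field names, chosen to mirror the four kinds of information
# named in the summary; the published corpus may use a different schema.
FIELDS = ("composition", "synthesis_method", "regulation_method", "performance")


@dataclass
class CatalystRecord:
    composition: str = ""
    synthesis_method: str = ""
    regulation_method: str = ""
    performance: str = ""


def build_prompt(paragraph: str) -> str:
    """Compose an extraction prompt that asks for a JSON object with fixed keys."""
    keys = ", ".join(FIELDS)
    return (
        "Extract CO2 electroreduction catalyst information from the text below.\n"
        f"Return one JSON object with exactly these keys: {keys}.\n"
        "Use an empty string for anything the text does not state.\n\n"
        f"Text: {paragraph}"
    )


def parse_reply(reply: str) -> CatalystRecord:
    """Parse a JSON reply, keeping only the known keys and coercing values to str."""
    data = json.loads(reply)
    return CatalystRecord(**{key: str(data.get(key, "")) for key in FIELDS})
```

In use, the string returned by `build_prompt` would be sent to whichever model is available, and the reply would be passed to `parse_reply`. Unknown keys in the reply are dropped and missing keys become empty strings, so every record has the same shape.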