Large language model enhanced corpus of CO2 reduction electrocatalysts and synthesis procedures

Xueqing Chen,Yang Gao,Ludi Wang,Wenjuan Cui,Jiamin Huang,Yi Du,Bin Wang

DOI: https://doi.org/10.1038/s41597-024-03180-9

2024-04-07

Scientific Data

Abstract: CO2 electroreduction has garnered significant attention from both the academic and industrial communities. Extracting crucial catalyst-related information from domain literature can help scientists find new and effective electrocatalysts. Herein, we used advanced machine learning, natural language processing (NLP) techniques, and large language models (LLMs) to extract relevant information about the CO2 electrocatalytic reduction process from scientific literature. By applying the extraction pipeline, we present an open-source corpus for electrocatalytic CO2 reduction. The database contains two types of corpus: (1) the benchmark corpus, a collection of 6,985 records extracted from 1,081 publications by catalysis postgraduates; and (2) the extended corpus, which consists of content extracted from 5,941 documents using traditional NLP techniques and LLM techniques. Extended Corpus I and II contain 77,016 and 30,283 records, respectively. Furthermore, several LLMs fine-tuned on domain literature were developed. Overall, this work will contribute to the exploration of new and effective electrocatalysts by leveraging information from domain literature using cutting-edge computer techniques.

multidisciplinary sciences

What problem does this paper attempt to address?

The problem this paper attempts to address is: How to utilize advanced machine learning, natural language processing techniques, and large language models (LLMs) to extract key information related to the electrocatalytic reduction of carbon dioxide from scientific literature, in order to accelerate the development of efficient electrocatalysts. Specifically, this study aims to:

1. **Construct an open-source corpus for electrocatalytic carbon dioxide reduction**: By extracting relevant information from scientific literature, establish a comprehensive database containing information on catalyst composition, synthesis methods, regulation means, and performance.
2. **Improve the accuracy and efficiency of information extraction**: Traditional methods such as named entity recognition (NER) have limitations in extracting complex and heterogeneous information. This study attempts to use large language models and prompt engineering to improve the accuracy and efficiency of information extraction.
3. **Provide structured data resources**: The generated corpus can provide detailed guidance for materials scientists, helping them develop new electrocatalysts, reducing the time and resources spent on literature review and data collection, and allowing scientists to focus more on innovation and experimentation.

Through these efforts, this study hopes to significantly accelerate research progress in the field of carbon dioxide electrocatalytic reduction, providing support for effectively mitigating greenhouse gas emissions and producing fuels and chemicals.
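To make the idea of prompt-based extraction into a structured record more concrete, here is a minimal Python sketch. It is not the authors' pipeline: the field names, the prompt wording, and the helper names `CatalystRecord`, `build_extraction_prompt`, and `parse_response` are all assumptions chosen for illustration, and no real LLM is called. The four fields simply mirror the information types named above (composition, synthesis method, regulation means, performance).

```python
import json
from dataclasses import dataclass

# Hypothetical field names mirroring the four information types in the summary.
FIELDS = ("composition", "synthesis_method", "regulation", "performance")


@dataclass
class CatalystRecord:
    """One extracted record; every field defaults to an empty string."""
    composition: str = ""
    synthesis_method: str = ""
    regulation: str = ""
    performance: str = ""


def build_extraction_prompt(paragraph: str) -> str:
    """Build a prompt asking a model to return the four fields as JSON."""
    keys = ", ".join(f'"{name}"' for name in FIELDS)
    return (
        "Extract information about CO2 reduction electrocatalysts "
        "from the text below.\n"
        f"Answer with a single JSON object with the keys {keys}.\n"
        "Use an empty string for anything the text does not state.\n\n"
        f"Text: {paragraph}"
    )


def parse_response(raw: str) -> CatalystRecord:
    """Parse a JSON reply into a record, ignoring unexpected keys."""
    data = json.loads(raw)
    return CatalystRecord(**{name: str(data.get(name, "")) for name in FIELDS})
```

In use, the string returned by `build_extraction_prompt` would be sent to whichever model is available, and the reply passed to `parse_response`. Keeping the schema in one `FIELDS` tuple means the prompt and the parser cannot drift apart, and missing or extra keys in a reply degrade to empty fields instead of raising.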

Deep learning of electrochemical CO2 conversion literature reveals research trends and directions

Revisiting Electrocatalyst Design by a Knowledge Graph of Cu-Based Catalysts for CO2 Reduction

CataLM: Empowering Catalyst Design Through Large Language Models

Integrating Machine Learning and Large Language Models to Advance Exploration of Electrochemical Reactions

Machine Learning in Screening High Performance Electrocatalysts for CO2 Reduction

Automation and Machine Learning Augmented by Large Language Models in Catalysis Study

Correlation of HER-2/neu overexpression with mammography and age distribution in primary breast carcinomas.

Recent advances in the theoretical studies on the electrocatalytic CO2 reduction based on single and double atoms

Leveraging Data Mining, Active Learning, and Domain Adaptation in a Multi-Stage, Machine Learning-Driven Approach for the Efficient Discovery of Advanced Acidic Oxygen Evolution Electrocatalysts

Machine Learning Big Data Set Analysis Reveals C-C Electro-Coupling Mechanism

Electrolyzer and Catalysts Design from Carbon Dioxide to Carbon Monoxide Electrochemical Reduction

Unlocking New Insights for Electrocatalyst Design: A Unique Data Science Workflow Leveraging Internet-Sourced Big Data

Effect of electrolyte cation-mediated mechanism on electrocatalytic carbon dioxide reduction

Harnessing Large Language Model to collect and analyze Metal-organic framework property dataset

A Machine Learning Model on Simple Features for CO2 Reduction Electrocatalysts

Fine-tuning Large Language Models for Chemical Text Mining

Machine-Learning-Augmented Chemisorption Model for CO2 Electroreduction Catalyst Screening.

A document-level information extraction pipeline for layered cathode materials for sodium-ion batteries

Machine Learning Across Metal and Carbon Support for the Screening of Efficient Atomic Catalysts Toward CO2 Reduction