Abstract: Discovery of novel and promising materials is a critical challenge in the field of chemistry and materials science, traditionally approached through methodologies ranging from trial-and-error to machine learning-driven inverse design. Recent studies suggest that transformer-based language models can be utilized as material generative models to expand chemical space and explore materials with desired properties. In this work, we introduce the Catalyst Generative Pretrained Transformer (CatGPT), trained to generate string representations of inorganic catalyst structures from a vast chemical space. CatGPT not only demonstrates high performance in generating valid and accurate catalyst structures but also serves as a foundation model for generating desired types of catalysts by fine-tuning with sparse and specified datasets. As an example, we fine-tuned the pretrained CatGPT using a binary alloy catalyst dataset designed for screening two-electron oxygen reduction reaction (2e-ORR) catalysts and generated catalyst structures specialized for 2e-ORR. Our work demonstrates the potential of language models as generative tools for catalyst discovery.

What problem does this paper attempt to address?

The paper aims to develop a catalyst discovery method based on generative language models: a pretrained Transformer model, named CatGPT, that generates string representations of inorganic catalyst structures. The method targets the limitations and high cost of traditional material discovery approaches (such as trial-and-error or high-throughput virtual screening) by exploring a broader chemical space through machine learning-driven inverse design. CatGPT learns from a large dataset of catalysts and generates valid and accurate catalyst structures. Fine-tuning the pretrained model customizes it to generate specific types of catalysts; for example, the paper fine-tunes it on a binary alloy catalyst dataset to generate catalysts for the two-electron oxygen reduction reaction (2e-ORR).
To evaluate the quality of the generated catalyst structures, the authors developed an anomaly detection model that checks whether the generated structures are reasonable, and defined multiple evaluation metrics, including generation effectiveness, structural validity, catalyst effectiveness, and coverage, among others. A variant called CatGPT-BP avoids atomic overlap when string representations are converted into three-dimensional structures, which increases structural validity. With fine-tuning and different generation strategies (such as changing the temperature parameter and the input query), CatGPT-BP generates catalyst structures with reasonable validity and diversity.
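
The temperature parameter mentioned above controls how sharply a language model concentrates on its most likely next token. The following is a minimal sketch of temperature-scaled sampling in general, not the CatGPT implementation; the function names `temperature_probs` and `sample_token` and the example logits are assumptions made for illustration.

```python
import math
import random


def temperature_probs(logits, temperature=1.0):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def sample_token(logits, temperature=1.0, rng=None):
    """Draw one token index from the temperature-scaled distribution."""
    rng = rng or random.Random()
    probs = temperature_probs(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


if __name__ == "__main__":
    # Hypothetical next-token scores for a three-token vocabulary.
    example_logits = [2.0, 1.0, 0.0]
    for t in (0.1, 1.0, 10.0):
        rounded = [round(p, 3) for p in temperature_probs(example_logits, t)]
        print(f"T={t}: {rounded}")
    print("sampled index:", sample_token(example_logits, 1.0, random.Random(0)))
```

A low temperature makes generation close to greedy, which tends to favor validity, while a high temperature flattens the distribution and tends to favor diversity.
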
Finally, the paper validates the generated catalyst structures using machine learning potentials (MLPs) and density functional theory (DFT) calculations, confirming the model's potential for discovering new types of catalysts, especially when fine-tuned on relatively small datasets. In summary, the study demonstrates the practicality of generative language models as tools for catalyst discovery, particularly for generating new catalysts that meet specific performance requirements. It also identifies challenges with the current method, such as the difficulty of recovering complete crystal information from the generated catalyst structures and the model's limited ability to extend element diversity, which point to directions for future research.
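
The atomic-overlap criterion behind structural validity can be illustrated with a simple pairwise distance check. This is a sketch under stated assumptions, not the paper's validity test: the function `find_overlaps`, the `tolerance` factor of 0.7, and the small table of approximate covalent radii are all hypothetical choices for illustration, and periodic boundary conditions are ignored.

```python
import math
from itertools import combinations

# Approximate covalent radii in angstroms (illustrative subset only).
COVALENT_RADII = {"H": 0.31, "O": 0.66, "Pd": 1.39, "Pt": 1.36, "Au": 1.36}


def find_overlaps(symbols, positions, tolerance=0.7):
    """Return (i, j, distance) for every atom pair that sits too close.

    A pair counts as overlapping when its distance is below
    tolerance * (r_i + r_j). Positions are Cartesian coordinates in
    angstroms; periodic images are NOT considered in this sketch.
    """
    overlaps = []
    for i, j in combinations(range(len(symbols)), 2):
        distance = math.dist(positions[i], positions[j])
        cutoff = tolerance * (COVALENT_RADII[symbols[i]] + COVALENT_RADII[symbols[j]])
        if distance < cutoff:
            overlaps.append((i, j, distance))
    return overlaps


if __name__ == "__main__":
    # Two Pt atoms at a typical metallic spacing, plus an O atom placed
    # unphysically close to the first Pt atom.
    symbols = ["Pt", "Pt", "O"]
    positions = [(0.0, 0.0, 0.0), (2.77, 0.0, 0.0), (0.0, 0.0, 0.5)]
    for i, j, d in find_overlaps(symbols, positions):
        print(f"atoms {i} ({symbols[i]}) and {j} ({symbols[j]}) overlap at {d:.2f} A")
```

A generated structure with an empty overlap list would pass this particular check; a production workflow would also need to account for periodic images of the slab.
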

Generative Language Model for Catalyst Discovery
