Abstract: Discovery of novel and promising materials is a critical challenge in the field of chemistry and materials science, traditionally approached through methodologies ranging from trial-and-error to machine learning-driven inverse design. Recent studies suggest that transformer-based language models can be utilized as material generative models to expand chemical space and explore materials with desired properties. In this work, we introduce the Catalyst Generative Pretrained Transformer (CatGPT), trained to generate string representations of inorganic catalyst structures from a vast chemical space. CatGPT not only demonstrates high performance in generating valid and accurate catalyst structures but also serves as a foundation model for generating desired types of catalysts by fine-tuning with sparse and specified datasets. As an example, we fine-tuned the pretrained CatGPT using a binary alloy catalyst dataset designed for screening two-electron oxygen reduction reaction (2e-ORR) catalysts and generated catalyst structures specialized for 2e-ORR. Our work demonstrates the potential of language models as generative tools for catalyst discovery.
What problem does this paper attempt to address?
The main goal of this paper is to develop a catalyst discovery method based on generative language models. Specifically, it uses a pretrained Transformer-based model, named CatGPT, to generate string representations of inorganic catalyst structures. The method aims to overcome the limitations and high costs of traditional material discovery approaches (such as trial-and-error or high-throughput virtual screening) by exploring a broader chemical space through a machine learning-driven inverse design strategy.
More specifically, CatGPT is trained on a large dataset of catalysts and can generate valid and accurate catalyst structures. Fine-tuning the pretrained model on a small, targeted dataset customizes it to generate specific types of catalysts. For example, the paper fine-tunes the model on a binary alloy catalyst dataset so that it generates catalysts for the two-electron oxygen reduction reaction (2e-ORR).
To evaluate the quality of the generated catalyst structures, the authors developed an anomaly detection model that checks whether the generated structures are physically plausible, and defined several evaluation metrics, including generation effectiveness, structural validity, catalyst effectiveness, and coverage. Additionally, they introduced a variant called CatGPT-BP, which avoids atomic overlap when string representations are converted into three-dimensional structures and thereby increases structural validity.
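To make the idea of an atomic-overlap check concrete, the following is a minimal sketch, not the authors' implementation. It assumes a structure is given as a list of Cartesian coordinates in ångströms, treats two atoms as overlapping when they are closer than a hypothetical `min_distance` threshold, and ignores periodic boundary conditions. The function names and the 0.5 Å default are illustrative assumptions, not values taken from the paper.

```python
import math
from itertools import combinations


def has_atomic_overlap(positions, min_distance=0.5):
    """Return True if any two atoms are closer than min_distance.

    positions: list of (x, y, z) Cartesian coordinates in angstroms.
    min_distance: hypothetical overlap threshold (an assumption, not from the paper).
    Periodic images are NOT considered in this simplified check.
    """
    for a, b in combinations(positions, 2):
        if math.dist(a, b) < min_distance:
            return True
    return False


def structural_validity(structures, min_distance=0.5):
    """Fraction of structures that contain no overlapping atom pair."""
    if not structures:
        return 0.0
    valid = sum(not has_atomic_overlap(s, min_distance) for s in structures)
    return valid / len(structures)


if __name__ == "__main__":
    separated = [(0.0, 0.0, 0.0), (0.0, 0.0, 2.0)]
    overlapping = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.1)]
    print(structural_validity([separated, overlapping]))
```

A real pipeline would likely use element-dependent thresholds (for example, a fraction of the summed covalent radii) and account for periodic images, but the ratio computed here captures the spirit of a structural validity metric.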
With fine-tuning and different generation strategies (such as varying the sampling temperature and the input query), CatGPT-BP generates catalyst structures with reasonable validity and diversity. Finally, the paper validates the generated catalyst structures using machine learning potentials (MLPs) and density functional theory (DFT) calculations, confirming the model's potential for discovering new catalysts, especially when it is fine-tuned on relatively small datasets.
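The effect of the temperature parameter can be illustrated with a generic sketch of temperature-scaled sampling, the standard technique in autoregressive language models. This is not code from the paper: the function names and the toy logits are assumptions for illustration only. Lower temperatures concentrate probability on the most likely token (favoring validity), while higher temperatures flatten the distribution (favoring diversity).

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def sample_token(logits, temperature=1.0, rng=None):
    """Draw one token index from the temperature-scaled distribution."""
    rng = rng or random.Random()
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


if __name__ == "__main__":
    toy_logits = [1.0, 2.0, 3.0]  # hypothetical scores for three tokens
    for t in (0.1, 1.0, 10.0):
        probs = softmax_with_temperature(toy_logits, t)
        print(t, [round(p, 3) for p in probs])
    print(sample_token(toy_logits, temperature=0.7, rng=random.Random(0)))
```

In a generation loop, `sample_token` would be called once per step on the model's output logits, so sweeping the temperature is a cheap way to trade off structural validity against diversity of the generated strings.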
In summary, this study demonstrates the practicality of generative language models as tools for catalyst discovery, particularly for generating new catalysts that meet specific performance requirements. It also identifies limitations of the current method, such as the difficulty of recovering complete crystal information from the generated catalyst structures and the model's limited ability to extend element diversity, both of which point to directions for future research.