Catalyst Property Prediction with CatBERTa: Unveiling Feature Exploration Strategies through Large Language Models

Janghoon Ock, Chakradhar Guntuboina, Amir Barati Farimani

2023-09-02

Abstract: Efficient catalyst screening necessitates predictive models for adsorption energy, a key property of reactivity. However, prevailing methods, notably graph neural networks (GNNs), demand precise atomic coordinates for constructing graph representations, while integrating observable attributes remains challenging. This research introduces CatBERTa, an energy prediction Transformer model using textual inputs. Built on a pretrained Transformer encoder, CatBERTa processes human-interpretable text, incorporating target features. Attention score analysis reveals CatBERTa's focus on tokens related to adsorbates, bulk composition, and their interacting atoms. Moreover, interacting atoms emerge as effective descriptors for adsorption configurations, while factors such as bond length and atomic properties of these atoms offer limited predictive contributions. By predicting adsorption energy from the textual representation of initial structures, CatBERTa achieves a mean absolute error (MAE) of 0.75 eV, comparable to vanilla GNNs. Furthermore, the subtraction of the CatBERTa-predicted energies effectively cancels out their systematic errors by as much as 19.3% for chemically similar systems, surpassing the error reduction observed in GNNs. This outcome highlights its potential to enhance the accuracy of energy difference predictions. This research establishes a fundamental framework for text-based catalyst property prediction, without relying on graph representations, while also unveiling intricate feature-property relationships.
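The error-cancellation claim in the abstract can be illustrated with a small sketch. It assumes two chemically similar systems whose predictions are biased in the same direction; all energy values and function names below are hypothetical and are not taken from the paper, and the paper's own procedure for quantifying the 19.3% figure may differ.

```python
from statistics import mean


def mean_absolute_error(predicted, reference):
    """Mean absolute error between two equal-length sequences (eV)."""
    return mean(abs(p - r) for p, r in zip(predicted, reference))


# Hypothetical adsorption energies (eV) for two chemically similar systems.
reference = [-1.20, -1.05]
# Hypothetical predictions that both err in the same direction.
predicted = [-0.80, -0.70]

# Error of the individual predictions.
individual_mae = mean_absolute_error(predicted, reference)

# Error of the predicted energy difference between the two systems.
predicted_difference = predicted[0] - predicted[1]
reference_difference = reference[0] - reference[1]
difference_error = abs(predicted_difference - reference_difference)

print(f"MAE of individual predictions: {individual_mae:.3f} eV")
print(f"Error of the energy difference: {difference_error:.3f} eV")
```

Because the shared part of the error drops out on subtraction, the difference is predicted more accurately than either energy on its own in this toy case.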

Computational Engineering, Finance, and Science; Chemical Physics

What problem does this paper attempt to address?

This paper addresses the difficulty of predicting adsorption energy during catalyst screening. The authors propose **CatBERTa**, a model that predicts adsorption energy from text-based input and so avoids the dependence on precise atomic coordinates found in methods such as graph neural networks (GNNs).

#### Main problems:

1. **The need for efficient catalyst screening**: Experimental methods and computational methods such as Density Functional Theory (DFT) are accurate but time-consuming and resource-intensive, which makes it difficult to evaluate a large number of catalyst-adsorbate combinations quickly.
2. **Limitations of existing methods**:
   - **Graph Neural Networks (GNNs)**: They require precise atomic coordinates to construct graph representations, and integrating observable properties into them is difficult. It is also hard to explain the influence of specific physical properties with them.
   - **3D structure dependence**: Many existing methods rely on precise 3D structure information, which may not be readily available at the early screening stage.

#### Solutions:

- **CatBERTa model**: A Transformer-based deep-learning model, built on a pretrained Transformer encoder, that predicts adsorption energy from human-interpretable text descriptions. It works from the text representation of the initial structure and does not rely on precise 3D atomic coordinates.
- **Feature exploration strategy**: Attention score analysis shows which input features the model attends to, notably tokens related to adsorbates, bulk composition, and their interacting atoms.
- **Performance**: CatBERTa reaches a mean absolute error (MAE) of 0.75 eV in predicting adsorption energy, comparable to vanilla GNNs. Subtracting CatBERTa-predicted energies cancels systematic errors by as much as 19.3% for chemically similar systems, a larger error reduction than that observed in GNNs.

#### Research contributions:

- **Providing a new framework**: A framework for text-based prediction of catalyst properties that does not rely on graph representations and that also reveals feature-property relationships.
- **Enhancing model interpretability**: Integrating human-interpretable features and analyzing attention scores helps researchers understand which catalyst properties matter.

In conclusion, this paper addresses the reliance of existing catalyst screening methods on precise 3D structure information by introducing CatBERTa, a text-based and more interpretable approach to predicting adsorption energy.
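As a rough illustration of what a human-interpretable text input could look like, the sketch below assembles a string from the three feature groups highlighted above: adsorbate, bulk composition, and interacting atoms. The function name, the field labels, the `" | "` separator, and the example system are all hypothetical; the actual string format used by CatBERTa is not specified here.

```python
def build_text_input(adsorbate, bulk_composition, interacting_atoms):
    """Join catalyst-adsorbate features into one text string.

    interacting_atoms is a list of (adsorbate_atom, [surface_atoms]) pairs.
    The layout is an assumption made for illustration only.
    """
    interactions = "; ".join(
        f"{ads_atom}-[{', '.join(surface_atoms)}]"
        for ads_atom, surface_atoms in interacting_atoms
    )
    return " | ".join(
        [
            f"adsorbate: {adsorbate}",
            f"bulk: {bulk_composition}",
            f"interacting atoms: {interactions}",
        ]
    )


# Hypothetical example: an OH adsorbate bound through O to two Pt atoms.
text_input = build_text_input("OH", "Pt3Ni", [("O", ["Pt", "Pt"])])
print(text_input)
```

A string of this kind would then be tokenized and passed to the Transformer encoder, with a regression head producing the energy; that stage is omitted here because its details are not given in this summary.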
