Catalyst Property Prediction with CatBERTa: Unveiling Feature Exploration Strategies through Large Language Models

Janghoon Ock, Chakradhar Guntuboina, Amir Barati Farimani
2023-09-02
Abstract: Efficient catalyst screening necessitates predictive models for adsorption energy, a key property of reactivity. However, prevailing methods, notably graph neural networks (GNNs), demand precise atomic coordinates for constructing graph representations, while integrating observable attributes remains challenging. This research introduces CatBERTa, an energy prediction Transformer model using textual inputs. Built on a pretrained Transformer encoder, CatBERTa processes human-interpretable text, incorporating target features. Attention score analysis reveals CatBERTa's focus on tokens related to adsorbates, bulk composition, and their interacting atoms. Moreover, interacting atoms emerge as effective descriptors for adsorption configurations, while factors such as bond length and atomic properties of these atoms offer limited predictive contributions. By predicting adsorption energy from the textual representation of initial structures, CatBERTa achieves a mean absolute error (MAE) of 0.75 eV, comparable to vanilla GNNs. Furthermore, the subtraction of the CatBERTa-predicted energies effectively cancels out their systematic errors by as much as 19.3% for chemically similar systems, surpassing the error reduction observed in GNNs. This outcome highlights its potential to enhance the accuracy of energy difference predictions. This research establishes a fundamental framework for text-based catalyst property prediction, without relying on graph representations, while also unveiling intricate feature-property relationships.
Computational Engineering, Finance, and Science; Chemical Physics
### What problem does this paper attempt to address?

This paper addresses the challenge of predicting adsorption energy during catalyst screening. The authors propose **CatBERTa**, a model that predicts adsorption energy from text-based input, avoiding the dependence on precise atomic coordinates found in prevailing methods such as graph neural networks (GNNs).

#### Main problems:

1. **The need for efficient catalyst screening**: Experimental methods and computational methods such as density functional theory (DFT) are accurate but time-consuming and resource-intensive, which makes it difficult to quickly evaluate a large number of catalyst-adsorbate combinations.
2. **Limitations of existing methods**:
   - **Graph neural networks (GNNs)**: They require precise atomic coordinates to construct graph representations, make it difficult to integrate observable properties, and offer limited insight into the influence of specific physical properties.
   - **3D structure dependence**: Many existing methods rely on precise 3D structure information, which may not be readily available in the early screening stage.

#### Solutions:

- **CatBERTa model**: A Transformer-based deep-learning model that predicts adsorption energy from human-interpretable text descriptions. It works from the text representation of the initial structure and does not rely on precise 3D atomic coordinates.
- **Feature exploration strategy**: Attention score analysis shows how strongly the model attends to different input features, especially tokens related to adsorbates, bulk composition, and their interacting atoms.
- **Performance**: CatBERTa achieves accuracy comparable to vanilla GNNs in predicting adsorption energy (mean absolute error, MAE, of 0.75 eV). For chemically similar systems, subtracting its predicted energies cancels up to 19.3% of the systematic error, a larger reduction than that observed in GNNs.

#### Research contributions:

- **Providing a new framework**: Text-based prediction of catalyst properties without relying on graph representations, while also revealing complex feature-property relationships.
- **Enhancing model interpretability**: Integrating human-interpretable features and analyzing attention scores helps researchers better understand key catalyst properties.

In conclusion, this paper addresses the reliance on precise 3D structure information in existing catalyst screening methods by introducing the CatBERTa model, and provides a more efficient and interpretable method for predicting adsorption energy.
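
The error-cancellation point above can be illustrated with a toy calculation. The sketch below is not the paper's evaluation procedure, and every number in it is invented for illustration: it assumes two chemically similar systems whose predictions share a systematic offset plus small system-specific noise, and compares the MAE of the individual energies with the MAE of their difference. The names `mae`, `SYSTEMATIC_OFFSET`, `noise_a`, and `noise_b` are hypothetical.

```python
from statistics import mean


def mae(pred, true):
    """Mean absolute error between two equal-length sequences (eV)."""
    return mean(abs(p - t) for p, t in zip(pred, true))


# Hypothetical reference adsorption energies (eV) for two similar systems.
true_a = [-1.20, -0.80, -0.50, -1.00]
true_b = [-1.00, -0.70, -0.20, -0.90]

# Assumption: both predictions share one systematic offset, plus small noise.
SYSTEMATIC_OFFSET = 0.60
noise_a = [0.05, -0.05, 0.05, -0.05]
noise_b = [-0.05, 0.05, -0.05, 0.05]

pred_a = [t + SYSTEMATIC_OFFSET + n for t, n in zip(true_a, noise_a)]
pred_b = [t + SYSTEMATIC_OFFSET + n for t, n in zip(true_b, noise_b)]

# Energy differences: the shared offset cancels, only the noise remains.
true_diff = [a - b for a, b in zip(true_a, true_b)]
pred_diff = [a - b for a, b in zip(pred_a, pred_b)]

mae_a = mae(pred_a, true_a)
mae_b = mae(pred_b, true_b)
mae_diff = mae(pred_diff, true_diff)

print(f"MAE, system A:   {mae_a:.2f} eV")
print(f"MAE, system B:   {mae_b:.2f} eV")
print(f"MAE, difference: {mae_diff:.2f} eV")
```

In this constructed case each individual MAE is 0.60 eV while the MAE of the difference is 0.10 eV, because the shared offset drops out on subtraction. Real models only partially share their errors across systems, so the cancellation is partial, which is consistent with the paper reporting a reduction of up to 19.3% rather than a complete one.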