Applying Large Graph Neural Networks to Predict Transition Metal Complex Energies Using the tmQM_wB97MV Dataset

Aaron Garrison, Javier Heras-Domingo, John Kitchin, Gabriel Gomes, Zachary Ulissi, Samuel Blau

DOI: https://doi.org/10.26434/chemrxiv-2023-4m3rt-v2

2023-11-09

Abstract: Machine learning (ML) methods have shown promise for discovering novel catalysts, but are often restricted to specific chemical domains. Generalizable ML models require large and diverse training datasets, which exist for heterogeneous catalysis but not for homogeneous catalysis. The tmQM dataset, which contains properties of 86,665 transition metal complexes calculated at the TPSSh/def2-SVP level of density functional theory (DFT), provided a promising training dataset for homogeneous catalyst systems. However, we find that ML models trained on tmQM consistently underpredict the energies of a chemically distinct subset of the data. To address this, we present the tmQM_wB97MV dataset, which filters out several structures in tmQM found to be missing hydrogens and recomputes the energies of all other structures at the wB97M-V/def2-SVPD level of DFT. ML models trained on tmQM_wB97MV show no pattern of consistently incorrect predictions and much lower errors than those trained on tmQM. The ML models tested on tmQM_wB97MV were, from best to worst, GemNet-T > PaiNN ~ SpinConv > SchNet. Performance consistently improves when using only neutral structures instead of the entire dataset. However, while models saturate with only neutral structures, more data continues to improve the models when including charged species, indicating the importance of accurately capturing a range of oxidation states in future data generation and model development. Furthermore, a fine-tuning approach where weights were initialized from models trained on OC20 led to drastic improvements in model performance, indicating transferability between ML strategies of heterogeneous and homogeneous systems.

Chemistry

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper addresses inconsistency and inaccuracy in machine-learned energy predictions for transition metal complexes. Specifically:

1. **Dataset Issues**:
   - **tmQM Dataset**: Although the tmQM dataset contains properties of a large number (86,665) of transition metal complexes, the authors found that machine learning (ML) models consistently underpredict the energies of a chemically distinct subset of the data, even when these data are included in the training set. This suggests that the original data may have consistency issues.
2. **Improved Dataset**:
   - **tmQM_wB97MV Dataset**: To address this issue, the authors filtered out structures in tmQM that were found to be missing hydrogen atoms and recomputed the energies of all remaining structures at a different level of density functional theory (ωB97M-V/def2-SVPD). The new dataset, tmQM_wB97MV, aims to provide more reliable and consistent training targets for energy prediction.
3. **Model Performance Evaluation**:
   - The authors trained and tested several graph neural network (GNN) models (SchNet, SpinConv, PaiNN, and GemNet-T) on the new tmQM_wB97MV dataset. Models trained on the new dataset show no pattern of consistently incorrect predictions and much lower errors than those trained on tmQM.
4. **Transfer Learning**:
   - The authors also explored fine-tuning, initializing weights from models pre-trained on the OC20 dataset. This approach led to substantial improvements in model performance.

### Summary

The main goal of the paper is to address inconsistency and inaccuracy in energy predictions for transition metal complexes by creating a more reliable dataset (tmQM_wB97MV) and validating it by training and testing several GNN models. Ultimately, the authors hope that these improvements can provide more reliable tools for catalyst screening.
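The kind of check summarized above, looking for systematic underprediction within a subset of the data, can be illustrated with a minimal sketch. It is not the authors' code: the field names (`charge`, `target`, `prediction`), the helper functions, and the numeric values are all hypothetical toy assumptions. The idea is that a mean signed error far from zero in one group points to a consistent bias, while the mean absolute error measures overall accuracy.

```python
from statistics import mean


def signed_and_absolute_error(pairs):
    """Return (mean signed error, mean absolute error) for (target, prediction) pairs."""
    diffs = [prediction - target for target, prediction in pairs]
    return mean(diffs), mean([abs(d) for d in diffs])


def errors_by_group(samples, key):
    """Group samples by samples[i][key] and compute both error metrics per group."""
    groups = {}
    for sample in samples:
        groups.setdefault(sample[key], []).append(
            (sample["target"], sample["prediction"])
        )
    return {group: signed_and_absolute_error(pairs) for group, pairs in groups.items()}


# Toy data (hypothetical values, arbitrary energy units), grouped by total charge.
samples = [
    {"charge": 0, "target": -10.0, "prediction": -10.5},
    {"charge": 0, "target": -12.0, "prediction": -11.5},
    {"charge": 1, "target": -8.0, "prediction": -9.0},
    {"charge": 1, "target": -6.0, "prediction": -7.0},
]

report = errors_by_group(samples, "charge")
for group, (signed, absolute) in sorted(report.items()):
    print(f"charge={group}: mean signed error={signed:+.2f}, MAE={absolute:.2f}")
```

In this toy example the neutral group has errors that cancel (signed error of zero), whereas the charged group is underpredicted on every sample, which is the sort of pattern that would prompt a closer look at the underlying data.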

Machine learning models predict calculation outcomes with the transferability necessary for computational catalysis

Transition1x - a dataset for building generalizable reactive machine learning potentials

Adsorption Enthalpies for Catalysis Modeling through Machine-Learned Descriptors

CatTSunami: Accelerating Transition State Energy Calculations with Pre-trained Graph Neural Networks

Machine Learning for the Expedited Screening of Hydrogen Evolution Catalysts for Transition Metal-Doped Transition Metal Dichalcogenides

Machine Learning for Transition-Metal-Based Hydrogen Generation Electrocatalysts

Using machine learning to go beyond potential energy surface benchmarking for chemical reactivity

Improved accuracy and transferability of molecular-orbital-based machine learning: Organics, transition-metal complexes, non-covalent interactions, and transition states

Beyond potential energy surface benchmarking: a complete application of machine learning to chemical reactivity

Prediction of energies for reaction intermediates and transition states on catalyst surfaces using graph-based machine learning models

Machine Learning for Atomic Simulation and Activity Prediction in Heterogeneous Catalysis: Current Status and Future

Leveraging natural language processing to curate the tmCAT, tmPHOTO, tmBIO, and tmSCO datasets of functional transition metal complexes

Hybrid Quantum Neural Network Model with Catalyst Experimental Validation: Application for the Dry Reforming of Methane

A Universal Machine Learning Framework for Electrocatalyst Innovation: A Case Study of Discovering Alloys for Hydrogen Evolution Reaction

Graph neural networks for predicting metal–ligand coordination of transition metal complexes

Fusing machine learning strategy with density functional theory to hasten the discovery of MXenes for hydrogen generation

Explainable Data-driven Modeling of Adsorption Energy in Heterogeneous Catalysis

Data-efficient modeling of catalytic reactions via enhanced sampling and on-the-fly learning of machine learning potentials

Data-Driven Prediction of Configurational Stability of Molecule-Adsorbed Heterogeneous Catalysts