Abstract: Catalyst discovery and optimization is key to solving many societal and energy challenges including solar fuels synthesis, long-term energy storage, and renewable fertilizer production. Despite considerable effort by the catalysis community to apply machine learning models to the computational catalyst discovery process, it remains an open challenge to build models that can generalize across both elemental compositions of surfaces and adsorbate identity/configurations, perhaps because datasets have been smaller in catalysis than in related fields. To address this we developed the OC20 dataset, consisting of 1,281,040 Density Functional Theory (DFT) relaxations (~264,890,000 single point evaluations) across a wide swath of materials, surfaces, and adsorbates (nitrogen, carbon, and oxygen chemistries). We supplemented this dataset with randomly perturbed structures, short timescale molecular dynamics, and electronic structure analyses. The dataset comprises three central tasks indicative of day-to-day catalyst modeling and comes with pre-defined train/validation/test splits to facilitate direct comparisons with future model development efforts. We applied three state-of-the-art graph neural network models (CGCNN, SchNet, DimeNet++) to each of these tasks as baseline demonstrations for the community to build on. In almost every task, no upper limit on model size was identified, suggesting that even larger models are likely to improve on initial results. The dataset and baseline models are both provided as open resources, as well as a public leaderboard to encourage community contributions to solve these important tasks.

What problem does this paper attempt to address?

This paper addresses two obstacles in computational catalyst discovery and optimization: computational cost and model generalization. Catalysts play a crucial role in many societal and energy challenges, such as solar fuel synthesis, long-term energy storage, and renewable fertilizer production. Although the catalysis community has made considerable efforts to apply machine-learning models to computational catalyst discovery, building models that generalize across both the elemental compositions of surfaces and adsorbate identities/configurations remains an open challenge. The authors suggest this may be because datasets in catalysis have been smaller than those in related fields, leaving models without enough data to generalize well. To address this, they developed the Open Catalyst 2020 (OC20) dataset, which contains 1,281,040 density functional theory (DFT) relaxations (approximately 264.89 million single-point evaluations) covering a wide range of materials, surfaces, and adsorbates (nitrogen, carbon, and oxygen chemistries). By providing a large-scale dataset, the authors aim to promote more efficient and more general machine-learning models that better predict catalyst behavior, thereby accelerating the discovery and optimization of new materials. The paper also proposes three related domain challenges as open competitions:

1. **Structure-to-Energy-and-Force (S2EF)**: Given the positions of the atoms, predict the DFT-calculated energy and the force on each atom.
2. **Initial-Structure-to-Relaxed-Structure (IS2RS)**: Given the initial structure, predict the positions of the atoms in their final relaxed state.
3. **Initial-Structure-to-Relaxed-Energy (IS2RE)**: Given the initial structure, predict the energy of the structure in its relaxed state.

These tasks aim to improve the efficiency of inorganic and organic interface simulations, particularly structure relaxation, the basic calculation in such simulations, because computing the energies and forces of structures with DFT is the main computational bottleneck. Through these tasks, the authors hope to promote machine-learning models that not only handle the current dataset but also perform well on larger and more diverse datasets in the future.
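The relationship between the three tasks can be illustrated with a minimal sketch. It assumes that a model answering the S2EF task is used in place of DFT inside a simple steepest-descent relaxation loop, so that the final positions serve as an IS2RS-style prediction and the final energy as an IS2RE-style prediction. The names `toy_s2ef` and `relax`, the harmonic potential, and the step size and force threshold are all hypothetical choices made for illustration; this is not the authors' implementation or any OC20 baseline model.

```python
from typing import Callable, List, Tuple

Vec = Tuple[float, float, float]
Positions = List[Vec]
# An S2EF-style predictor maps atomic positions to (energy, per-atom forces).
S2EFModel = Callable[[Positions], Tuple[float, List[Vec]]]


def toy_s2ef(positions: Positions, k: float = 1.0) -> Tuple[float, List[Vec]]:
    """Hypothetical stand-in for a trained model: each atom sits in a
    harmonic well centred on the origin (E = 0.5 * k * |r|^2, F = -k * r)."""
    energy = 0.0
    forces: List[Vec] = []
    for p in positions:
        energy += 0.5 * k * sum(x * x for x in p)
        forces.append((-k * p[0], -k * p[1], -k * p[2]))
    return energy, forces


def relax(
    positions: Positions,
    model: S2EFModel,
    step: float = 0.1,
    fmax: float = 1e-3,
    max_steps: int = 1000,
) -> Tuple[Positions, float]:
    """Steepest-descent relaxation driven by predicted forces.

    Stops once the largest force component falls below fmax or after
    max_steps iterations, then returns (relaxed positions, relaxed energy).
    """
    pos = [tuple(p) for p in positions]
    for _ in range(max_steps):
        _, forces = model(pos)
        if max(abs(c) for f in forces for c in f) < fmax:
            break
        # Move every atom a small distance along its predicted force.
        pos = [
            tuple(pi + step * fi for pi, fi in zip(p, f))
            for p, f in zip(pos, forces)
        ]
    energy, _ = model(pos)
    return pos, energy


initial = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
relaxed_positions, relaxed_energy = relax(initial, toy_s2ef)
print(relaxed_positions)  # IS2RS-style output: final atomic positions
print(relaxed_energy)     # IS2RE-style output: energy of the relaxed state
```

In this sketch every call to `model` stands in for one single-point evaluation, which is the step the summary identifies as the computational bottleneck when performed with DFT. A model could also attempt IS2RS or IS2RE directly from the initial structure without an explicit loop; the sketch shows only the relaxation-based route.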

The Open Catalyst 2020 (OC20) Dataset and Community Challenges

The Open Catalyst 2022 (OC22) Dataset and Challenges for Oxide Electrocatalysts

Open Catalyst Experiments 2024 (OCx24): Bridging Experiments and Computational Models

Open Challenges in Developing Generalizable Large Scale Machine Learning Models for Catalyst Discovery

Open Challenges in Developing Generalizable Large-Scale Machine-Learning Models for Catalyst Discovery

The Open Catalyst Challenge 2021: Competition Report

An Introduction to Electrocatalyst Design using Machine Learning for Renewable Energy Storage

CatTSunami: Accelerating Transition State Energy Calculations with Pre-trained Graph Neural Networks

Computational catalyst discovery: Active classification through myopic multiscale sampling

Examining Generalizability of AI Models for Catalysis

Catalysis distillation neural network for the few shot open catalyst challenge

Catlas: an automated framework for catalyst discovery demonstrated for direct syngas conversion

Explainable Data-driven Modeling of Adsorption Energy in Heterogeneous Catalysis

The Open MatSci ML Toolkit: A Flexible Framework for Machine Learning in Materials Science

Boosting Heterogeneous Catalyst Discovery by Structurally Constrained Deep Learning Models

Machine-learning-accelerated Discovery of Single-Atom Catalysts Based on Bidirectional Activation Mechanism

Automatic graph representation algorithm for heterogeneous catalysis

PhAST: Physics-Aware, Scalable, and Task-specific GNNs for Accelerated Catalyst Design

Lightweight Geometric Deep Learning for Molecular Modelling in Catalyst Discovery

Digitization in Catalysis Research: Towards a Holistic Description of a Ni/Al2O3 Reference Catalyst for CO2 Methanation

Rational design of nanoscale stabilized oxide catalysts for OER with OC22