The Open Catalyst 2020 (OC20) Dataset and Community Challenges

Lowik Chanussot, Abhishek Das, Siddharth Goyal, Thibaut Lavril, Muhammed Shuaibi, Morgane Riviere, Kevin Tran, Javier Heras-Domingo, Caleb Ho, Weihua Hu, Aini Palizhati, Anuroop Sriram, Brandon Wood, Junwoong Yoon, Devi Parikh, C. Lawrence Zitnick, Zachary Ulissi
DOI: https://doi.org/10.1021/acscatal.0c04525
2021-09-24
Abstract: Catalyst discovery and optimization is key to solving many societal and energy challenges including solar fuels synthesis, long-term energy storage, and renewable fertilizer production. Despite considerable effort by the catalysis community to apply machine learning models to the computational catalyst discovery process, it remains an open challenge to build models that can generalize across both elemental compositions of surfaces and adsorbate identity/configurations, perhaps because datasets have been smaller in catalysis than related fields. To address this we developed the OC20 dataset, consisting of 1,281,040 Density Functional Theory (DFT) relaxations (~264,890,000 single point evaluations) across a wide swath of materials, surfaces, and adsorbates (nitrogen, carbon, and oxygen chemistries). We supplemented this dataset with randomly perturbed structures, short timescale molecular dynamics, and electronic structure analyses. The dataset comprises three central tasks indicative of day-to-day catalyst modeling and comes with pre-defined train/validation/test splits to facilitate direct comparisons with future model development efforts. We applied three state-of-the-art graph neural network models (CGCNN, SchNet, Dimenet++) to each of these tasks as baseline demonstrations for the community to build on. In almost every task, no upper limit on model size was identified, suggesting that even larger models are likely to improve on initial results. The dataset and baseline models are both provided as open resources, as well as a public leader board to encourage community contributions to solve these important tasks.
Materials Science, Machine Learning
What problem does this paper attempt to address?
This paper addresses the computational cost and limited generalization of machine-learning models used in catalyst discovery and optimization. Catalysts are central to many societal and energy challenges, such as solar fuel synthesis, long-term energy storage, and renewable fertilizer production. Although the catalysis community has invested considerable effort in applying machine-learning models to computational catalyst discovery, building models that generalize across both the elemental composition of surfaces and adsorbate identities/configurations remains an open challenge. The authors suggest this may be because datasets in catalysis have been smaller than those in related fields. To address this, they developed the Open Catalyst 2020 (OC20) dataset, which contains 1,281,040 density functional theory (DFT) relaxations (approximately 264.89 million single-point evaluations) covering a wide range of materials, surfaces, and adsorbates (nitrogen, carbon, and oxygen chemistries). By providing data at this scale, the authors aim to enable more efficient and more general machine-learning models of catalyst behavior, and thereby accelerate the discovery and optimization of new catalysts.
The paper also poses three related tasks as open community challenges:
1. **Structure to Energy and Forces (S2EF)**: Given the positions of the atoms, predict the DFT-calculated energy of the structure and the force on each atom.
2. **Initial Structure to Relaxed Structure (IS2RS)**: Given an initial structure, predict the positions of the atoms in the final relaxed state.
3. **Initial Structure to Relaxed Energy (IS2RE)**: Given an initial structure, predict the energy of the structure in its relaxed state.
These tasks aim to make simulations of inorganic and organic interfaces more efficient, in particular structure relaxation, the basic calculation underlying them. Relaxation is expensive because every step requires the energy and forces of the current structure, and computing these with DFT is the main computational bottleneck. The authors hope the tasks will drive the development of machine-learning models that perform well not only on the current dataset but also on larger and more diverse datasets in the future.
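How the three tasks relate to one another can be illustrated with a minimal Python sketch. It is not the OC20 code or any baseline model: a toy Lennard-Jones pair potential stands in for DFT, plain steepest descent stands in for a production optimizer, and the names `s2ef`, `relax`, `step`, `fmax`, and `max_calls` are hypothetical. The point is only that one S2EF evaluation is needed per relaxation step, and that the end of the relaxation yields the IS2RS and IS2RE targets.

```python
import math


def s2ef(positions, epsilon=1.0, sigma=1.0):
    """S2EF-style call: positions -> (energy, per-atom forces).

    Assumption: a toy Lennard-Jones pair potential replaces the DFT calculation.
    """
    n = len(positions)
    energy = 0.0
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = [positions[i][k] - positions[j][k] for k in range(3)]
            r = math.sqrt(sum(c * c for c in d))
            sr6 = (sigma / r) ** 6
            energy += 4.0 * epsilon * (sr6 * sr6 - sr6)
            # f_mag = -dE/dr; positive means the pair repels.
            f_mag = 24.0 * epsilon * (2.0 * sr6 * sr6 - sr6) / r
            for k in range(3):
                forces[i][k] += f_mag * d[k] / r
                forces[j][k] -= f_mag * d[k] / r
    return energy, forces


def max_force(forces):
    """Largest per-atom force magnitude, used as the convergence check."""
    return max(math.sqrt(sum(c * c for c in f)) for f in forces)


def relax(initial_positions, step=0.01, fmax=1e-4, max_calls=1000):
    """Steepest-descent relaxation driven by repeated S2EF calls.

    Returns (relaxed positions, relaxed energy, number of S2EF calls).
    The first two correspond to the IS2RS and IS2RE targets.
    """
    positions = [list(p) for p in initial_positions]  # do not mutate the input
    energy, forces = s2ef(positions)
    n_calls = 1
    while max_force(forces) >= fmax and n_calls < max_calls:
        for p, f in zip(positions, forces):
            for k in range(3):
                p[k] += step * f[k]  # move each atom along its force
        energy, forces = s2ef(positions)
        n_calls += 1
    return positions, energy, n_calls


# Usage: relax a stretched two-atom toy system.
initial = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]]
relaxed_positions, relaxed_energy, n_calls = relax(initial)
bond = relaxed_positions[1][0] - relaxed_positions[0][0]
print("relaxed bond length:", bond)
print("relaxed energy:", relaxed_energy)
print("S2EF calls used:", n_calls)
```

In this framing, a fast and accurate S2EF model replaces the expensive call inside the loop, while IS2RS and IS2RE models try to predict the loop's final result directly from the initial structure.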