Neural networks trained on synthetically generated crystals can extract structural information from ICSD powder X-ray diffractograms

Henrik Schopmans, Patrick Reiser, Pascal Friederich
DOI: https://doi.org/10.1039/D3DD00071K
2023-09-19
Abstract: Machine learning techniques have successfully been used to extract structural information such as the crystal space group from powder X-ray diffractograms. However, training directly on simulated diffractograms from databases such as the ICSD is challenging due to its limited size, class-inhomogeneity, and bias toward certain structure types. We propose an alternative approach of generating synthetic crystals with random coordinates by using the symmetry operations of each space group. Based on this approach, we demonstrate online training of deep ResNet-like models on up to a few million unique on-the-fly generated synthetic diffractograms per hour. For our chosen task of space group classification, we achieved a test accuracy of 79.9% on unseen ICSD structure types from most space groups. This surpasses the 56.1% accuracy of the current state-of-the-art approach of training on ICSD crystals directly. Our results demonstrate that synthetically generated crystals can be used to extract structural information from ICSD powder diffractograms, which makes it possible to apply very large state-of-the-art machine learning models in the area of powder X-ray diffraction. We further show first steps toward applying our methodology to experimental data, where automated XRD data analysis is crucial, especially in high-throughput settings. While we focused on the prediction of the space group, our approach has the potential to be extended to related tasks in the future.
Materials Science, Machine Learning
What problem does this paper attempt to address?
The paper addresses two main issues:

1. **Flawed dataset partitioning**: Randomly splitting the Inorganic Crystal Structure Database (ICSD) into training and test sets lets crystals of the same structure type appear in both sets, so test accuracy overstates how well a model generalizes. The paper instead splits the data by structure type, so that no structure type occurs in both the training and the test set, and evaluates on unseen structure types.
2. **Synthetic crystals for structural information extraction**: The ICSD is limited in size, inhomogeneous across classes, and biased toward certain structure types, which makes direct training on it difficult. The paper generates synthetic crystals with random coordinates by applying the symmetry operations of each space group, then simulates their powder X-ray diffractograms. Deep ResNet-like models are trained online on these diffractograms, which are generated on the fly at up to a few million unique patterns per hour. For space group classification, this yields 79.9% test accuracy on unseen ICSD structure types, compared with 56.1% when training on ICSD crystals directly.

In summary, the paper aims to improve both the performance and the measured generalization of machine learning models that extract structural information from powder X-ray diffraction data, by splitting datasets by structure type and by training on synthetic crystals. The authors argue that this makes it possible to apply very large machine learning models to powder X-ray diffraction analysis, and they show first steps toward experimental data, where automated analysis matters most in high-throughput settings.
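
The idea of splitting by structure type rather than by individual crystal can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function name `split_by_structure_type`, the `entries` list, its IDs, and the `test_fraction` parameter are all hypothetical, and how structure types are assigned to ICSD entries is not covered here.

```python
import random
from collections import defaultdict


def split_by_structure_type(entries, test_fraction=0.25, seed=0):
    """Split (crystal_id, structure_type) pairs so that no structure type
    appears in both the training and the test set.

    Note: test_fraction is a fraction of structure types, not of crystals
    (an assumption made for this sketch).
    """
    # Group crystal IDs by their structure type.
    groups = defaultdict(list)
    for crystal_id, structure_type in entries:
        groups[structure_type].append(crystal_id)

    # Shuffle the structure types reproducibly and hold some out entirely.
    types = sorted(groups)
    random.Random(seed).shuffle(types)
    n_test = max(1, round(test_fraction * len(types)))
    test_types = set(types[:n_test])

    train_ids = [c for t in types if t not in test_types for c in groups[t]]
    test_ids = [c for t in types if t in test_types for c in groups[t]]
    return train_ids, test_ids, test_types


# Hypothetical toy data: (crystal ID, structure type label).
entries = [
    ("c1", "rock salt"), ("c2", "rock salt"), ("c3", "rock salt"),
    ("c4", "perovskite"), ("c5", "perovskite"),
    ("c6", "spinel"), ("c7", "spinel"),
    ("c8", "fluorite"),
]
type_of = dict(entries)

train_ids, test_ids, test_types = split_by_structure_type(entries)
train_types = {type_of[c] for c in train_ids}
print("held-out structure types:", sorted(test_types))
print("train:", train_ids)
print("test:", test_ids)
```

With a plain random split over crystals, `c1` could land in the training set and `c2` in the test set even though both share a structure type. Grouping first rules that out by construction.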