SimXRD-4M: Big Simulated X-ray Diffraction Data Accelerate the Crystalline Symmetry Classification

Bin Cao, Yang Liu, Zinan Zheng, Ruifeng Tan, Jia Li, Tong-yi Zhang
2024-06-15
Abstract: Spectroscopic data, particularly diffraction data, contain detailed crystal and microstructure information and thus are crucial for materials discovery. Powder X-ray diffraction (XRD) patterns are greatly effective in identifying crystals. Although machine learning (ML) has significantly advanced the analysis of powder XRD patterns, the progress is hindered by a lack of training data. To address this, we introduce SimXRD, the largest open-source simulated XRD pattern dataset so far, to accelerate the development of crystallographic informatics. SimXRD comprises 4,065,346 simulated powder X-ray diffraction patterns, representing 119,569 distinct crystal structures under 33 simulated conditions that mimic real-world variations. We find that the crystal symmetry inherently follows a long-tailed distribution and evaluate 21 sequence learning models on SimXRD. The results indicate that existing neural networks struggle with low-frequency crystal classifications. The present work highlights the academic significance and the engineering novelty of simulated XRD patterns in this interdisciplinary field.
Materials Science
What problem does this paper attempt to address?
The paper addresses crystal symmetry recognition in materials science, specifically the classification of crystal structures from powder X-ray diffraction (XRD) patterns. It targets two main problems:

1. **Lack of large-scale, high-quality datasets**: Existing studies are often limited to small datasets of specific materials, which restricts how well models generalize. In addition, variations in experimental conditions can lead to similar peak distributions in XRD patterns, which makes structural analysis harder.
2. **Insufficient comparison of sequence models**: Convolutional neural networks (CNNs) have made significant progress in XRD pattern classification, but comparative studies of recurrent neural networks (RNNs), long short-term memory networks (LSTMs), gated recurrent units (GRUs), and transformers on these tasks are lacking.

To address these issues, the authors propose SimXRD, the largest open-source simulated XRD pattern dataset to date. It contains 4,065,346 XRD patterns covering 119,569 different crystal structures, generated under 33 simulated conditions that mimic real-world variations. The SimXRD dataset has the following features:

- **Scale and quality**: SimXRD covers a large number of crystal structures, and crystal quality is ensured by screening entries from the Materials Project.
- **Environmental diversity**: The dataset accounts for real-world factors such as grain size, orientation, and internal stress, so that XRD patterns are simulated under different conditions.
- **Openness and ease of use**: SimXRD is fully open-access and easy to integrate into common machine learning frameworks such as TensorFlow or PyTorch.
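To make the "environmental diversity" point concrete, the sketch below shows how one such factor, grain size, changes a simulated pattern. It uses the Scherrer equation, β = Kλ / (L cos θ), to set the peak width and sums Gaussian peaks on a 2θ grid. This is an illustration only, not SimXRD's simulation pipeline: the function names, the Gaussian peak shape, the Cu Kα wavelength, the shape factor K = 0.9, and the peak list are all assumptions made for this example.

```python
import math

def scherrer_fwhm_deg(grain_size_nm, two_theta_deg, wavelength_nm=0.15406, k=0.9):
    """Peak FWHM in degrees 2-theta from the Scherrer equation.

    wavelength_nm defaults to Cu K-alpha; k is the dimensionless shape factor.
    """
    theta = math.radians(two_theta_deg / 2.0)
    beta_rad = k * wavelength_nm / (grain_size_nm * math.cos(theta))
    return math.degrees(beta_rad)

def simulate_pattern(peaks, grain_size_nm, start=10.0, stop=80.0, step=0.02):
    """Sum area-normalized Gaussian peaks on a uniform 2-theta grid.

    peaks is a list of (position_deg, integrated_intensity) pairs.
    """
    n = int(round((stop - start) / step)) + 1
    grid = [start + i * step for i in range(n)]
    pattern = [0.0] * n
    for pos, area in peaks:
        fwhm = scherrer_fwhm_deg(grain_size_nm, pos)
        sigma = fwhm / (2.0 * math.sqrt(2.0 * math.log(2.0)))
        height = area / (sigma * math.sqrt(2.0 * math.pi))
        for i, x in enumerate(grid):
            pattern[i] += height * math.exp(-0.5 * ((x - pos) / sigma) ** 2)
    return grid, pattern

# Hypothetical peak list (position in degrees 2-theta, integrated intensity).
example_peaks = [(28.4, 100.0), (47.3, 55.0), (56.1, 30.0)]

# Same structure, two grain sizes: the peaks stay in place but change shape.
_, coarse = simulate_pattern(example_peaks, grain_size_nm=100.0)
_, fine = simulate_pattern(example_peaks, grain_size_nm=10.0)
print(f"max intensity, 100 nm grains: {max(coarse):.1f}")
print(f"max intensity,  10 nm grains: {max(fine):.1f}")
```

Smaller grains give broader, lower peaks at the same positions, so a classifier trained only on ideal patterns sees a visibly different input for the same crystal. A realistic simulator would also model orientation, stress, instrument profile, and noise, which this sketch omits.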
The paper also evaluates 21 sequence learning models (including CNNs, RNNs, LSTMs, GRUs, and transformers) on the SimXRD dataset. The results indicate that existing models struggle with low-frequency crystal classes, particularly in space group classification. The authors emphasize the importance of handling long-tailed distributions and point out that the PatchTST model shows good potential in this regard. In summary, SimXRD and the accompanying benchmark provide a new resource for crystal symmetry recognition in materials science.
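The long-tail issue described above can be illustrated with a toy label set. The sketch below compares plain accuracy with macro-averaged recall and computes inverse-frequency class weights, one common way to rebalance a loss. It does not reproduce the paper's evaluation protocol: the helper names, the label counts, and the always-predict-the-majority baseline are assumptions chosen for illustration.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_recall(y_true, y_pred):
    """Mean of per-class recall; each class counts equally, however rare."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / support[c] for c in support) / len(support)

def inverse_frequency_weights(y_true):
    """Class weights proportional to 1 / class count.

    Scaled so that the weights average to 1 over all samples.
    """
    support = Counter(y_true)
    n, k = len(y_true), len(support)
    return {c: n / (k * count) for c, count in support.items()}

# Toy long-tailed labels (space group numbers used only as class names).
y_true = [225] * 90 + [62] * 8 + [2] * 2
# A degenerate baseline that always predicts the most frequent class.
y_pred = [225] * len(y_true)

print(f"accuracy:     {accuracy(y_true, y_pred):.3f}")
print(f"macro recall: {macro_recall(y_true, y_pred):.3f}")
print("class weights:", inverse_frequency_weights(y_true))
```

On this toy data the baseline reaches 90% accuracy while recalling only one of three classes, which is why frequency-aware metrics and weighting matter when rare symmetry classes are the target.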