New Generalized Crystallographic Descriptors for Structural Machine Learning

R. Zhang, S. Seth, J. Cumby
DOI: https://doi.org/10.1107/s0108767321092102
2021-01-01
Abstract: The ever-growing amount of crystallographic data offers the potential to uncover a range of scientific discoveries, from rapidly predicting physical properties to suggesting new materials with desirable functional behaviours. This potential is further enhanced by the current growth in machine learning (ML) algorithm development and implementation. There is, however, a significant obstacle to this goal: standard crystallographic information is not a suitable input for ML algorithms. This arises from the inherent flexibility of crystallography, such as non-unique unit cell definitions and symmetry. To overcome this problem, significant progress has been made in devising ‘descriptors’ for crystallographic ML, which compress and standardise crystallographic information into a smaller feature space. Much of the existing focus has been on molecular crystals, where the finite extent of individual molecules imposes a limit on the size of the feature vector required. A large number of approaches have been proposed, but they do not easily extrapolate to extended (i.e. inorganic) materials [1]. The descriptors that are suitable for extended solids tend to be either hand-crafted for a specific problem, or have so many dimensions that extremely large datasets must be used to train reliable ML models. In addition, many do not scale well with variable numbers of atomic species. Here, we present two new descriptors for crystalline materials which are generally applicable and invariant to compositional complexity. The first is based on a real-space view of the structure, the second on a reciprocal (or diffraction) space view. Both descriptors are invariant to atomic permutations and unit cell choice, and can be considered ‘extended’ (i.e. more information-rich) versions of the atomic radial distribution function (RDF) and the powder diffraction pattern, respectively. The more complete features offered by these descriptors result in better physical property predictions.
For example, our ‘extended’ RDF can predict bulk modulus from crystal structures
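
The invariance properties mentioned in the abstract can be illustrated with an ordinary (not ‘extended’) distance histogram. The following is a minimal sketch and not the authors' descriptor: the function name `pair_histogram`, its parameters, and the simple-cubic test structure are all assumptions made for illustration. It bins interatomic distances in a periodic crystal, normalises per atom, and checks that a primitive cell, a doubled supercell, and a reordered atom list all yield the same feature vector.

```python
import itertools
import numpy as np


def pair_histogram(lattice, frac_coords, r_max=5.5, n_bins=50):
    """Per-atom histogram of interatomic distances below r_max.

    Hypothetical helper for illustration: a plain, unweighted RDF-like
    feature, not the 'extended' RDF described in the abstract.
    lattice: 3x3 array, rows are cell vectors; frac_coords: (N, 3) array.
    """
    lattice = np.asarray(lattice, dtype=float)
    frac = np.asarray(frac_coords, dtype=float) % 1.0  # wrap into the cell
    cart = frac @ lattice
    n_atoms = len(cart)

    # Perpendicular height of the cell along each axis decides how many
    # periodic images are needed to cover a sphere of radius r_max.
    volume = abs(np.linalg.det(lattice))
    reps = []
    for i in range(3):
        face = np.cross(lattice[(i + 1) % 3], lattice[(i + 2) % 3])
        reps.append(int(np.ceil(r_max * np.linalg.norm(face) / volume)))

    images = itertools.product(*[range(-n, n + 1) for n in reps])
    shifts = np.array(list(images), dtype=float) @ lattice

    # diffs[s, j, i] = (atom j translated by image s) - (atom i in home cell)
    diffs = cart[None, :, None, :] + shifts[:, None, None, :] - cart[None, None, :, :]
    dists = np.linalg.norm(diffs, axis=-1).ravel()
    dists = dists[(dists > 1e-8) & (dists < r_max)]  # drop self-pairs

    counts, edges = np.histogram(dists, bins=n_bins, range=(0.0, r_max))
    return counts / n_atoms, edges


# Simple cubic test structure (a = 3.0, arbitrary length units).
primitive = pair_histogram(np.eye(3) * 3.0, [[0.0, 0.0, 0.0]])[0]

# The same crystal described with a cell doubled along the first axis.
supercell_lattice = np.diag([6.0, 3.0, 3.0])
supercell_atoms = [[0.0, 0.0, 0.0], [0.5, 0.0, 0.0]]
supercell = pair_histogram(supercell_lattice, supercell_atoms)[0]

# The same supercell with the atom list reversed.
permuted = pair_histogram(supercell_lattice, supercell_atoms[::-1])[0]
```

Because the counts are divided by the number of atoms in the cell and every atom is treated identically, the three arrays agree. A plain histogram like this discards element identity and angular information, which is the kind of limitation a more information-rich descriptor is meant to address.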