DScribe: Library of descriptors for machine learning in materials science

Lauri Himanen, Marc O.J. Jäger, Eiaki V. Morooka, Filippo Federici Canova, Yashasvi S. Ranawat, David Z. Gao, Patrick Rinke, Adam S. Foster
DOI: https://doi.org/10.1016/j.cpc.2019.106949
IF: 4.717
2020-02-01
Computer Physics Communications
Abstract: DScribe is a software package for machine learning that provides popular feature transformations ("descriptors") for atomistic materials simulations. DScribe accelerates the application of machine learning for atomistic property prediction by providing user-friendly, off-the-shelf descriptor implementations. The package currently contains implementations for Coulomb matrix, Ewald sum matrix, sine matrix, Many-body Tensor Representation (MBTR), Atom-centered Symmetry Functions (ACSF) and Smooth Overlap of Atomic Positions (SOAP). Usage of the package is illustrated for two different applications: formation energy prediction for solids and ionic charge prediction for atoms in organic molecules. The package is freely available under the open-source Apache License 2.0.

Program summary
Program Title: DScribe
Program Files doi: http://dx.doi.org/10.17632/vzrs8n8pk6.1
Licensing provisions: Apache-2.0
Programming language: Python/C/C++
Supplementary material: Supplementary Information as PDF
Nature of problem: The application of machine learning for materials science is hindered by the lack of consistent software implementations for feature transformations. These feature transformations, also called descriptors, are a key step in building machine learning models for property prediction in materials science.
Solution method: We have developed a library for creating common descriptors used in machine learning applied to materials science. We provide implementations of the following descriptors: Coulomb matrix, Ewald sum matrix, sine matrix, Many-body Tensor Representation (MBTR), Atom-centered Symmetry Functions (ACSF) and Smooth Overlap of Atomic Positions (SOAP). The library has a Python interface with computationally intensive routines written in C or C++. The source code, tutorials and documentation are provided online. A continuous integration mechanism is set up to automatically run a series of regression tests and check code coverage when the codebase is updated.
Categories: Physics, Mathematical; Computer Science, Interdisciplinary Applications
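
To make the idea of a descriptor concrete, the following is a minimal sketch of the simplest descriptor named in the abstract, the Coulomb matrix, written in plain NumPy. It is not DScribe's implementation and does not use the DScribe API. The function name `coulomb_matrix`, the example geometry, and the choice of atomic units are assumptions made for illustration. The sketch uses the standard definition: diagonal elements 0.5 Z_i^2.4 and off-diagonal elements Z_i Z_j / |R_i - R_j|.

```python
import numpy as np


def coulomb_matrix(charges, positions):
    """Return the Coulomb matrix for one molecule.

    charges:   sequence of N atomic numbers Z_i
    positions: N x 3 Cartesian coordinates (assumed here to be in bohr)
    """
    Z = np.asarray(charges, dtype=float)
    R = np.asarray(positions, dtype=float)

    # Pairwise interatomic distances |R_i - R_j|, shape (N, N).
    diff = R[:, None, :] - R[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)

    # The diagonal distances are zero; set them to 1 to avoid dividing
    # by zero. The diagonal is overwritten below anyway.
    np.fill_diagonal(dist, 1.0)

    # Off-diagonal elements: Z_i * Z_j / |R_i - R_j|.
    M = np.outer(Z, Z) / dist

    # Diagonal elements: 0.5 * Z_i ** 2.4.
    np.fill_diagonal(M, 0.5 * Z ** 2.4)
    return M


# Illustrative input: two hydrogen atoms 1.4 bohr apart (approximate
# H2 bond length, chosen only as an example).
charges = [1, 1]
positions = [[0.0, 0.0, 0.0], [0.0, 0.0, 1.4]]
cm = coulomb_matrix(charges, positions)
print(cm)
```

The result is a symmetric N x N matrix that depends only on atomic numbers and interatomic distances, so it is unchanged by translating or rotating the molecule. It does change if the atoms are listed in a different order, which is why Coulomb matrices are commonly sorted or otherwise post-processed before being passed to a learning algorithm.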