MADAS -- A Python framework for assessing similarity in materials-science data

Martin Kuban, Santiago Rigamonti, Claudia Draxl
2024-03-16
Abstract: Computational materials science produces large quantities of data, both in terms of high-throughput calculations and individual studies. Extracting knowledge from this large and heterogeneous pool of data is challenging due to the wide variety of computational methods and approximations, resulting in significant veracity in the sheer amount of available data. Here, we present MADAS, a Python framework for computing similarity relations between material properties. It can be used to automate the download of data from various sources, compute descriptors and similarities between materials, analyze the relationship between materials through their properties, and can incorporate a variety of existing machine learning methods. We explain the design of the package and demonstrate its power with representative examples.
Materials Science, Computational Physics
What problem does this paper attempt to address?
The paper introduces a Python framework called MADAS (Materials Data Analysis System) for assessing similarity in materials-science data. Computational materials science generates a large amount of heterogeneous data, and extracting knowledge from it is difficult because different computational methods and approximations introduce variability. MADAS provides an automated way to download data, compute material descriptors and similarities, analyze relationships between materials, and integrate multiple machine learning methods.
The paper points out that, although large public databases exist for storing and retrieving data, high-throughput results from different databases may not be directly comparable because they rely on different computational methods and approximations. By defining appropriate similarity metrics, MADAS helps identify trends and anomalies, create content maps of large materials databases, and quantify the uncertainty of the data.
The MADAS framework consists of several software components: tools for collecting data from different sources, local data storage, computation of material fingerprints (descriptors and similarity metrics), and data analysis. It is designed to be modular and scalable, with a simple interface that accommodates a variety of similarity analysis tasks. In addition, MADAS provides standardized interfaces to external data sources to facilitate data downloading, and it ensures data consistency through a common data model and material classes.
The authors demonstrate MADAS by comparing differences in the lattice volumes of NaCl structures across different materials databases, and with examples of material similarity analysis and clustering.
In this way, MADAS helps improve the efficiency and accuracy of material data analysis, supports high-precision predictions using machine learning models, and promotes data reproducibility.
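To make the idea of "fingerprints plus a similarity metric" concrete, here is a minimal sketch in plain Python. It does not use the MADAS API: the function names `tanimoto`, `similarity_matrix`, and `most_similar`, the choice of the Tanimoto coefficient as the metric, and the toy fingerprint vectors are all assumptions made for illustration only.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto coefficient between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    # Two all-zero vectors are treated as identical by convention.
    return 1.0 if denom == 0 else dot / denom


def similarity_matrix(fingerprints):
    """Symmetric matrix of pairwise similarities (diagonal is 1.0)."""
    n = len(fingerprints)
    matrix = [[1.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        matrix[i][j] = matrix[j][i] = tanimoto(fingerprints[i], fingerprints[j])
    return matrix


def most_similar(matrix, labels, query):
    """Label of the entry most similar to `query`, excluding itself."""
    i = labels.index(query)
    candidates = [(matrix[i][j], labels[j]) for j in range(len(labels)) if j != i]
    return max(candidates)[1]


# Hypothetical fingerprints: made-up numbers, not real materials data.
labels = ["material_A", "material_B", "material_C"]
fingerprints = [
    [1.0, 0.0, 2.0],
    [1.0, 0.0, 2.0],
    [0.0, 3.0, 0.0],
]

S = similarity_matrix(fingerprints)
print(most_similar(S, labels, "material_A"))
```

A matrix like `S` is the kind of object that a clustering or outlier-detection step could consume; which descriptors and metrics are appropriate depends on the material property being compared.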